@elinw Thanks very much for this submission. I will assign an editor shortly and we will then work through the editor checklist before we assign reviewers. We normally strive to get the process started within 5 business days, but we are approaching winter break (I'll be offline all of next week). So on the chance that we are not able to assign an editor/reviewers in the next two days, you'll see an update to this thread in the first week of January.
@elinw thanks for your submission & sorry about the delay! I see development is still active; do you consider skimr stable enough for review? It'd be best not to change too many things while the reviewers are looking.
The tests now pass on my laptop using the latest testthat version, thanks @jimhester for the help! He suggested adding `testthat (>= 2.0.0)` in DESCRIPTION.
Yes, as of now we have fixed the Windows issue in master. I'm about to push one more change and then I'm going to tag 1.0.1 sometime today. I'll definitely update the testthat minimum first.
And you can confirm that 1.0.1 will be good for review? I even have two reviewers lined up once you confirm that.
Yes it will be.
Thanks again for your submission, if I understand correctly the version for review will be ready today?
Add the testthat minimum version in DESCRIPTION.
You don't need to fill the Maintainer field in DESCRIPTION; it'll get filled from `Authors@R`.
goodpractice
output for you and the reviewers to have in mind
```
It is good practice to

  ✖ write short and simple functions. These functions have high cyclomatic
    complexity: kable.data.frame (60).
  ✖ avoid long code lines, it is bad for readability. Also, many people prefer
    editor windows that are about 80 characters wide. Try make your lines
    shorter than 80 characters

    R\skim.R:14:1
    R\skim.R:73:1
    R\stats.R:28:1
    R\stats.R:52:1
    R\stats.R:64:1
    ... and 20 more lines
```
Two typos identified via `devtools::spell_check()`:

```
differnt    skimr-package.Rd:16
faciliates  skim_to_wide.Rd:19
```
Reviewers: @jimhester @jenniferthompson Due date: 2018-01-30
@jimhester and @jenniferthompson thanks for agreeing to review this package! Your reviews are due on 2018-01-30.
You'll find here the review template and our reviewing guide.
Thanks for the quick checks. I have done the easy ones, but there is one issue that we can't fix and two I'd like feedback on.
The complex function one is a function that is brought in from knitr, so we can't do anything about that. In our code it is just

```r
kable.data.frame <- knitr::kable
```
In terms of long lines, the ones that remain are in tests of the print function, so they are the full printed line. If we don't do that we get the tibble wrap, which defeats the idea of testing the whole line. On the other hand, if we want, we could split those strings into two parts and then `paste0()` them, which would keep the lines under 80 but still give us a correct version that is the full line. Does it make sense to do that? Or does that actually make the tests themselves harder to read?
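A minimal sketch of that split-and-paste idea (the expected string below is illustrative, not skimr's actual print output, and `printed_lines` is a hypothetical name):

```r
# Sketch: split an over-80-character expected line into two shorter string
# literals and rejoin them with paste0(). The source file stays under 80
# columns, but the comparison still covers the full printed line.
expected <- paste0(
  " Sepal.Length       0      150 150 5.84 0.83 4.3  5.1    5.8  6.4",
  " 7.9  <histogram>"
)
# In the test itself you would then compare against the captured output, e.g.
# expect_identical(printed_lines[3], expected)
nchar(expected)  # the joined line is longer than 80 characters
```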
We have some things that are only run for windows_os, and those are the lines showing up without code coverage in goodpractice. We've excluded them from testing on Travis since the code will never be hit on Linux or Mac. Is it possible to write conditional exclusions so that the tests would run when built on Windows?
You can run covr on appveyor as well, and the coverage reports will be automatically merged by codecov.
Cool, just add the same line to appveyor.yml?

```yaml
- Rscript -e 'covr::codecov()'
```
AppVeyor needs to use double quotes, and you can add it to `on_success`:

```yaml
on_success:
  - Rscript -e "covr::codecov()"
```
And I think you may need to add your codecov token to the AppVeyor settings. You can get the token at https://codecov.io/gh/ropenscilabs/skimr/settings and add it as `CODECOV_TOKEN` at https://ci.appveyor.com/project/ropenscilabs/skimr/settings/environment
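Putting those pieces together, the relevant appveyor.yml section might look roughly like this (a sketch; the encrypted token value is a placeholder, and the token can instead be set through the AppVeyor settings UI as described above):

```yaml
environment:
  # Placeholder: paste the encrypted form of your codecov token here,
  # or set CODECOV_TOKEN via the AppVeyor project settings UI instead.
  CODECOV_TOKEN:
    secure: <encrypted-token>

on_success:
  - Rscript -e "covr::codecov()"
```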
Thanks Jim! It seems to be working after https://github.com/ropenscilabs/skimr/pull/243/commits/c34cec75f8bad36669ffe63c69296fb9c1710089
Release 1.0.1 should be on CRAN at this point. @michaelquinn32 is doing some refactoring that should be invisible to end users, and we have plans for updating some tests and documentation (and some feature requests), but other than that things are stable.
@jimhester and @jenniferthompson friendly reminder that your reviews are due on Tuesday next week
@michaelquinn32 @elinw FYI, I'm getting a "Project not found or access denied" error when I try to look at the Appveyor page via clicking the README badge. I'm not experienced in CI, so I'm not sure if that should have been addressed with the fix Jim mentioned a couple of weeks ago.
Also, if I've caught some typos in the vignettes/examples, would y'all like a PR or no?
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation: URL, BugReports, and Maintainer (which may be autogenerated via `Authors@R`).

All review comments are based off of version 1.0.1. Overall I feel like the package would be a welcome addition to rOpenSci after the review process is complete.
- `## Dev mode: on` at the top. `devtools::install_github("ropenscilabs/skimr@develop")` (not sure it is necessary to mention the develop branch at all here)
- `usethis::use_readme_rmd()` is a good way to set this up.
- `kable()` and `pander()`, and possibly include examples of what 'enhanced' options are available and what they look like.

Displaying the results of `skim()`
as seemingly a table, but having the
underlying data as a table in a different format is quite surprising. I pretty
strongly think that you should have two functions, one which returns the
summary information in a wide format, the other in the long format. e.g. From
the print output I would expect to be able to run the following to retrieve
variables in the dataset with standard deviations greater than 1
```r
skim(iris, Sepal.Length, Petal.Length) %>%
  filter(sd > 1) %>%
  pull(variable)
```
but this instead throws a pretty confusing error
```
Error in filter_impl(.data, quo) :
  Evaluation error: comparison (6) is possible only for atomic and list types.
```
Because the data is actually in long format.
```r
skim(iris, Sepal.Length, Petal.Length) %>%
  filter(stat == "sd", value > 1) %>%
  pull(variable)
```
Gives you the expected output.
As the print method hides this dichotomy, I think this will cause users confusion, which would be best remedied by having two different functions. A simple solution would be to return the `skim_to_wide()` representation from `skim()`. Or maybe instead have a series of functions.

Regardless of the exact API, I think you should think hard about making it more obvious that the results of `skim()` are not simple rectangular tables.
I realize this is somewhat challenging because you have separate output for different column types. Another possible solution which would require minimal API changes is to modify the print method to make the heterogeneous parts more clear.
Designing `skim_with()` as an impure function which modifies global state seems an odd choice when it could instead be written as a functional which returns `skim()` functions with alternate defaults. Besides the impurity concerns, it also makes it more difficult to use different types of skim functions interchangeably. If you really want the impure interface, you could do so by defining `skim_with_defaults()` with the same interface as `skim_with()` and an implementation that simply calls the pure `skim_with()` function and assigns the result as the new default.
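As a rough illustration of the functional pattern being suggested (toy code with hypothetical names, not skimr's actual API):

```r
# Toy sketch of a pure skim_with(): instead of mutating global state, it
# returns a closure over the chosen summary functions, so differently
# configured skimmers can coexist side by side.
skim_with_pure <- function(defaults, overrides = list()) {
  funs <- utils::modifyList(defaults, overrides)
  function(x) vapply(funs, function(f) f(x), numeric(1))
}

default_stats <- list(
  mean = function(x) mean(x, na.rm = TRUE),
  sd   = function(x) stats::sd(x, na.rm = TRUE)
)

# A customized skimmer that adds an IQR statistic without touching defaults:
my_skim <- skim_with_pure(default_stats,
                          list(iqr = function(x) stats::IQR(x, na.rm = TRUE)))
my_skim(c(1, 2, 3, 4))
```

The impure interface could then be a thin wrapper that stores the returned function as the new default.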
Having the 'Using Fonts' vignette be an R Markdown template is kind of strange; also, I think it would be worth listing a few fonts which do support all the features needed for the sparklines / histograms.
`get_funs()` has a very weird implementation, particularly because its behavior seems equivalent to double square brackets. e.g. `options$functions[["numeric"]]` gives equivalent output to `get_func("numeric")`, likewise for `get_func("xyz")` (which is `NULL`).
The code inherited from pillar should have a comment nearby indicating such, so that you can more easily track upstream changes if the pillar implementation changes in the future.
I really think you should reconsider changing `min` and `max` to `p0` and `p100`. `min()` and `max()` are self-explanatory, but I would need to read the quantile documentation to be sure that `p0` is guaranteed to be the minimum value in the distribution, not subject to rounding error. I would actually prefer dealing with `-Inf` and `Inf` in the all-`NA` case to using `quantile(probs = 0)`.
Thanks for your review @jimhester!
This review is of skimr version 1.0.2, downloaded from GitHub 2018-01-29. Other relevant info: R version is 3.4.3; devtools version is 1.13.4.
The package includes all the following forms of documentation: URL, BugReports, and Maintainer (which may be autogenerated via `Authors@R`).

Estimated hours spent reviewing: 5
skimr is a very, very useful package that is generally well thought out in terms of both simplicity and extensibility. I particularly appreciate a thorough vignette for the main functions, intuitive function names, and the attention to where many users are likely to run into trouble, along with the proactive explanation of potential solutions. skimr does a specific set of tasks and does them simply and well. The source code is easy to digest in both style and function, and the comments are abundant, clear and helpful.
I do agree with Jim that it's confusing to have the printed output be a completely different format from what is output to the console; I'm most excited about (and expect) the wide format printed, vs the long data.frame returned. Perhaps have the default `skim()` return a named list, with an element for each data type containing the "wide" format as a data.frame, and have a `skim_long()` that returns the current long data frame.
Some specific suggestions:

- [link to description of this principle], displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a `skimr` object which can be included in a pipeline or displayed nicely for the human reader.
- `skim()` - great work showing common use cases and some really helpful functionality. Made it easy to add my own summary statistics. I did find a few typos and can include these in a PR if desired.
- I couldn't tell whether `skim_format(.levels = list(nchar = 4))` is working. I changed the value of this argument several times, and `skim(iris)` looked the same every time, both printed and as a data.frame. Could be user error, but I couldn't figure out how to make it affect the output.
- `?skim`: the `group_by` example doesn't actually group by anything (works when I group by Species).
- `skim_tee()` - I'm not totally clear on when it's beneficial.
- With `skim(sample(c(1:5, NA), size = 20, replace = TRUE))`, I get a histogram with what appears to be 8 bars.
- (`dplyr::cut`, I think) could be clearer - could be a good warning message to include up front.
- `p50` to `median`, particularly if you stick with `p0` and `p100`. (In terms of workflow, I do prefer NAs to -Inf/Inf, though :) )

Overall I'm excited to use this package in my own work, and really appreciate the hard and thoughtful work of all the authors.
Thanks a lot for your review @jenniferthompson!
@elinw now both reviews are in!
@elinw I had forgotten to ask you to add the rOpenSci review badge to the README!
[![](https://badges.ropensci.org/175_status.svg)](https://github.com/ropensci/onboarding/issues/175)
@jimhester @jenniferthompson @maelle Thank you so much for the reviews! They are very helpful and for the most part reflect issues @michaelquinn32 and I have been discussing often with users in the tracker (and Jennifer, we'd be thrilled with a PR for typos).
A couple of quick thoughts. @jimhester The fonts thing is tricky because I want the vignette to actually use the fonts, but I also didn't want to suggest extrafont and not put it in the requirements list. Any thoughts about how to accomplish that? Also, we don't inherit from pillar any more. Hadley told us to just put the code into our codebase since they are rapidly changing APIs and some parts are now internal. Hence Hadley and RStudio are now in the contributors list. I think, though, that you're right that it's probably worth adding comments for when and if there are problems. Just thinking that we may also want to send them a note about how we addressed the UTF-8 issue.
The NA versus Inf / p0, p50, p100 versus min, median, max issue is really interesting, I think. All of us have opinions on "what the user expects," but actually we have no idea, really, what "typical" users expect, and it is clear that there are very large portions that expect each. I did actually read the help on all of these, so I know that p0 and p100 are the lowest and highest values; it's just that they do not enforce transitivity the way min and max do. Another option is to write a more complex function, but that in a way makes the problem of being consistent with the core functions worse.
Lots to think about! Thanks!
Oops @jenniferthompson also mentioned the template (and had the same issue as someone in the issue tracker) so I'm trying to hunt that down. It is usually something goofy when they don't show up.
I think that with the histograms we need to come up with a more structured and probably less elegant approach when there are 2-9 separate values. We were more focused on the too many case than the too few.
I'm not sure what is going on with the help index; on my RStudio Server install from CRAN I have
But definitely don't have that when I build locally.
Some of the items listed are now in master (basic fixes to the readme, making the template work, adding a bit to the fonts vignette, fixing the wrong code in the format documentation). @jenniferthompson about Inf values, you're talking about when the data being skimmed has Inf, as opposed to when the skim produces it? That's definitely something we need to look at and handle. Probably overall edge case testing needs to be examined. I have a refactoring-of-tests project I'm working on, so I'll include the cases I can think of.
About Inf values, you're talking about when the data being skimmed has Inf
@elinw Yes, exactly! I tried it on one of my datasets; my variable really shouldn't have had Inf
as a value, so it was helpful in that it made me aware of it once I figured out what variable was causing the error. But I was thinking that maybe even more helpful would be either a check up front for Inf values with a more specific error message, or a workaround in the code to avoid the error but an informative warning message.
Well we count NAs so it would potentially make sense to count Inf, -Inf and maybe NaN also.
Thanks for the reviews everyone! I've tried to reduce it to a set of issues that we can start tracking. Please let me know if I missed anything:
Fix readme. The bullet points are clear.
Coalesce the skim API around skim_long (aliased skim), skim_wide, and skim list. The latter two should support downstream computation, and not just the rendered values. This is going to be a bit tricky for stats with multiple levels, since we normally collapse them in the rendered output.
skim_long is considered the default because it is the most terse representation of relatively heterogeneous data. For any df with more than one data type, skim_wide will be very sparse. Jim's example is going to give one or more rows with lots of NA values where statistics weren't computed.
Skimming to a list is nice for some work but it doesn't play nice with magrittr/dplyr pipes. The other alternative, using nested data frames as a list within a single skimmed data frame, isn't a great idea until dplyr better supports accessing nested data.
```r
x <- data.frame(id = 1:3,
                listcol = I(replicate(3, data.frame(a = 1, b = 2), simplify = FALSE)))
dplyr::select(x, listcol$a)
# Throws an error
```
-- Push back
The case that you gave above didn't work for us when we ran into ordered factors.
```r
y <- ordered(letters)
class(y)
#> [1] "ordered" "factor"
z <- list(factor = "blue")
z[[class(y)]]
#> Error in z[[class(y)]] : no such index at level 1
```
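A possible workaround (a sketch only, not necessarily what skimr ends up doing): pick out the element of `class(y)` that actually appears among the list's names, rather than indexing with the whole class vector:

```r
# With multiple classes, z[[class(y)]] fails because the index has length 2.
# Selecting the first class that matches a name in the list works:
y <- ordered(letters)
z <- list(factor = "blue")
idx <- intersect(class(y), names(z))[1]  # "factor"
z[[idx]]
#> [1] "blue"
```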
`max_char = 3` and `max_levels = 4`. All apologies if the documentation suggested otherwise.

We've fixed a bunch of these issues. @michaelquinn32 about the levels issue, the way it can work is really what you get when you go wide (spread) from the long form, because the levels are already separate rows in the long form. I'm just wondering if it makes sense to think about why people want to do calculations, if indeed that is what they want to do. I think some people want the formatted values because they are working on display, while other people want the numeric values because they want to do things like filter based on sd or on having non-missing cases, etc. Before we start implementing something we should think about what the ultimate goal is. That's why the question of doing selective skimming for a particular type is important too, because what does it mean when you have a mix of types? By the way, the same whole list of concerns applies to many summary.x() methods, so it's good to try to think this through, especially about what is not satisfying in summary() and its associated print methods. Just as an example, the `print()` for `summary(lm(y ~ x))` really does not reflect what is in the summary object. We could even consider a complex object like that.
@elinw @michaelquinn32 thanks both for your work!
Please create related issues in the skimr repo, and report back here your answers to reviewers and the specific questions you may have for them once you're ready.
I have a question for @jimhester: I did try moving the Rmd out of the docs folder to the top level (it's that way in develop right now), but I really feel it makes the top level a bit messy, since it doesn't really serve any purpose for GitHub users to have it there (unlike all the .Rd files).
Okay I'm starting a list of issues from onboarding comments. I'll keep adding them here until I get a full list.
Having the Rmd in the top level is easier for contributors, as that is the standard convention for where to put it in R projects. Because you need to edit the `*.Rmd` file rather than the `.md`, having both in the same directory makes this more obvious. For instance, I had no idea the Rmd was even in version control originally in skimr. If you do want to include the Rmd in a different directory, I would encourage you to have a comment in the md file mentioning this location, e.g. https://github.com/tidyverse/glue/blame/master/README.md#L2
An update.
As of today, I think the three easy ones (those that wouldn't involve API changes) have all been addressed. @michaelquinn32 has been working on a discussion paper about the API issues so that we can have a clear plan moving toward skimr 2.0 and our user community can chime in. That should be ready to share soon.
Thanks @elinw and @michaelquinn32 for the update and your work!
@jimhester In re-reading this morning I noticed this comment: "Another possible solution which would require minimal API changes is to modify the print method to make the heterogeneous parts more clear." That's something that we should be able to do with markup, I think, as an interim measure while we are deciding what to do with the larger API issue.
@jenniferthompson I still also need to look at the fewer than 8 bins issue.
Okay so we released 1.0.2 with all the bug and documentation fixes so far. I'm going to look at this possibility of using markup to make the display more clear to people.
Thanks for the update @elinw and congrats on the release! I guess you'll soon write the "final" answer to reviewers then?
@maelle That's the plan. I wanted to add that we now have in the develop branch a solution to the other issue that I think folks here and in our tracker have wanted: you can now define your own complete list of functions with a name and use `skim_with()` to append to or replace the default skimmers. The way I think about this, the use case would be that you have several different lists that you use for different situations, e.g. my_funs1, my_funs2, my_funs3. Then you still have to do two steps, but it is e.g.

```r
skim_with(.list = my_funs1, append = FALSE)
skim(x)
skim_with(.list = my_funs2, append = FALSE)
skim(y)
skim_with(!!!my_funs3, append = FALSE)
skim_with_defaults()
```

I think if this were part of your usual workflow you might write your own functions so you don't need to do that `append = FALSE` and `.list =`.
There's also a PR making it so that if you define an empty list of functions for a type that type is removed from the analysis.
Cool work! Thanks for the update.
@elinw Any update?
@elinw @michaelquinn32 any update?
Hi,
Yes, I think the development branch has pretty much addressed everything; I just wanted to put in Jenny's suggestion to make the display a bit nicer. I'll do that once I recover from the unconf! And then I think we could call it ready, and everything else in terms of possibly changing the API etc. will be for 2.0.
Ok thanks. Don't hesitate to notify the reviewers in person.
Okay here's an update @jimhester @jenniferthompson @michaelquinn32
As of now all changes have been merged into the master branch, and I'm planning to do a release of 1.0.3 because I got a message from Brian Ripley about a typo in a test (I can't believe that all the eyes that looked at this before missed it).
I think we have responded to all of the issues except the possibility of changing the API. Michael has written a road map for skimr 2.0 that addresses that, along with some architectural issues. But I think over the course of the discussion here and in the skimr repository we all came to agreement that cosmetic changes, making it clearer that the skim print results do not represent a single data frame, would be okay for skimr 1.
Changes for 2.0 are in this doc. https://docs.google.com/document/d/18lBStDZzd1rJq08O-4Sw2qHhuHEZ79QX4sBkeyzWNFY/edit?usp=sharing
I'm writing tests for the next version right now.
Thanks for the updates @elinw and @michaelquinn32, very cool to see such active development!
@jimhester and @jenniferthompson , are you happy with the current version of skimr
?
LGTM
Very likely! I'm traveling and won't have time to look at the latest version till next week, but will do asap.
Thanks @jimhester!
@jenniferthompson, thank you, we'll wait for your input next week, happy travels!
Summary
What does this package do? (explain in 50 words or less): Creates compact and flexible data summaries that are pipeable and display nicely in the console. For a given data frame or vector the skim function provides a useful set of summary statistics (based on the class of the individual vector/column) that allows users to skim their data to get an overall sense of what is included, extent of missing values, and similar information. The skim generic can be extended by users to other data structures.
Paste the full DESCRIPTION file inside a code block below:
URL for the package (the development repository, not a stylized html page): https://github.com/ropenscilabs/skimr
Please indicate which category or categories from our package fit policies this package falls under and why? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):
"tools" Because skimr provides users with a way to get started with a new, unknown data set (by getting a quick overview (or skim) of the data that is readable and compact. It also serves as a good tool for reporting summary information about data.
Who is the target audience and what are scientific applications of this package?
People who want to get a quick overview of a data set while initially reviewing, iteratively cleaning, or sharing with others.
Beginning statistics students who need simple access to a set of basic descriptive statistics.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
This is intended as an improved version of the R core summary functions. Like summary, skim is a generic that users can extend to any R data object. It is designed to be somewhat like a more flexible version of fivenum() from stats or favstats() from the mosaic package.

Requirements
Confirm each of the following by checking the box. This package:
Publication options

paper.md matching JOSS's requirements with a high-level description in the package root or in `inst/`.

Detail
[x] Does `R CMD check` (or `devtools::check()`) succeed? Paste and describe any errors or warnings:

[x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names: