@elinw Thanks very much for this submission. I will assign an editor shortly and we will then work through the editor checklist before we assign reviewers. We normally strive to get the process started within 5 business days, but we are approaching winter break (I'll be offline all of next week). So on the chance that we are not able to assign an editor/reviewers in the next two days, you'll see an update to this thread in the first week of January.
@elinw thanks for your submission & sorry about the delay! I see development is still active; do you consider skimr stable enough for review? It'd be best not to change too many things while the reviewers are looking.
The tests now pass on my laptop using the latest testthat version, thanks @jimhester for the help! He suggested adding `testthat (>= 2.0.0)` in DESCRIPTION.
Yes, as of now we have fixed the Windows issue in master. I'm about to push one more change and then I'm going to tag 1.0.1 sometime today. I'll definitely update the testthat minimum first.
And you can confirm that 1.0.1 will be good for review? I even have two reviewers lined up once you confirm that.
Yes it will be.
Thanks again for your submission, if I understand correctly the version for review will be ready today?
Add the testthat minimum version in DESCRIPTION.
You don't need to fill the Maintainer field in DESCRIPTION; it'll get filled from `Authors@R`.
goodpractice
output for you and the reviewers to have in mind
```
It is good practice to

  ✖ write short and simple functions. These functions have high cyclomatic
    complexity: kable.data.frame (60).
  ✖ avoid long code lines, it is bad for readability. Also, many people prefer
    editor windows that are about 80 characters wide. Try make your lines
    shorter than 80 characters

    R\skim.R:14:1
    R\skim.R:73:1
    R\stats.R:28:1
    R\stats.R:52:1
    R\stats.R:64:1
    ... and 20 more lines
```
Two typos identified via `devtools::spell_check()`:

```
differnt    skimr-package.Rd:16
faciliates  skim_to_wide.Rd:19
```
Reviewers: @jimhester @jenniferthompson Due date: 2018-01-30
@jimhester and @jenniferthompson thanks for agreeing to review this package! Your reviews are due on 2018-01-30.
You'll find here the review template and our reviewing guide.
Thanks for the quick checks. I have done the easy ones, but there is one issue that we can't fix and two I'd like feedback on.
The complex function one is a function that is brought in from knitr, so we can't do anything about that. In our code it is just

```r
kable.data.frame <- knitr::kable
```
In terms of long lines, the ones that remain are in tests of the print function, so they are the full printed line. If we don't do that we get the tibble wrap, which defeats the idea of testing the whole line. On the other hand, if we want, we could split those strings into two parts and then `paste0()` them, which would keep the lines under 80 but still give us a correct version that is the full line. Does it make sense to do that? Or does that actually make the tests themselves harder to read?
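A minimal sketch of that split-and-paste idea (the expected string below is illustrative, not skimr's actual print output, and `printed_lines` is a hypothetical name):

```r
# Sketch: split an over-80-character expected line into two shorter string
# literals and rejoin them with paste0(). The source file stays under 80
# columns, but the comparison still covers the full printed line.
expected <- paste0(
  " Sepal.Length       0      150 150 5.84 0.83 4.3  5.1    5.8  6.4",
  " 7.9  <histogram>"
)
# In the test itself you would then compare against the captured output, e.g.
# expect_identical(printed_lines[3], expected)
nchar(expected)  # the joined line is longer than 80 characters
```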
We have some things that are only run for windows_os, and those are the lines showing up without code coverage in goodpractice. We've excluded them from testing on Travis since the code will never be hit on Linux or Mac. Is it possible to write conditional exclusions so that the tests would run when built on Windows?
You can run covr on appveyor as well, and the coverage reports will be automatically merged by codecov.
Cool, just add the same line to appveyor.yml?

```yaml
- Rscript -e 'covr::codecov()'
```
AppVeyor needs to use double quotes, and you can add it to `on_success`:

```yaml
on_success:
  - Rscript -e "covr::codecov()"
```
And I think you may need to add your codecov token to the AppVeyor settings. You can get the token at https://codecov.io/gh/ropenscilabs/skimr/settings and add it as `CODECOV_TOKEN` at https://ci.appveyor.com/project/ropenscilabs/skimr/settings/environment
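Putting those pieces together, the relevant appveyor.yml section might look roughly like this (a sketch; the encrypted token value is a placeholder, and the token can instead be set through the AppVeyor settings UI as described above):

```yaml
environment:
  # Placeholder: paste the encrypted form of your codecov token here,
  # or set CODECOV_TOKEN via the AppVeyor project settings UI instead.
  CODECOV_TOKEN:
    secure: <encrypted-token>

on_success:
  - Rscript -e "covr::codecov()"
```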
Thanks Jim! It seems to be working after https://github.com/ropenscilabs/skimr/pull/243/commits/c34cec75f8bad36669ffe63c69296fb9c1710089
Release 1.0.1 should be on CRAN at this point. @michaelquinn32 is doing some refactoring that should be invisible to end users, and we have plans for updating some tests and documentation (and some feature requests), but other than that things are stable.
@jimhester and @jenniferthompson friendly reminder that your reviews are due on Tuesday next week
@michaelquinn32 @elinw FYI, I'm getting a "Project not found or access denied" error when I try to look at the Appveyor page via clicking the README badge. I'm not experienced in CI, so I'm not sure if that should have been addressed with the fix Jim mentioned a couple of weeks ago.
Also, if I've caught some typos in the vignettes/examples, would y'all like a PR or no?
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation: URL, BugReports, and Maintainer (which may be autogenerated via `Authors@R`).

All review comments are based off of version 1.0.1. Overall I feel like the package would be a welcome addition to rOpenSci after the review process is complete.
- `## Dev mode: on` at the top. `devtools::install_github("ropenscilabs/skimr@develop")` (not sure it is necessary to mention the develop branch at all here)
- `usethis::use_readme_rmd()` is a good way to set this up.
- `kable()` and `pander()`, and possibly include examples of what 'enhanced' options are available and what they look like.

Displaying the results of `skim()`
as seemingly a table, but having the
underlying data as a table in a different format is quite surprising. I pretty
strongly think that you should have two functions, one which returns the
summary information in a wide format, the other in the long format. e.g. From
the print output I would expect to be able to run the following to retrieve
variables in the dataset with standard deviations greater than 1
```r
skim(iris, Sepal.Length, Petal.Length) %>%
  filter(sd > 1) %>%
  pull(variable)
```
but this instead throws a pretty confusing error
```
Error in filter_impl(.data, quo) :
  Evaluation error: comparison (6) is possible only for atomic and list types.
```
Because the data is actually in long format.
```r
skim(iris, Sepal.Length, Petal.Length) %>%
  filter(stat == "sd", value > 1) %>%
  pull(variable)
```
Gives you the expected output.
As the print method hides this dichotomy, I think this will cause users confusion, which would be best remedied by having two different functions. A simple solution would be to return the `skim_to_wide()` representation from `skim()`. Or maybe instead have a series of functions.

Regardless of the exact API, I think you should think hard about making it more obvious that the results of `skim()` are not simple rectangular tables.
I realize this is somewhat challenging because you have separate output for different column types. Another possible solution which would require minimal API changes is to modify the print method to make the heterogeneous parts more clear.
Designing `skim_with()` as an impure function which modifies global state seems an odd choice when it could instead be written as a functional which returns `skim()` functions with alternate defaults. Besides the impurity concerns, it also makes it more difficult to use different types of skim functions interchangeably. If you really want the impure interface, you could do so by defining `skim_with_defaults()` with the same interface as `skim_with()` and an implementation that simply calls the pure `skim_with()` function and assigns the result as the new default.
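As a rough illustration of the functional pattern being suggested (toy code with hypothetical names, not skimr's actual API):

```r
# Toy sketch of a pure skim_with(): instead of mutating global state, it
# returns a closure over the chosen summary functions, so differently
# configured skimmers can coexist side by side.
skim_with_pure <- function(defaults, overrides = list()) {
  funs <- utils::modifyList(defaults, overrides)
  function(x) vapply(funs, function(f) f(x), numeric(1))
}

default_stats <- list(
  mean = function(x) mean(x, na.rm = TRUE),
  sd   = function(x) stats::sd(x, na.rm = TRUE)
)

# A customized skimmer that adds an IQR statistic without touching defaults:
my_skim <- skim_with_pure(default_stats,
                          list(iqr = function(x) stats::IQR(x, na.rm = TRUE)))
my_skim(c(1, 2, 3, 4))
```

The impure interface could then be a thin wrapper that stores the returned function as the new default.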
Having the 'Using Fonts' vignette be an R Markdown template is kind of strange; also, I think it would be worth listing a few fonts which do support all the features needed for the sparklines / histograms.
`get_funs()` has a very weird implementation, particularly because its behavior seems equivalent to double square brackets. e.g. `options$functions[["numeric"]]` gives equivalent output to `get_func("numeric")`, likewise for `get_func("xyz")` (which is `NULL`).
The code inherited from pillar should have a comment nearby indicating such, so that you can more easily track upstream changes if the pillar implementation changes in the future.
I really think you should reconsider changing `min` and `max` to `p0` and `p100`. `min()` and `max()` are self-explanatory, but I would need to read the quantile documentation to be sure that `p0` is guaranteed to be the minimum value in the distribution, not subject to rounding error. I would actually prefer dealing with `-Inf` and `Inf` in the all-`NA` case to using `quantile(probs = 0)`.
Thanks for your review @jimhester!
This review is of skimr version 1.0.2, downloaded from GitHub 2018-01-29. Other relevant info: R version is 3.4.3; devtools version is 1.13.4.
The package includes all the following forms of documentation: URL, BugReports, and Maintainer (which may be autogenerated via `Authors@R`).

Estimated hours spent reviewing: 5
skimr is a very, very useful package that is generally well thought out in terms of both simplicity and extensibility. I particularly appreciate a thorough vignette for the main functions, intuitive function names, and the attention to where many users are likely to run into trouble, along with the proactive explanation of potential solutions. skimr does a specific set of tasks and does them simply and well. The source code is easy to digest in both style and function, and the comments are abundant, clear and helpful.
I do agree with Jim that it's confusing to have the printed output be a completely different format from what is output to the console; I'm most excited about (and expect) the wide format printed, vs the long data.frame returned. Perhaps have the default `skim()` return a named list, with an element for each data type containing the "wide" format as a data.frame, and have a `skim_long()` that returns the current long data frame.
Some specific suggestions:

- [link to description of this principle], displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a `skimr` object which can be included in a pipeline or displayed nicely for the human reader.
- `skim()` - great work showing common use cases and some really helpful functionality. Made it easy to add my own summary statistics. I did find a few typos and can include these in a PR if desired.
- I couldn't tell whether `skim_format(.levels = list(nchar = 4))` is working. I changed the value of this argument several times, and `skim(iris)` looked the same every time, both printed and as a data.frame. Could be user error, but I couldn't figure out how to make it affect the output.
- `?skim`: the `group_by` example doesn't actually group by anything (works when I group by Species).
- `skim_tee()` - I'm not totally clear on when it's beneficial.
- With `skim(sample(c(1:5, NA), size = 20, replace = TRUE))`, I get a histogram with what appears to be 8 bars.
- (`dplyr::cut`, I think) could be clearer - could be a good warning message to include up front.
- `p50` to `median`, particularly if you stick with `p0` and `p100`. (In terms of workflow, I do prefer NAs to -Inf/Inf, though :) )

Overall I'm excited to use this package in my own work, and really appreciate the hard and thoughtful work of all the authors.
Thanks a lot for your review @jenniferthompson!
@elinw now both reviews are in!
@elinw I had forgotten to ask you to add the rOpenSci review badge to the README!
[![](https://badges.ropensci.org/175_status.svg)](https://github.com/ropensci/onboarding/issues/175)
@jimhester @jenniferthompson @maelle Thank you so much for the reviews! They are very helpful and for the most part reflect issues @michaelquinn32 and I have been discussing often with users in the tracker (and Jennifer, we'd be thrilled with a PR for typos).
A couple of quick thoughts. @jimhester The fonts thing is tricky because I want the vignette to actually use the fonts, but I also didn't want to suggest extrafont and not put it in the requirements list. Any thoughts about how to accomplish that? Also, we don't inherit from pillar any more. Hadley told us to just put the code into our codebase since they are rapidly changing APIs and some parts are now internal. Hence Hadley and RStudio are now in the contributors list. I think, though, that you're right that it's probably worth adding comments for when and if there are problems. Just thinking that we may also want to send them a note about how we addressed the UTF-8 issue.
The NA versus Inf / p0, p50, p100 versus min, median, max issue is really interesting, I think. All of us have opinions on "what the user expects," but actually we have no idea, really, what "typical" users expect, and it is clear that there are very large portions that expect each. I did actually read the help on all of these, so I know that p0 and p100 are the lowest and highest values; it's just that they do not enforce transitivity the way min and max do. Another option is to write a more complex function, but that in a way makes the problem of being consistent with the core functions worse.
Lots to think about! Thanks!
Oops @jenniferthompson also mentioned the template (and had the same issue as someone in the issue tracker) so I'm trying to hunt that down. It is usually something goofy when they don't show up.
I think that with the histograms we need to come up with a more structured and probably less elegant approach when there are 2-9 separate values. We were more focused on the too many case than the too few.
I'm not sure what is going on with the help index; on my RStudio Server install from CRAN I have
But definitely don't have that when I build locally.
Some of the items listed are now in master (basic fixes to the readme, making the template work, adding a bit to the fonts vignette, fixing the wrong code in the format documentation). @jenniferthompson about Inf values, you're talking about when the data being skimmed has Inf, as opposed to when the skim produces it? That's definitely something we need to look at and handle. Probably overall edge case testing needs to be examined. I have a refactoring-of-tests project I'm working on, so I'll include the cases I can think of.
About Inf values, you're talking about when the data being skimmed has Inf
@elinw Yes, exactly! I tried it on one of my datasets; my variable really shouldn't have had Inf
as a value, so it was helpful in that it made me aware of it once I figured out what variable was causing the error. But I was thinking that maybe even more helpful would be either a check up front for Inf values with a more specific error message, or a workaround in the code to avoid the error but an informative warning message.
Well we count NAs so it would potentially make sense to count Inf, -Inf and maybe NaN also.
Thanks for the reviews everyone! I've tried to reduce it to a set of issues that we can start tracking. Please let me know if I missed anything:
Fix readme. The bullet points are clear.
Coalesce the skim API around skim_long (aliased skim), skim_wide, and skim list. The latter two should support downstream computation, and not just the rendered values. This is going to be a bit tricky for stats with multiple levels, since we normally collapse them in the rendered output.
skim_long is considered the default because it is the most terse representation of relatively heterogeneous data. For any df with more than one data type, skim_wide will be very sparse. Jim's example is going to give one or more rows with lots of NA values where statistics weren't computed.
Skimming to a list is nice for some work but it doesn't play nice with magrittr/dplyr pipes. The other alternative, using nested data frames as a list within a single skimmed data frame, isn't a great idea until dplyr better supports accessing nested data.
```r
x <- data.frame(id = 1:3,
                listcol = I(replicate(3, data.frame(a = 1, b = 2), simplify = FALSE)))
dplyr::select(x, listcol$a)
# Throws an error
```
-- Push back
The case that you gave above didn't work for us when we ran into ordered factors.
```r
y <- ordered(letters)
class(y)
#> [1] "ordered" "factor"
z <- list(factor = "blue")
z[[class(y)]]
#> Error in z[[class(y)]] : no such index at level 1
```
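A possible workaround (a sketch only, not necessarily what skimr ends up doing): pick out the element of `class(y)` that actually appears among the list's names, rather than indexing with the whole class vector:

```r
# With multiple classes, z[[class(y)]] fails because the index has length 2.
# Selecting the first class that matches a name in the list works:
y <- ordered(letters)
z <- list(factor = "blue")
idx <- intersect(class(y), names(z))[1]  # "factor"
z[[idx]]
#> [1] "blue"
```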
`max_char = 3` and `max_levels = 4`. All apologies if the documentation suggested otherwise.

We've fixed a bunch of these issues. @michaelquinn32 about the levels issue, the way it can work is really what you get when you go wide (spread) from the long form, because the levels are already separate rows in the long form. I'm just wondering if it makes sense to think about why people want to do calculations, if indeed that is what they want to do. I think some people want the formatted values because they are working on display, while other people want the numeric values because they want to do things like filter based on sd or on having non-missing cases, etc. Before we start implementing something we should think about what the ultimate goal is. That's why the question of doing selective skimming for a particular type is important too, because what does it mean when you have a mix of types? By the way, the same whole list of concerns applies to many summary.x() methods, so it's good to try to think this through, especially about what is not satisfying in summary() and its associated print methods. Just as an example, the `print()` for `summary(lm(y ~ x))` really does not reflect what is in the summary object. We could even consider a complex object like that.
@elinw @michaelquinn32 thanks both for your work!
Please create related issues in the skimr repo, and report back here your answers to reviewers and the specific questions you may have for them once you're ready.
I have a question for @jimhester: I did try moving the Rmd out of the docs folder to the top level (it's that way in develop right now), but I really feel it makes the top level a bit messy, since it doesn't really serve any purpose for GitHub users to have it there (unlike all the .Rd files).
Okay I'm starting a list of issues from onboarding comments. I'll keep adding them here until I get a full list.
Having the Rmd in the top level is easier for contributors, as that is the standard convention for where to put it in R projects. Because you need to edit the `*.Rmd` file rather than the `.md`, having both in the same directory makes this more obvious. For instance, I had no idea the Rmd was even in version control originally in skimr. If you do want to include the Rmd in a different directory, I would encourage you to have a comment in the md file mentioning this location, e.g. https://github.com/tidyverse/glue/blame/master/README.md#L2
An update.
As of today, I think the three easy ones (those that wouldn't involve API changes) have all been addressed. @michaelquinn32 has been working on a discussion paper about the API issues so that we can have a clear plan moving toward skimr 2.0 and our user community can chime in. That should be ready to share soon.
Thanks @elinw and @michaelquinn32 for the update and your work!
@jimhester In re-reading this morning I noticed this comment: "Another possible solution which would require minimal API changes is to modify the print method to make the heterogeneous parts more clear." That's something that we should be able to do with markup, I think, as an interim measure while we are deciding what to do with the larger API issue.
@jenniferthompson I still also need to look at the fewer than 8 bins issue.
Okay so we released 1.0.2 with all the bug and documentation fixes so far. I'm going to look at this possibility of using markup to make the display more clear to people.
Thanks for the update @elinw and congrats on the release! I guess you'll soon write the "final" answer to reviewers then?
@maelle That's the plan. I wanted to add that we now have in the develop branch a solution to the other issue that I think folks here and in our tracker have wanted: you can now define your own complete list of functions with a name and use `skim_with()` to append to or replace the default skimmers. The way I think about this, the use case would be that you have several different lists that you use for different situations, e.g. my_funs1, my_funs2, my_funs3. Then you still have to do two steps, but it is e.g.

```r
skim_with(.list = my_funs1, append = FALSE)
skim(x)
skim_with(.list = my_funs2, append = FALSE)
skim(y)
skim_with(!!!my_funs3, append = FALSE)
skim_with_defaults()
```

I think if this were part of your usual workflow you might write your own functions so you don't need to do that `append = FALSE` and `.list =`.
There's also a PR making it so that if you define an empty list of functions for a type that type is removed from the analysis.
Cool work! Thanks for the update.
@elinw Any update?
@elinw @michaelquinn32 any update?
Hi,
Yes, I think the development branch has pretty much addressed everything; I just wanted to put in Jenny's suggestion to make the display a bit nicer. I'll do that once I recover from the unconf! And then I think we could call it ready, and everything else in terms of possibly changing the API etc. will be for 2.0.
Ok thanks. Don't hesitate to notify the reviewers in person.
Okay here's an update @jimhester @jenniferthompson @michaelquinn32
As of now all changes have been merged into the master branch, and I'm planning to do a release of 1.0.3 because I got a message from Brian Ripley about a typo in a test (I can't believe that all the eyes that looked at this before missed it).
I think we have responded to all of the issues except the possibility of changing the API. Michael has written a road map for skimr 2.0 that addresses that, along with some architectural issues. But I think over the course of the discussion here and in the skimr repository we all came to agreement that cosmetic changes, making it clearer that the skim print results do not represent a single data frame, would be okay for skimr 1.
Changes for 2.0 are in this doc. https://docs.google.com/document/d/18lBStDZzd1rJq08O-4Sw2qHhuHEZ79QX4sBkeyzWNFY/edit?usp=sharing
I'm writing tests for the next version right now.
Thanks for the updates @elinw and @michaelquinn32, very cool to see such active development!
@jimhester and @jenniferthompson , are you happy with the current version of skimr
?
LGTM
Very likely! I'm traveling and won't have time to look at the latest version till next week, but will do asap.
Thanks @jimhester!
@jenniferthompson, thank you, we'll wait for your input next week, happy travels!
Summary
What does this package do? (explain in 50 words or less): Creates compact and flexible data summaries that are pipeable and display nicely in the console. For a given data frame or vector the skim function provides a useful set of summary statistics (based on the class of the individual vector/column) that allows users to skim their data to get an overall sense of what is included, extent of missing values, and similar information. The skim generic can be extended by users to other data structures.
Paste the full DESCRIPTION file inside a code block below:
URL for the package (the development repository, not a stylized html page): https://github.com/ropenscilabs/skimr
Please indicate which category or categories from our package fit policies this package falls under and why? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):
"tools" Because skimr provides users with a way to get started with a new, unknown data set (by getting a quick overview (or skim) of the data that is readable and compact. It also serves as a good tool for reporting summary information about data.
Who is the target audience and what are scientific applications of this package?
People who want to get a quick overview of a data set while initially reviewing, iteratively cleaning, or sharing with others.
Beginning statistics students who need simple access to a set of basic descriptive statistics.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
This is intended as an improved version of the R core summary functions. Like summary, skim is a generic that users can extend to any R data object. It is designed to be somewhat like a more flexible version of fivenum() from stats or favstats() from the mosaic package.

Requirements
Confirm each of the following by checking the box. This package:
Publication options

paper.md matching JOSS's requirements with a high-level description in the package root or in `inst/`.

Detail
[x] Does `R CMD check` (or `devtools::check()`) succeed? Paste and describe any errors or warnings:

[x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names: