Skim function - Githubissues

elinw commented 4 years ago

Hi,

Congratulations on the beautiful package!

We ( @michaelquinn32 and I) noticed your datasummary_skim() function the output of which looks a lot like that for skimr::skim(). Was it inspired by our package? It's pretty complex to maintain the package because of things like Windows not supporting UTF-8 in data frames which makes the histograms impossible. It might be simpler to use skimr directly. Would you be interested in a PR to support a customized skimmer using skimr? Did you run into an issue trying to do that?

vincentarelbundock commented 4 years ago

@elinw thanks so much for opening this issue! I meant to reach out, but then life got crazy with work and covid homeschool, and I totally forgot. Sorry!

Your intuition is totally right. Here's a screenshot from the modelsummary website:

Although I did not copy your code and I figured out the unicode strategy on my own, duplicating some of the skimr functionality really came from fandom. From that perspective, I want to state at the outset that I am more than happy to make changes if you feel uncomfortable with my current approach (e.g., change the function name).

Using skimr directly sounds like a very interesting idea! Honestly, I didn't think through many of the complications, and I don't have a Windows box to test on. Right now, I expect the spark histogram to fail when:

Output is compiled to LaTeX
Display font does not support unicode bar characters
Windows machines, unless users try some of the weird locale tricks you suggest on the website

Would using skimr get me any of those for "free"? Will it prevent other problems I haven't thought about?

For me, the main cost of using skimr would be to add yet another dependency (on Github, I moved ggplot2 to suggests, and am planning to further reduce my reliance on external packages). This is mainly a problem if the skimr API changes. How stable do you think the API is likely to be?

Aside from that, it doesn't look like skimr would add that many upstream dependencies:

library(miniCRAN)
mdep <- c('broom', 'checkmate', 'dplyr', 'kableExtra', 'knitr', 'magrittr', 'rmarkdown', 'tables', 'tibble', 'tidyr')
mdep <- pkgDep(mdep, suggests=FALSE)
sdep <- pkgDep("skimr", suggests=FALSE)
setdiff(sdep, mdep)
#> [1] "skimr" "repr"  "withr"

(Incidentally, when I grep the skimr repo, I don't see any call to repr, and just a single call to withr. Does skimr need those?)

Looooongwinded way of saying: "Yes, interested! What do you think are the benefits for different output formats?"

For reference, the internal function modelsummary::factory accepts a dataframe or tibble, does some minor tranforms, and dispatches it to one of factory_* functions for each of the table-making packages. The only thing that datasummary_skim needs to do is prepare a dataframe and feed it to factory.

vincentarelbundock commented 4 years ago

FYI, I have some prototype code using kableExtra's new inline plot functionality. Please don't invest time in a PR until I decide if I want to move away from unicode to inline svg.

In any case, I plan to (a) change the output somewhat so it differs from skim, and (2) add an explicit mention to skimr in the detailed description of the function so that people know the intellectual origin when they type datasummary_skim.

I'll keep you updated on whether I move to skimr or kableExtra.

Thanks again,.

elinw commented 4 years ago

Okay, let me know. Doing a custom skimmer is really easy with our function factory, and now that we are at skimr 2 we don't expect that API to change at all. Yes in skimr the histograms don't currently render on Windows (we test for windows and then don't do it). Skimr 2 skim() has a pretty different API in that it returns a tibble/data frame but you can also return just the numerics or a list by class. We've found that there is a lot of complexity in doing this (many edge cases and so on) and you might want to put that on us.

vincentarelbundock commented 4 years ago

Sounds great! Thanks again for your suggestions! I will definitely consider switching at some point if I run into too many data-handling issues. For now, I have:

Added a shout-out to skimr in the docs
Modified the output somewhat to include Min-Med-Max instead of all quartiles
Moved to kableExtra's spec_hist, which gives me automatic support for Windows, but no support for markdown histograms (but support for HTML and Rmarkdown PDF+HTML).

Screen Shot 2020-09-15 at 15 24 29

Screen Shot 2020-09-15 at 15 26 07

Screen Shot 2020-09-15 at 15 31 57

vincentarelbundock / modelsummary

Skim function #132