ropensci / unconf17

Website for 2017 rOpenSci Unconf
http://unconf17.ropensci.org

A framework for reproducible tables #69

Open bzkrouse opened 7 years ago

bzkrouse commented 7 years ago

In my work (clinical research), we make a lot of tables, usually comparing 2 or more groups. It's nice to format the table programmatically so that it is reproducible and ready for publication. The process to do so usually looks something like this:

With tidy tools like dplyr, broom, and purrr, it is easier than ever before to create the self-contained data frame. However, getting all the necessary pieces and working the df into a table-ready format is a process that seems to be recreated from scratch each time. It would be great to have a tool that helps automate this process a bit. Here are some vague thoughts on what this could look like:
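For concreteness, here is a sketch of the kind of dplyr/broom/purrr pipeline described above; the model and the grouping variable are purely illustrative:

```r
library(dplyr)
library(purrr)
library(broom)

# Fit one model per group, then collect the tidy coefficient rows
# into a single, self-contained, table-ready data frame
coef_tbl <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map_dfr(tidy, .id = "cyl")

coef_tbl
```

From here, everything that remains (rounding, labeling, laying out rows and spanning headers) is exactly the formatting work that gets recreated from scratch each time.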

Does anyone have any interest or thoughts about this topic? Are there any tools already out there that help with this? If not an unconf project, would love a related discussion about people’s workflows!

sfirke commented 7 years ago

I so hear this. I think the formatting aspect of summary tables in R is quite tedious and a barrier to winning people over from Excel for routine analyses.

I took a crack at this with the janitor package, specifically creating tabulations and 2-way/crosstab/contingency tables and formatting them with percentages, rounding, etc. for quick publication. I have focused on simple counts and percentages rather than statistics, but maybe the formatting aspect could be leveraged?

I'm rethinking the approach to janitor's tabulations and formatting, making the functions more modular and coherent and less a set of utilities. If this comes kind of close, maybe something could be built into janitor or those functions or ideas could be extended. Or if it should be something separate, that's great too and I'd love to help ⛏
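For reference, a minimal janitor tabulation looks like this (a sketch; the exact output shape may differ between package versions):

```r
library(janitor)

# One-way tabulation: counts plus percentages in a single call
cyl_tab <- tabyl(mtcars$cyl)
cyl_tab
```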

haozhu233 commented 7 years ago

To help generate nice-looking tables with grouping factors, I wrote a package called kableExtra a few months ago. Basically, you can do something like the following in a pdf_document:

library(dplyr)
library(knitr)
library(kableExtra)
library(ezsummary)

mtcars %>%
  group_by(cyl) %>%
  ezsummary(flavor = "wide") %>%
  kable(format = "latex", booktabs = TRUE,
        col.names = c("variable", rep(c("mean", "sd"), 3))) %>%
  add_header_above(c(" ", "4 cyl" = 2, "6 cyl" = 2, "8 cyl" = 2))

Then you will get something like

[screenshot: the resulting LaTeX table, with "4 cyl" / "6 cyl" / "8 cyl" spanning headers over the mean/sd columns]

(ezsummary is something I wrote in the past; some of its design is a little below my expectations, but I still use it sometimes. :P )

njtierney commented 7 years ago

This is a great idea!

Creating these tables is something I find so frustrating when writing a paper or report. It's totally one of those things where I've just gone:

Bah! I'll just write in the values manually just this once

Except it's almost never just this once, and it adds to reproducibility hell.

Having tools that make it easier to create these sorts of tables would for sure ease a pressure point in reproducibility.

ezsummary and kableExtra both look amazing, @haozhu233! I'd love to learn more about ezsummary and kableExtra and see if we can develop them further.

stephlocke commented 7 years ago

I don't know if this might be of interest, but I met the guy who built this the other day and was impressed with the level of docs: https://cran.r-project.org/web/packages/pivottabler/index.html

batpigandme commented 7 years ago

@njtierney I'm digging kableExtra atm, but I've also used huxtable, and pander. huxtable has a "table of regressions" format, which you can see here. I'm sure there are still lots of gaps, just wanted to put these out there.
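A minimal sketch of that "table of regressions" format, via huxtable's huxreg() (assuming the current API):

```r
library(huxtable)

# Two nested models, displayed side by side with standard errors
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + cyl, data = mtcars)
reg_tab <- huxreg(m1, m2)
reg_tab
```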

njtierney commented 7 years ago

Ah, good to know! It's great to gather all these resources together!

Maybe we can work together on some examples of tables we have made for papers/reports, try all these different methods/pkgs out, and then work out what was great and what could be improved?

bzkrouse commented 7 years ago

Wow, it's nice to hear other people are having similar thoughts (well said @njtierney !). @sfirke and @haozhu233 - really appreciate the tools you've built and the fact that you've already spent so much time thinking about this problem. If any of these tools can be leveraged or extended that would be amazing. It would be great to figure out a way to incorporate more of the statistical/modeling aspect of the analysis. More specifically - in the case of a table that contains many models/tests, a potential tool could pair nicely with the purrr workflow.

I will mention the tangram package that came onto my radar yesterday that I don't know much about but seems to have a unique table building model.

haozhu233 commented 7 years ago

@njtierney Great idea! I think this type of "literature review" will be very useful for our community. After that we will have a better understanding of what we have right now and what exactly we need. I can imagine that during the unconf we could easily generate a blog post that @stefaniebutland would like to see. ;)

njtierney commented 7 years ago

So many interesting things to work on at the unconf!

jhollist commented 7 years ago

Agree with all that this is needed, and this thread helps summarize a lot. Just an idea: what about a gallery of tables with the code to produce them? Something similar to @haozhu233's example above, but for different typical table types.

haozhu233 commented 7 years ago

I feel like the gallery @jhollist just mentioned would definitely be super helpful. We can also borrow some ideas from the design of ggplot2, which has both ggplot(), which is powerful and customizable, and qplot(), which provides shortcuts for common plot types.

bzkrouse commented 7 years ago

These are great ideas! I agree the lit review and gallery concept will both be very helpful and great resources for the broader community. It would be nice to take stock of what tools are out there and what types of tables should be covered. @haozhu233 - yes!! to your idea of structuring like ggplot2. That sounds ideal for a tool that is meant to be easy to use and easily extensible. Maybe we could try out a paradigm where you start with a simple table and add "layers" of details, complexities, and/or customizations.

elinw commented 7 years ago

This is great; I actually mentioned something like this to @stefaniebutland in my talk with her. It's a huge issue in sociology because we make crosstabs a lot, and they are a real pain in R, especially with multiple variables. This is something I wrote to make crosstab-making easier for my students (and me): https://github.com/elinw/lehmansociology/blob/master/R/crosstab.R but the print function is really painful. Even what should be simple frequency tables are hard in base R; this is what we came up with just to illustrate: https://github.com/elinw/lehmansociology/blob/master/R/frequency.R. @sfirke I'm going to have a look at janitor!
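For contrast, the base R route being described looks roughly like this; each step is easy, but the result is awkward to format for publication:

```r
# Counts, then row percentages -- workable, but styling the result
# for a report is where the pain starts
tab <- table(cyl = mtcars$cyl, am = mtcars$am)
row_pct <- prop.table(tab, margin = 1)
round(100 * row_pct, 1)
```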

elinw commented 7 years ago

Wow, kableExtra, nice!

If we are making a literature review, then I think formattable should be in there. And of course the tables package.
I have a lot of PHP/Web experience and the way tables are handled in R always feels very different to me.

karawoo commented 7 years ago

Someone gave a lightning talk at the Seattle useR meetup on this topic a few months ago. He showed a few examples, one of which was tableone.
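tableone targets exactly the clinical "Table 1" use case from the original post; a minimal sketch (the variable choices here are illustrative):

```r
library(tableone)

# Summarize variables, stratified by a grouping variable
t1 <- CreateTableOne(vars = c("mpg", "wt", "gear"),
                     strata = "am", data = mtcars)
print(t1)
```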

haozhu233 commented 7 years ago

Just saw desctable on my github timeline. It seems to be another good fit for this issue.

bzkrouse commented 7 years ago

Nice, lots of examples :) desctable is really interesting! It seems to focus more on ease of process and content than on styling (my impression is that some of these packages emphasize one or the other). I'll throw another one into the mix: arsenal, which gets more into stats and models. (@elinw this may be of interest to you for frequency tables...)

sfirke commented 7 years ago

There are also older-school table printing options, like gmodels::CrossTable() and there's one in Hmisc (I think summary?). A literature review of what's out there and how it differs would be a boon to folks navigating all of the options. Makes me think of reviews of a field of products from The WireCutter.
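For completeness, the older-school option mentioned prints a SAS/SPSS-style crosstab with proportions annotated in each cell:

```r
library(gmodels)

# Counts plus row/column/total proportions in every cell;
# the result is also returned (invisibly) as a list
ct <- CrossTable(mtcars$cyl, mtcars$am)
```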

bzkrouse commented 7 years ago

Summary of this thread:

There are lots of existing packages/functions for creating and/or formatting tables of various types. There seems to be a consensus that more work may be needed in this area, but we first need to understand all that is available right now. The great discussion in #78 could inform this process. From there, we can determine what is needed going forward. Potential ideas for the unconf, summarized from discussion above:

1) Perform "lit review" of existing packages

  • perform as a case study for #78
  • compare existing packages by trying them out on a set of common table types
  • create a gallery of tables with the code to produce them
  • create a blog post

2) Are there improvements to be made? If so, planning the future of tables in R:

  • would be informed by the lit review
  • consider extending existing packages or creating a new one
  • borrow ideas from ggplot2

jhollist commented 7 years ago

I will be following along with the unconf remotely (via Slack, issues, Twitter, etc.). I'll keep my eye on this, and if there is anything I can do remotely, I'd be happy to help. If it makes sense we could chat via appear.in (I'm not on Skype).

I do really like this idea of "lit reviews" for packages. It feels like a more targeted/granular version of a task view, and I think it could be very useful. We've had fits and starts of a discussion on what to do with https://github.com/ropensci/maptools. It was intended to be a Task View, but we got some pushback due to the overlap with the Spatial Task View. Anyway, I think this general idea of targeted reviews could fill the void between "packages useful for a broad area" and "use package X to do Y".

And thanks for the interesting discussion!


maelle commented 7 years ago

In case it wasn't on your list, I just saw this: https://gdemin.github.io/expss/

jsta commented 7 years ago

I was digging into the huxtable docs and found a vignette which compares the features of many table making packages: https://cran.r-project.org/web/packages/huxtable/vignettes/design-principles.html

wampeh1 commented 7 years ago

I work for the Federal Reserve Board (FRB). My duties include reading data from various sources (including PDF, Excel, XML), processing these data, and recompiling them to produce tables and charts in LaTeX and FAME for publication purposes. I am currently searching for similar tools in R to replicate these processes (including creating tables). Very interested in this topic.

aammd commented 7 years ago

This issue reminds me a little of this silly joke flowchart I made. But perhaps a (slightly) more serious flowchart would be helpful to people?

I also wonder if it would be possible to create some kind of DSL for making tables that works with the pipe operator? Similar to @haozhu233's suggestion to use something like a ggplot2 syntax.

sfirke commented 7 years ago

A grammar of tables w/ modular piping functions, ala ggplot2, would be wonderful. I have been stumbling toward something similar, though in a limited use case (simple one-way and two-way tabulations) - so far I have (on a dev branch):

library(janitor)
mtcars %>%
  crosstab(cyl, am) %>% 
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>% 
  adorn_ns()

#>     cyl          0          1
#> 1     4 27.3%  (3) 72.7%  (8)
#> 2     6 57.1%  (4) 42.9%  (3)
#> 3     8 85.7% (12) 14.3%  (2)
#> 4 Total 59.4% (19) 40.6% (13)

But this is hardly a grammar - just a vote of enthusiasm for going in that direction 😀

gshotwell commented 7 years ago

dplyr::case_when() might be a good model for table formatting. Maybe something like this:

mtcars %>%
  group_by(cyl) %>%
  summarize(
    n = n(),
    price = 10000 * wt,
    percent_wt = wt / sum(wt)
  ) %>%
  format(
    n ~ 'comma',
    price ~ 'euro',
    percent_wt ~ 'percent'
  )

elinw commented 7 years ago

@GShotwell I like that idea a lot; in a strange way it's like the tables package, but closer to normal language. It would be great if there were a standalone version of dplyr's n() (under some other name); I always end up needing n in calculations when creating tables.
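On the n() point: dplyr's n() only works inside verbs like summarise() and mutate(), but count() and tally() cover many of the standalone cases (a sketch):

```r
library(dplyr)

# count() is roughly group_by() + summarise(n = n())
cyl_counts <- mtcars %>% count(cyl)
cyl_counts

# and n is then an ordinary column, available for further
# calculation, e.g. percentages
cyl_counts %>% mutate(pct = n / sum(n))
```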

stefaniebutland commented 7 years ago

"Combining the two issues, we set out to create a guide that could help users navigate package selection, using the case of reproducible tables as a case study."

Repo: https://github.com/ropenscilabs/packagemetrics Blog post: packagemetrics - Helping you choose a package since runconf17