spgarbet / tangram

Table Grammar package for R

What happened to LaTeX? #1

Closed kkmann closed 7 years ago

kkmann commented 7 years ago

Hi,

Great project! If this works out well, it might become more important than ggplot, considering that there are quite a few empirical papers without graphs but few without tables ;) So, what are the plans concerning LaTeX output? My use case is the usual *.Rmd -> .pdf workflow via RStudio, and I saw a mention of a latex() method in the docs but not in the code.

Are there any plans to deepen the documentation or publish this at JSS? I tried to understand the concept from the source files but did not quite get it. Will it be possible to define custom types via something like X1::MyType[param1, param2, ...] in the formula and then define the behavior for MyType? Otherwise defining custom tables could become quite nasty if you want all numerics in a specific format with, say, median and IQR.

Will it be possible to define the layout for each row/col intersection? I am thinking of a table with many statistics and many strata - it might be wise to layout the statistics vertically to save horizontal space. In extreme cases it might also be necessary to go the 'ultra-wide' way by aligning the strata vertically as well...

spgarbet commented 7 years ago

Wonderful comments. This package is very early in development. I'm focusing on RTF at present and nearly have that done. RTF is critical for closing the RStudio -> Word loop with clinical collaborators. LaTeX is on the to-do list. It's actually not a lot of work to add LaTeX at this point. However, I've got a refactor coming up on the key middle layers (described below).

I like the idea of publishing at JSS. I will pursue this when the package becomes more mature.

It is entirely possible to create custom types and define processing for them--this is one of the key freedoms in the design of the package. Right now the framework is a bit rigid in terms of internal storage of table elements. I've got a draft I'm working on to relax that, and it looks like I'm going to have to move forward with refactoring all the code. In general, right now you have to define an S3 object to hold each type of output, and that is not sustainable. The rough idea is that instead of storing "F-stat", I would store a named value that allows for Greek letters, then have a container that allows for multiple values. I still need the ideas of fractions and IQR, as these are not easily abstracted further. The point is that the package makes no assumptions about what statistics one puts into a table--which is exactly what you are requesting. My opinions about what constitutes a table summary should not be imposed on a user. The concept of defining processing via type requires defining handling for each possible cross product of types, and the layout has to remain consistent. So yes, it will not only be possible, it will be encouraged. The many-strata case is handled by the "*" operator in a formula. Horizontal and vertical are both easily done. Adding this to the default Hmisc handling is also on the to-do list.
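As a rough illustration of the "named value plus container" storage idea described above, here is a minimal base R sketch. The names cell_value, cell_group and render_text are hypothetical and are not part of tangram; the sketch only shows how S3 dispatch could keep storage generic while rendering stays per-format.

```r
# Hypothetical sketch: these constructors are NOT tangram's API.
# A single named value, e.g. a test statistic or a Greek-lettered estimate.
cell_value <- function(name, value, digits = 3) {
  structure(list(name = name, value = value, digits = digits),
            class = c("cell_value", "cell"))
}

# A container holding several named values in one table cell.
cell_group <- function(...) {
  structure(list(values = list(...)), class = c("cell_group", "cell"))
}

# Rendering dispatches on the cell class, so a new output format needs
# new methods rather than a new storage class per statistic.
render_text <- function(x, ...) UseMethod("render_text")

render_text.cell_value <- function(x, ...) {
  paste0(x$name, "=", format(round(x$value, x$digits), nsmall = x$digits))
}

render_text.cell_group <- function(x, ...) {
  paste(vapply(x$values, render_text, character(1)), collapse = ", ")
}

stat <- cell_group(cell_value("rho", 0.251127819548872),
                   cell_value("P", 0.286))
print(render_text(stat))
```

A LaTeX or RTF renderer would then be another generic with its own methods, leaving the stored cells untouched.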

Do you know of other table outputs besides text, single estimate, an IQR, a fraction and a list of named values that would be useful?

Another item on the to-do list is to provide a simple single-statistic summary, a la SAS table proc, for any function that takes two variables. This also is not too difficult. So many directions to run in to round this out.

The good news is that since this is early in the package development cycle, I'm looking for examples to implement. If you have a use case that would be helpful for you right now, I'm more than willing to fold that onto my to do list and turn it around quickly. Send me munged data and a sketch of what you're looking for and I'll see what I can do. Even better send me data you've published already and I'll fold it in as an example.

kkmann commented 7 years ago

I see, better to get the core up and running before getting into too much dirty work with the presentation!

Okay, so I am getting a little bit off-topic now, but I guess it's not really specific enough for a separate issue. Just a few random ideas from my side.

  1. I have been thinking about implementing a tool for descriptive tables with a more ggplot-type syntax where your geoms would become something like "+ stats_groups(X1, X2, ...)" defining the contents of the table cells (mean, std, median or whatever else you like...). These would be easy for the user to create, as each is essentially just a named list of functions. Strata would be added by something like "+ stratify_by(~G1)", "+ stratify_by(~G1|G2)", etc. This is, of course, different from your grammar, as it does not treat rows and columns alike and assumes that one dimension of the table (the strata/grouping) is always categorical. On the upside, this allows an easier definition of the contents of the table cells, as the user only needs to specify a list of statistics to compute, stratified by the categorical dimension of the table (I cannot think of a practical use case for a continuous second dimension). So, the only thing the user has to define are custom stats() objects (get df and strata -> compute statistic and a suitable p value over strata). These can then be grouped into a custom stats_group() and applied to data. E.g.

     table(data) + my_stats_group(X1, X3, options ...) + other_stats_group(X2) + stratify_by(~ G1|G2 + G3) + layout(stats|var ~ strata)

     I just really like the way ggplot decomposes a plot into "orthogonal" components and that the "+"-notation reflects that. I hope this is somewhat understandable x).

  2. Cell content. Well, it should really be possible to specify any number of statistics and for each of them a "least wrong way" of getting a p value. Besides I was thinking about also allowing graphs in the table to be maximally general (and play along nicely with a continuous "grouping variable"). This would, most certainly, work only by outputting the tables as graphs but would be so cool.
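The "named list of functions applied per stratum" idea in item 1 can be sketched in base R as follows. This is only an illustration under assumed names (my_stats, stats_by_strata) and simulated data; it is neither tangram's interface nor the proposed "+" syntax, and it omits the p value part.

```r
# Hypothetical sketch of the "stats group" idea; none of these names
# come from tangram or ggplot2.
set.seed(1)
d <- data.frame(X1 = rnorm(30), G1 = rep(c("a", "b", "c"), each = 10))

# A stats group is just a named list of functions.
my_stats <- list(
  median = median,
  iqr    = IQR,
  mean   = mean
)

# Apply every statistic to one variable within each stratum.
# The result has one row per statistic and one column per stratum.
stats_by_strata <- function(data, var, strata, stats) {
  groups <- split(data[[var]], data[[strata]])
  sapply(groups, function(v) vapply(stats, function(f) f(v), numeric(1)))
}

tab <- stats_by_strata(d, "X1", "G1", my_stats)
print(round(tab, 2))
```

Here the statistics run vertically and the strata horizontally; t(tab) gives the transposed layout, which is the kind of choice a layout() step would control.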

I am thrilled to see where your project is going - I guess we agree that something of the kind was long overdue ; )

spgarbet commented 7 years ago

Your comments reflect a very clear understanding of what's in the library and its goals. You touch on many things I've struggled with.

The current default Hmisc processor does not assume that one dimension is always categorical (Hmisc::summaryM does, however). Any variable can be cast to categorical. Further, with the type and processing overrides available, it makes zero assumptions in this regard--however, that requires a lot more work on the user's part. Here's an example with the current release:

> summary_table(y ~ x, data.frame(x=rnorm(20), y=rnorm(20)))
============================================ 
   N             y            Test Statistic 

-------------------------------------------- 
x  20  rho=0.251127819548872  S=996, P=0.286 
============================================ 

[Ugh. Note to self: I need to work on the correlation "rho" formatting.]
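The rho and S labels in that output match what base R's cor.test reports for a Spearman rank correlation, so the cell can plausibly be reproduced and formatted along these lines. This is a sketch that assumes Spearman is the underlying test; the seed and the sprintf layout are illustrative choices, so the numbers will differ from the output above.

```r
set.seed(42)
d <- data.frame(x = rnorm(20), y = rnorm(20))

# Spearman's rank correlation: the estimate is named "rho" and the
# test statistic is named "S".
ct <- cor.test(d$x, d$y, method = "spearman")

# Round rho for display instead of printing full double precision.
cell <- sprintf("rho=%.3f  S=%.0f, P=%.3f",
                ct$estimate, ct$statistic, ct$p.value)
print(cell)
```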

On the use of '|' as an operator: I struggled with how to define this cleanly in an algebra. It's really a '*' with limited scope in one sense. '*' has a defined algebra in a ring, whereas '|' doesn't. One could view the '*' with 2 variables x*y as an equivalent of 'x|y'. However, then one can't use the '*' as direct multiplication. This is an issue I've not come up with a clean answer to, and at present I've just left it at '*' since I've not gotten that fully implemented in the Hmisc processor.

On reading your composition example: I like ggplot's orthogonal semantic construction. I don't know what is orthogonal in table construction outside of headers, styling and various subgroups of stat processing. The library separates styling into the rendering layer--this is complicated by the huge number of target formats, each with completely different syntax and available formatting.

The first item in your example is data. The second adds a stats group (X1 and X3). The third is another separate function, which I interpret as adding another group. Your stratification function then refers to them by G1 and G2, which is an implicit naming assumption. Then there is a separate layout function. A data frame or table "+" a table function is doable. Let's assume that the next two functions are the same, "grouping", and we get rid of the implicit naming. This becomes (if I follow):

table(data) + grouping(X1|X3, "G1") + grouping(X2, "G2") + stratify(~ G1+G2) + layout(stats ~strata)

At this point there are three more implicit names invoked: stats, var and strata. I don't know what var or strata means, so I dropped that. It's bad to start using up pieces of the namespace unless there is a very clear benefit. Something along these lines could be done such that it generates the processor input of an abstract syntax tree. I find this a bit more confusing, however, as I can't reason about what the table would look like after all that. Could you create a small example with random data and show what the table would look like for that example?

In the current interface I think it would look like one of the following, which seems far simpler to me:

summary_table(1 ~ X1*X3::Categorical + X2, data)

summary_table(X1*X3::Categorical + X2 ~ 1, data)

The current syntax allows for user defined functions to return data for things like "log(x)". If a function returns table elements these are incorporated into the table without further processing, thus overrides of table generation are allowed directly. This is one thing I made #1 priority through the design--override ability at every layer.

There was another request to do something similar with "+" to compose tables along the lines of cbind and rbind. I think this would be trivial to add, and have "+" at this higher level mean rbind.
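For what it's worth, giving "+" an rbind meaning for a table class is a small S3 exercise in base R. The sketch below uses a hypothetical simple_table class that stores cells in a character matrix; it is not tangram's table object, just an illustration of the operator overload.

```r
# Hypothetical sketch: "simple_table" is an illustrative class, not tangram's.
simple_table <- function(m) structure(list(cells = m), class = "simple_table")

# Define "+" for the class so that adding two tables stacks their rows.
"+.simple_table" <- function(e1, e2) {
  stopifnot(ncol(e1$cells) == ncol(e2$cells))
  simple_table(rbind(e1$cells, e2$cells))
}

top    <- simple_table(matrix(c("x", "20"), nrow = 1,
                              dimnames = list(NULL, c("var", "N"))))
bottom <- simple_table(matrix(c("z", "18"), nrow = 1,
                              dimnames = list(NULL, c("var", "N"))))

both <- top + bottom
print(both$cells)
```

A column-wise counterpart would need a different operator or an explicit function, since "+" can only carry one of the two meanings.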

I have an internal commitment with 2 parties to get RTF finished with some nice NEJM styles. To do this right I'm going to have to redefine some of the cell handling. I would love to be able to add graphs as part of the handling. Unfortunately the only familiarity I have with embedding graphs is in LaTeX.

I think what I should do for the next step is make public my todo list here. Thus folks are free to comment and provide feedback.

spgarbet commented 7 years ago

Since there's no further discussion, I opened Issue #2 for this question.

spgarbet commented 7 years ago

FYI, A working 1st version of LaTeX support is up.