queryverse / Query.jl

Query almost anything in julia
Other
394 stars 49 forks source link

Adding functions that transform a table longer and wider #319

Open tnederlof opened 4 years ago

tnederlof commented 4 years ago

A pretty common operation with tabular data is to make data longer and make data wide. These type of operations are often referred to as pivots, stack/unstack, spread/gather, melt/cast, etc. I wanted to link to different implementations of this idea both in Julia and outside of Julia, link to an attempt @davidanthoff has already started, and then start the design of a long-lasting implementation.

My knowledge of the inner workings of the queryverse is pretty limited and with all of the macro use, I expect it will take me a bit to get comfortable but I welcome the challenge. While some design decisions must be thought of within the context of the code, some high-level feature design can probably be discussed before laying down the implementation code.

David and I had a brief conversation about this here: https://discourse.julialang.org/t/stack-and-unstack-in-query-jl/45381/4

Some reference links

DataFrames.jl stack - http://juliadata.github.io/DataFrames.jl/stable/lib/functions/#DataFrames.stack unstack - http://juliadata.github.io/DataFrames.jl/stable/lib/functions/#DataFrames.unstack

reshape2 (R) melt - https://www.rdocumentation.org/packages/reshape2/versions/1.4.4/topics/melt cast/dcast - https://www.rdocumentation.org/packages/reshape2/versions/1.4.4/topics/cast

tidyr pre v1.0 (R) gather - https://tidyr.tidyverse.org/reference/gather.html spread - https://tidyr.tidyverse.org/reference/spread.html

tidyr v1.0 (R) pivot_longer - https://tidyr.tidyverse.org/reference/pivot_longer.html pivot_wider - https://tidyr.tidyverse.org/reference/pivot_wider.html

pandas (python) melt/wide_to_long - https://pandas.pydata.org/docs/reference/api/pandas.melt.html & https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html pivot - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot.html

Common Features

making data "longer"

making data "wider"

The above is a big simplification as most implementations of these functions do "extra" things including...

Of this list above I imagine the decision whether or not to handle multiple transformations at once in the query.jl implementation will be non-trivial. In order to handle multiple transformations, one has to set rules on how to take column names and parse them into discreet parts (each of which becomes a new identifier column) for the wide to long case. In the long to wide case you are usually selecting an identifier variable column and multiple observation columns, the combination of which become new wide columns. Tidyr lays this out nicely so there is plenty to look at when it comes to these cases but it definitely adds complexity and additional arguments (separators between newly created columns, prefixes, pattern matching, etc.)

Latest Thinking

Hadley (creator of tidyr) has thought a lot about this subject and has written numerous posts over the years. He recently wrote this https://www.tidyverse.org/blog/2019/09/tidyr-1-0-0/ and gave a talk about this subject https://www.youtube.com/watch?v=D48JHU4llkk. His main point is all of these confusing names like spread, gather, melt, cast, pivot, etc make it really hard for the average user to remember what they do without using documentation, and even then it can be confusing. He has settled on pivot_longer (result has less columns and more rows) and pivot_wider (result has more columns and less rows). I personally love using _longer and _wider in the name as it makes it clear which direction you are moving in, however, there are probably different prefixes that could all work.

@davidanthoff has created a gather (wide to long transformation) function (https://github.com/queryverse/QueryOperators.jl/blob/master/src/enumerable/enumerable_gather.jl) which will be a big help for me to get up to speed on the inner workings.

This was a long post so please add/remove/correct/discuss what I wrote above so we can understand what has come before and then design a well-thought set of transformation functions.