h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0
1.81k stars 155 forks source link

Feature Request : Reshape Data within DataTable #2499

Open samukweku opened 4 years ago

samukweku commented 4 years ago

Example :

df = dt.Frame({"A":['a','b','c'], "B":[1,3,5],"C":[2,4,6]})

Transform from wide to long :

# new shape:

A variable value 0 a B 1 1 b B 3 2 c B 5 3 a C 2 4 b C 4 5 c C 6

Transform from long to wide : A B C 0 a 1 2 1 b 3 4 2 c 5 6

It would also be beneficial if there was flexibility in the reshaping, something similar to patterns in rdatatable that allows for regular expressions.

I currently hardcode the reshaping process, by building a list of dataframes, then rbind or cbind as the case may be.

Note : rbind does not accept a generator expression, only lists. dt.rbind([df[:,f[0].extend({"var":name, "val":f[name]})] for name in df.names[1:]])

Thanks!

aniket328 commented 3 years ago

I am working on this issue, kindly assign it to me, will be creating PR in few days.

danijak commented 3 years ago

I am working with R. I want to contribute to this task.

oleksiyskononenko commented 3 years ago

@aniket328 I've assigned this issue to you.

@danijak Please coordinate with @aniket328 who seems to work on this issue already.

Thanks you both for your contributions!

aniket328 commented 3 years ago

Sure @danijak, let's collaborate. :)

st-pasha commented 3 years ago

Hey @aniket328, @danijak -- are you coming from GSoC by any chance?

danijak commented 3 years ago

@st-pasha, I am GSoc'2021 Aspirant. I am looking to contribute in this issue

aniket328 commented 3 years ago

Yes @st-pasha , I will be submitting a proposal for working in the R-Project for GSoC 2021

danijak commented 3 years ago

@aniket328 I've assigned this issue to you.

@danijak Please coordinate with @aniket328 who seems to work on this issue already. Thanks you both for your contributions!

Thanks @oleksiyskononenko, I will check other issues as well. If there are any urgent issues to be resolved please do tell me.

theadityasam commented 3 years ago

Hey @st-pasha @samukweku @oleksiyskononenko I would love to work on this issue for the summer of 2021. I've experience with the data.table R package and have myself used melt extensively. This is the link to the easy tests demonstrating my familiarity with the data.table R package.

oleksiyskononenko commented 3 years ago

Dear @theadityasam, it seems that there are already two people working on it — see comments above.

theadityasam commented 3 years ago

Will try my hands on the issue in my own fork meanwhile since I don't see any development yet.

oleksiyskononenko commented 3 years ago

@theadityasam you're welcome. I don't see any developments either.

st-pasha commented 3 years ago

Hi @danijak, @aniket328 -- please let us know if you guys still working on this project?

aniket328 commented 3 years ago

@st-pasha I am not working on this project for now.

theadityasam commented 3 years ago

Since no one is working on this, I'll be working on implementing the reshape functionalities in the coming weeks.

theadityasam commented 3 years ago

Hey everyone, in what order should the melt be performed? In this discussion, @st-pasha had mentioned that the R data.table takes the column-wise approach. Should that be followed over here too?

st-pasha commented 3 years ago

Hi @theadityasam. Since there is no clearly preferred alternative there, the best approach is to have a parameter in melt() parameter that would control whether to do it one way or the other.

The other question then becomes what should be the default way for that parameter. As mentioned in #2677, different packages have implemented it differently, and so we can't really rely on their wisdom here. Surely, the fact that both pandas and R data.table have implemented it column-wise is a strong indicator, but the fact that more mature and professional packages such as SQL, SAS, or Stata prefer row-wise approach is also a strong indicator.

So, going forward we can do two things: 1) find the issue/PR in pandas/data.table where those features were implemented, and see whether there was a thorough discussion of the merits of different approaches; 2) ask for an opinion on relevant discussion forums, such as StackOverflow, Reddit, etc.

theadityasam commented 3 years ago

Understood, the defaults can be discussed later on. Quick question, I've read the docs and came across "FExpr". In R, the transfer of data from R to C++ and vice-versa is handled through SEXP pointers. Is FExpr analogous to that? From what I understood after reading the API reference, for creating the melt feature, I need to inherit FExpre base class in the newly created melt class, add the melt code and finally and FExpr type will be returned? All the conversions between Python and C++ datatypes will be handled by this mechanism? Also, @st-pasha will you be available to mentor for this project for GSOC'21? It would be really helpful to work on this project over a rigid timeline under the guidance of mentors.

st-pasha commented 3 years ago

AFAIK, R's SEXP corresponds to python's PyObject* (which we wrap into a py::oobj for automatic reference counting and more convenient API). The FExpr class, however, is more specialized: it serves as a stored expression tree. In R the arguments to functions are always passed as unevaluated expression trees, so there is no need to disambiguate.

When it comes to returning values to python, we need to wrap C++ classes/primitives into the appropriate wrapper objects. For example, in order to return an integer, we'd write return py::oint(1), a string: return py::ostring("ok"), etc. (The "o" in the name indicates that the object owns its pointer). Returning a Frame works the same: return py::Frame::oframe(new DataTable(...)) (note that C++ equivalent to python's Frame is called DataTable, for historical reasons).

tdhock commented 3 years ago

If you ever get around to implementing wide-to-long data reshape function (melt) then it would be great if you could support something like the new measure() function which was recently merged into R data.table, https://github.com/Rdatatable/data.table/pull/4731

> DT <- data.table(id=1, a_1=10, b_2=21, a_2=20)
> melt(DT, measure.vars=measure(letter, number, pattern="(.*)_(.*)"))
      id letter number value
   <num> <char> <char> <num>
1:     1      a      1    10
2:     1      b      2    21
3:     1      a      2    20
> melt(DT, measure.vars=measure(letter, number=as.integer, pattern="(.*)_(.*)"))
      id letter number value
   <num> <char>  <int> <num>
1:     1      a      1    10
2:     1      b      2    21
3:     1      a      2    20
> melt(DT, measure.vars=measure(value.name, number=as.integer, pattern="(.*)_(.*)"))
      id number     a     b
   <num>  <int> <num> <num>
1:     1      1    10    NA
2:     1      2    20    21

There seems to be no equivalent in pandas, so implementing this feature could give datatable an advantage. The closest analog is pandas.wide_to_long but it is not as flexible, see https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-by-melt

samukweku commented 3 years ago

@tdhock not in pandas but there is sth similar in pyjanitor which builds off ideas in R's pivot_longer and data.table.

However, I would like to see the implementation that @st-pasha has in mind, where melt serves as the building block for more transformations.

At the moment, datatable has sparse support for string operations ... I'm not sure how patterns will be implemented.

Waiting patiently to see the implementation, with arguments that python users can grok easily

theadityasam commented 3 years ago

Hey @tdhock @samukweku @st-pasha I'll try to implement this function however I would require some guidance on how things work in datatable. Can we schedule a 30 min call wherein someone can help me with the workings of the python package, some sort of a knowledge transfer? That would make it incredibly simple for me and I would be able to work on it faster.

aaelony-aeg commented 1 year ago

Did dcast get implemented?

samukweku commented 1 year ago

@aaelony-aeg no it hasn't been implemented yet, still on the roadmap