Open samukweku opened 4 years ago
I am working on this issue, kindly assign it to me, will be creating PR in few days.
I am working with R. I want to contribute to this task.
@aniket328 I've assigned this issue to you.
@danijak Please coordinate with @aniket328 who seems to work on this issue already.
Thanks you both for your contributions!
Sure @danijak, let's collaborate. :)
Hey @aniket328, @danijak -- are you coming from GSoC by any chance?
@st-pasha, I am GSoc'2021 Aspirant. I am looking to contribute in this issue
Yes @st-pasha , I will be submitting a proposal for working in the R-Project for GSoC 2021
@aniket328 I've assigned this issue to you.
@danijak Please coordinate with @aniket328 who seems to work on this issue already. Thanks you both for your contributions!
Thanks @oleksiyskononenko, I will check other issues as well. If there are any urgent issues to be resolved please do tell me.
Hey @st-pasha @samukweku @oleksiyskononenko I would love to work on this issue for the summer of 2021. I've experience with the data.table R package and have myself used melt extensively. This is the link to the easy tests demonstrating my familiarity with the data.table R package.
Dear @theadityasam, it seems that there are already two people working on it — see comments above.
Will try my hands on the issue in my own fork meanwhile since I don't see any development yet.
@theadityasam you're welcome. I don't see any developments either.
Hi @danijak, @aniket328 -- please let us know if you guys still working on this project?
@st-pasha I am not working on this project for now.
Since no one is working on this, I'll be working on implementing the reshape functionalities in the coming weeks.
Hey everyone, in what order should the melt be performed? In this discussion, @st-pasha had mentioned that the R data.table takes the column-wise approach. Should that be followed over here too?
Hi @theadityasam. Since there is no clearly preferred alternative there, the best approach is to have a parameter in melt()
parameter that would control whether to do it one way or the other.
The other question then becomes what should be the default way for that parameter. As mentioned in #2677, different packages have implemented it differently, and so we can't really rely on their wisdom here. Surely, the fact that both pandas and R data.table
have implemented it column-wise is a strong indicator, but the fact that more mature and professional packages such as SQL, SAS, or Stata prefer row-wise approach is also a strong indicator.
So, going forward we can do two things: 1) find the issue/PR in pandas/data.table where those features were implemented, and see whether there was a thorough discussion of the merits of different approaches; 2) ask for an opinion on relevant discussion forums, such as StackOverflow, Reddit, etc.
Understood, the defaults can be discussed later on. Quick question, I've read the docs and came across "FExpr". In R, the transfer of data from R to C++ and vice-versa is handled through SEXP pointers. Is FExpr analogous to that? From what I understood after reading the API reference, for creating the melt feature, I need to inherit FExpre base class in the newly created melt class, add the melt code and finally and FExpr type will be returned? All the conversions between Python and C++ datatypes will be handled by this mechanism? Also, @st-pasha will you be available to mentor for this project for GSOC'21? It would be really helpful to work on this project over a rigid timeline under the guidance of mentors.
AFAIK, R's SEXP
corresponds to python's PyObject*
(which we wrap into a py::oobj
for automatic reference counting and more convenient API).
The FExpr
class, however, is more specialized: it serves as a stored expression tree. In R the arguments to functions are always passed as unevaluated expression trees, so there is no need to disambiguate.
When it comes to returning values to python, we need to wrap C++ classes/primitives into the appropriate wrapper objects. For example, in order to return an integer, we'd write return py::oint(1)
, a string: return py::ostring("ok")
, etc. (The "o" in the name indicates that the object owns its pointer). Returning a Frame
works the same: return py::Frame::oframe(new DataTable(...))
(note that C++ equivalent to python's Frame
is called DataTable
, for historical reasons).
If you ever get around to implementing wide-to-long data reshape function (melt) then it would be great if you could support something like the new measure() function which was recently merged into R data.table, https://github.com/Rdatatable/data.table/pull/4731
> DT <- data.table(id=1, a_1=10, b_2=21, a_2=20)
> melt(DT, measure.vars=measure(letter, number, pattern="(.*)_(.*)"))
id letter number value
<num> <char> <char> <num>
1: 1 a 1 10
2: 1 b 2 21
3: 1 a 2 20
> melt(DT, measure.vars=measure(letter, number=as.integer, pattern="(.*)_(.*)"))
id letter number value
<num> <char> <int> <num>
1: 1 a 1 10
2: 1 b 2 21
3: 1 a 2 20
> melt(DT, measure.vars=measure(value.name, number=as.integer, pattern="(.*)_(.*)"))
id number a b
<num> <int> <num> <num>
1: 1 1 10 NA
2: 1 2 20 21
There seems to be no equivalent in pandas, so implementing this feature could give datatable an advantage. The closest analog is pandas.wide_to_long but it is not as flexible, see https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-by-melt
@tdhock not in pandas but there is sth similar in pyjanitor which builds off ideas in R's pivot_longer
and data.table
.
However, I would like to see the implementation that @st-pasha has in mind, where melt
serves as the building block for more transformations.
At the moment, datatable
has sparse support for string operations ... I'm not sure how patterns
will be implemented.
Waiting patiently to see the implementation, with arguments that python users can grok easily
Hey @tdhock @samukweku @st-pasha I'll try to implement this function however I would require some guidance on how things work in datatable. Can we schedule a 30 min call wherein someone can help me with the workings of the python package, some sort of a knowledge transfer? That would make it incredibly simple for me and I would be able to work on it faster.
Did dcast
get implemented?
@aaelony-aeg no it hasn't been implemented yet, still on the roadmap
melt/dcast
inrdatatable
ormelt/pivot.pd.wide_to_long
inpandas
)Example :
Transform from wide to long :
A variable value
0 a B 1
1 b B 3
2 c B 5
3 a C 2
4 b C 4
5 c C 6
Transform from long to wide :
A B C
0 a 1 2
1 b 3 4
2 c 5 6
It would also be beneficial if there was flexibility in the reshaping, something similar to
patterns
inrdatatable
that allows for regular expressions.I currently hardcode the reshaping process, by building a list of dataframes, then
rbind
orcbind
as the case may be.Note :
rbind
does not accept a generator expression, only lists.dt.rbind([df[:,f[0].extend({"var":name, "val":f[name]})] for name in df.names[1:]])
Thanks!