h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0

Add a way to refer to multiple joined frames' columns #3421

Open hallmeier opened 1 year ago

hallmeier commented 1 year ago

I want to restrict df to the rows whose column A matches jf, and then join with jf2 to update column C with jf2's column D (naming that column C as well wouldn't help here).

from datatable import dt, f, join, g

df = dt.Frame("""A B C
                 a e 0
                 b e 0
                 b f 0
                 c f 0
                 d f 2""")

jf = dt.Frame("""A
                 b
                 c""")
jf.key = "A"

jf2 = dt.Frame("""B D
                  e 3
                  f 4""")
jf2.key = "B"

So after updating df would be:

df_desired = dt.Frame("""A B C
                         a e 0
                         b e 3
                         b f 4
                         c f 4
                         d f 2""")

Joining works perfectly:

df[g[0] != None, :, join(jf), join(jf2)]
#    | A      B          C      D
#    | str32  str32  int32  int32
# -- + -----  -----  -----  -----
#  0 | b      e          0      3
#  1 | b      f          0      4
#  2 | c      f          0      4
# [3 rows x 4 columns]

But I can't update in the same step, because column D cannot be accessed. I'd like to do something like this:

df[g[0] != None, dt.update(C=g["D"]), join(jf), join(jf2)]
# datatable.exceptions.KeyError: Column D does not exist in the Frame; did you mean A?

The columns of df are in f and the columns of jf are in g, but the columns of jf2 cannot be accessed in the j-statement.

While this is a feature request, I'd also appreciate good ideas for workarounds.
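To pin down the semantics being requested, here is a minimal plain-Python model of the update. It deliberately uses no datatable calls; the names `df_rows`, `jf_keys`, `jf2_lookup` and `update_c` are made up for illustration. A row's C is replaced by D only when the row matches both jf (on A) and jf2 (on B); every other row keeps its value.

```python
# Plain-Python model of the requested update; no datatable calls are used.
# All names below are illustrative, not part of any library.
df_rows = [
    {"A": "a", "B": "e", "C": 0},
    {"A": "b", "B": "e", "C": 0},
    {"A": "b", "B": "f", "C": 0},
    {"A": "c", "B": "f", "C": 0},
    {"A": "d", "B": "f", "C": 2},
]
jf_keys = {"b", "c"}           # the keyed column A of jf
jf2_lookup = {"e": 3, "f": 4}  # key B -> value D in jf2


def update_c(rows, keys, lookup):
    """Return a copy of rows where C is replaced by D for rows matched by both joins."""
    updated = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        if row["A"] in keys and row["B"] in lookup:
            new_row["C"] = lookup[row["B"]]
        updated.append(new_row)
    return updated


result = update_c(df_rows, jf_keys, jf2_lookup)
```

Rows with A equal to a or d are not in jf, so they pass through unchanged.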

oleksiyskononenko commented 1 year ago

The issue here is that you are doing two joins at once. While this technically works, because we allow multiple join nodes internally, it is not something we ever guaranteed to work. If you look at the [i, j, ...] documentation, you will notice that there is only one join parameter and, hence, only one g namespace.

While we could eventually add official support for multiple joins and multiple g namespaces (though it could be pretty cumbersome for users), for the moment I would not recommend doing [i, j, join(...), join(...), ...], because we don't even cover that in our tests.

As a workaround, I would propose to split your logic into several steps, i.e.

>>> DT = df[g[0] != None, :, join(jf)]
>>> DT[:, [f["A"], f["B"], g["D"].alias("C")], join(jf2)]
   | A      B          C
   | str32  str32  int32
-- + -----  -----  -----
 0 | b      e          3
 1 | b      f          4
 2 | c      f          4
[3 rows x 3 columns]

hallmeier commented 1 year ago

Okay, thank you for your answer. In the documentation of the join parameter it says "This parameter may be listed multiple times if you need to join with several frames.", so I thought it was intended functionality. I propose you clarify this a bit more, depending on how you plan to move forward regarding multiple join frames. While this would be cool functionality that has some uses, I understand that considering the technical implications everywhere makes development more complex.

oleksiyskononenko commented 1 year ago

Yes, you are right. But from the signature it is not obvious that one can do multiple joins, and we probably didn't think it through with respect to addressing the other joined frames. I also don't see even one test that covers the multiple-join functionality.

My feeling is that if we allow multiple joins, we must have a way to address each joined frame's columns. The problem is what the new namespaces should look like:

- new letters;
- a list of namespaces;
- …

The options I just listed are not really good from the user's perspective, I guess, though the second one could be acceptable to some extent.

samukweku commented 1 year ago

@hallmeier do you mind pointing me to the link with the quote you referenced about calling the parameter multiple times?

hallmeier commented 1 year ago

It's right in the __getitem__ documentation oleksiys linked

samukweku commented 1 year ago

Wow, I like the fact that you can join multiple frames... Keeping track of namespaces might be complex 🤷‍♂️. At any rate, I don't think it should be deprecated; probably update the docs to say that at the moment only two namespaces are supported, with an example.

oleksiyskononenko commented 1 year ago

Yeah, we definitely need to address this issue at some point, though it is not obvious to me how. The way we are doing it now with f and g is not really flexible when it comes to addressing an arbitrary joined frame.

hallmeier commented 1 year ago

The "list of namespaces" idea sounds good to me. f and g are pretty standard, so we shouldn't mess with them. But h could hold a list of namespaces for joined frames after the first one.
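A rough plain-Python sketch of how such a list could behave, purely to make the idea concrete. Nothing here is datatable's API: the `Namespace` class, the `resolve` helper and the name `h` are hypothetical, and frames are modelled as dicts of already-aligned columns.

```python
# Hypothetical model of a "list of namespaces" design; not datatable's API.
class Namespace:
    """Refers to columns of the frame at a given position in the join chain."""

    def __init__(self, frame_index):
        self.frame_index = frame_index

    def __getitem__(self, column):
        # In this model a column reference is just (frame position, column name).
        return (self.frame_index, column)


f = Namespace(0)                   # the main frame
g = Namespace(1)                   # the first joined frame
h = [Namespace(2), Namespace(3)]   # hypothetical: namespaces of further joined frames


def resolve(ref, frames):
    """Look up a column reference in a list of frames (dicts of columns)."""
    frame_index, column = ref
    return frames[frame_index][column]


# Columns as they would look after both joins, aligned row by row.
frames = [
    {"A": ["b", "b", "c"], "B": ["e", "f", "f"], "C": [0, 0, 0]},  # rows of df matched by jf
    {"A": ["b", "b", "c"]},                                         # jf
    {"D": [3, 4, 4]},                                               # jf2
]
new_c = resolve(h[0]["D"], frames)  # jf2's column D, reached through h[0]
```

With this layout, f and g keep their current meaning, and only the second and later joined frames go through h.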

An alternative idea is to have a dict of namespaces populated by keyword arguments of join. Normally, it is empty, but if you pass a frame to join() as a keyword argument, you can retrieve its namespace by this name. If the join() API should stay extensible for future keyword arguments, named frames could be passed as a dict in the first argument.
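The named variant could be modelled like this, again in plain Python and only as a sketch. The `join` below is a stand-in written for this example, not datatable's `join`, and `named_namespaces` and the name `second` are invented for illustration.

```python
# Hypothetical model of named join namespaces; not datatable's API.
def join(*frames, **named_frames):
    """Collect joined frames; keyword-named ones become addressable by name."""
    return {"positional": list(frames), "named": dict(named_frames)}


def named_namespaces(*join_nodes):
    """Merge the named frames of all join nodes into one name -> frame dict."""
    namespaces = {}
    for node in join_nodes:
        for name, frame in node["named"].items():
            if name in namespaces:
                raise ValueError(f"duplicate namespace name: {name}")
            namespaces[name] = frame
    return namespaces


jf = {"A": ["b", "c"]}
jf2 = {"B": ["e", "f"], "D": [3, 4]}

# jf is joined positionally (it would stay in g); jf2 is joined under a name.
ns = named_namespaces(join(jf), join(second=jf2))
d_column = ns["second"]["D"]
```

When no frame is passed by keyword, the dict stays empty, which matches the "normally, it is empty" behaviour described above.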

oleksiyskononenko commented 1 year ago

Yes, a dict is probably better, because it is complicated to keep track of the joined frames once there is more than one.