Open hallmeier opened 1 year ago
The issue here is that you are doing two joins at once. While technically this is going to work, as we allow multiple join nodes internally, this is not something we ever guaranteed to work. If you look at the [i, j, ...] documentation, you will notice that there is only one join parameter; hence, there is only one g namespace.
While we could eventually add official support for multiple joins and multiple g namespaces (though it could be pretty cumbersome for users), for the moment I would not recommend doing [i, j, join(...), join(...), ...], because we don't even cover that in our tests.
As a workaround, I would propose splitting your logic into several steps, i.e.
>>> DT = df[g[0] != None, :, join(jf)]
>>> DT[:, [f["A"], f["B"], g["D"].alias("C")], join(jf2)]
   | A      B          C
   | str32  str32  int32
-- + -----  -----  -----
 0 | b      e          3
 1 | b      f          4
 2 | c      f          4
[3 rows x 3 columns]
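To make the data flow of the two steps concrete, here is a plain-Python model of the same logic. It does not use datatable: frames are modelled as lists of dicts, and the input data is made up for illustration, since the original frames are not shown in this thread. Step 1 mirrors the filter on g[0] != None, and step 2 mirrors the second join that takes D as the new C.

```python
# Made-up input data; the original frames are not shown in this thread.
df = [
    {"A": "a", "B": "d", "C": 1},
    {"A": "b", "B": "e", "C": 2},
    {"A": "b", "B": "f", "C": 3},
    {"A": "c", "B": "f", "C": 4},
]
jf_keys = {"b", "c"}      # assumed key column A of jf
jf2 = {"e": 3, "f": 4}    # assumed key column B of jf2, mapped to its column D

# Step 1: keep only the rows of df whose A value has a match in jf.
step1 = [row for row in df if row["A"] in jf_keys]

# Step 2: join the result with jf2 on B and take D as the new C.
# dict.get returns None for unmatched keys, like a missing value in a left join.
step2 = [
    {"A": row["A"], "B": row["B"], "C": jf2.get(row["B"])}
    for row in step1
]

for row in step2:
    print(row)
```

Because each step involves exactly one joined frame, each step only ever needs one g namespace, which is why splitting avoids the problem.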
Okay, thank you for your answer. The documentation of the join parameter says "This parameter may be listed multiple times if you need to join with several frames.", so I thought it was intended functionality. I suggest clarifying this, depending on how you plan to move forward regarding multiple join frames. While this would be cool functionality that has some uses, I understand that considering the technical implications everywhere makes development more complex.
Yes, you are right. But from the signature it is not obvious that one can do multiple joins, and we probably didn't think it through with respect to addressing the other joined frames. I also do not see that we have even one test covering multiple joins.
My feeling is that if we allow multiple joins, we must have a way to address the joined frames' columns. The problem is what the new namespaces should look like: new letters, a list of namespaces, and so on. The options I just listed are not really good from the user perspective, I guess, though the second one could be acceptable to some extent.
@hallmeier do you mind pointing me to the link with the quote you referenced about calling the parameter multiple times?
It's right in the __getitem__ documentation oleksiys linked.
Wow, I like the fact that you can join multiple frames... Keeping track of namespaces might be complex 🤷♂️. At any rate, I don't think it should be deprecated; probably update the docs to say that at the moment only two namespaces are supported, with an example.
Yeah, we definitely need to address this issue at some point, though it is not obvious to me how. The way we are doing it now with f and g is not really flexible when it comes to addressing an arbitrary joined frame.
The "list of namespaces" idea sounds good to me. f and g are pretty standard, so we shouldn't mess with them. But h could hold a list of namespaces for joined frames after the first one.
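The list-based idea can be sketched in plain Python. Everything here is hypothetical: the ColumnNamespace class, the build_namespaces helper, and the h list are illustrative names, not part of datatable, and frames are modelled as plain dicts of columns with made-up data.

```python
class ColumnNamespace:
    """Hypothetical stand-in for f or g: resolves column names in one frame."""

    def __init__(self, frame):
        self._frame = frame  # a frame is modelled as {column name: list of values}

    def __getitem__(self, column):
        return self._frame[column]


def build_namespaces(main_frame, *joined_frames):
    """Return (f, g, h) for a main frame and any number of joined frames.

    f addresses the main frame, g the first joined frame (None if there is
    none), and h is a list with one namespace per additional joined frame.
    """
    f = ColumnNamespace(main_frame)
    g = ColumnNamespace(joined_frames[0]) if joined_frames else None
    h = [ColumnNamespace(frame) for frame in joined_frames[1:]]
    return f, g, h


# Made-up data, used only to exercise the sketch.
df = {"A": ["b", "c"], "C": [1, 2]}
jf = {"A": ["b", "c"]}
jf2 = {"B": ["e", "f"], "D": [3, 4]}

f, g, h = build_namespaces(df, jf, jf2)
print(h[0]["D"])  # [3, 4]
```

With this layout, f and g keep their current meaning and only frames after the first joined one need a positional index, which is also the weak point: the user has to remember the order in which the frames were joined.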
An alternative idea is to have a dict of namespaces populated by keyword arguments of join. Normally, it is empty, but if you pass a frame to join() as a keyword argument, you can retrieve its namespace by this name. If the join() API should stay extensible for future keyword arguments, named frames could be passed as a dict in the first argument.
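A rough sketch of both variants of the dict idea, in plain Python. None of this is datatable's API: Frame, join_kwargs, and join_dict_first are hypothetical names, the data is made up, and for brevity a frame doubles as its own namespace.

```python
class Frame:
    """Minimal stand-in for a frame: a mapping of column names to values."""

    def __init__(self, **columns):
        self._columns = columns

    def __getitem__(self, name):
        return self._columns[name]


def join_kwargs(*frames, **named_frames):
    """Hypothetical join(): frames passed by keyword become addressable.

    Returns the dict of namespaces. Positional frames get no entry, so the
    dict stays empty unless a keyword is used.
    """
    return dict(named_frames)


def join_dict_first(frames, **options):
    """Hypothetical variant that keeps keywords free for future options.

    Named frames arrive as a dict in the first argument; a bare frame
    yields an empty dict of namespaces.
    """
    return dict(frames) if isinstance(frames, dict) else {}


# Made-up data, used only to exercise the sketch.
jf = Frame(A=["b", "c"])
jf2 = Frame(B=["e", "f"], D=[3, 4])

by_keyword = join_kwargs(jf, second=jf2)
by_dict = join_dict_first({"second": jf2})

print(by_keyword["second"]["D"])  # [3, 4]
print(join_kwargs(jf))            # {}
print(by_dict["second"]["D"])     # [3, 4]
```

The second variant trades a slightly heavier call for the freedom to add real keyword options to join() later without clashing with user-chosen frame names.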
Yes, probably a dict is better, because it is complicated to keep track of the joined frames once there is more than one.
I want to index df on column A with jf and then join with jf2 to update column C with jf2's column D (also naming it C wouldn't help here).
So after updating, df would be:
Joining works perfectly:
But I can't update in the same step because column D cannot be accessed. I'd like to do something like this:
The columns of df are in f and the columns of jf are in g, but the columns of jf2 cannot be accessed in the j-statement.
While this is a feature request, I'd also appreciate good ideas for workarounds.