narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/
MIT License
592 stars 89 forks source link

fix: add cuDF specific implementation for join how="anti" #1041

Closed LiamConnors closed 1 month ago

LiamConnors commented 1 month ago

What type of PR is this? (check all applicable)

Related issues

Checklist

If you have comments or can explain your changes, please do so below.

Right now these tests fail with NotImplementedError: Only indicator=False is currently supported So the current implementation of df.join how="anti" won't work.

https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.dataframe.merge/

Seemed like best to mark it as expected to fail for now. @MarcoGorelli what do you think would make sense here. Maybe a note in the docs and a follow up issue to see if we can implement it in another way for cuDF?

LiamConnors commented 1 month ago

Thanks @FBruzzesi! It does sound like it might be what we want. I'll try it out.

LiamConnors commented 1 month ago

Thanks @FBruzzesi! This has the tests passing now. I was wondering about this here: https://github.com/narwhals-dev/narwhals/pull/1041/files#diff-1bb9576cb79de69b2e45bcbc828669f54a6049dbdb76407986b3439ba309369dR481-R487 What situation would cause us to create extra columns in the join? Thanks for your feedback!

FBruzzesi commented 1 month ago

Thanks @FBruzzesi! This has the tests passing now.

Thanks! That's awesome 🚀

I was wondering about this here: https://github.com/narwhals-dev/narwhals/pull/1041/files#diff-1bb9576cb79de69b2e45bcbc828669f54a6049dbdb76407986b3439ba309369dR481-R487 What situation would cause us to create extra columns in the join? Thanks for your feedback!

In polars anti join does not add columns in other that match the keys (as it would not know where to add those). However, since the way we replicate that in pandas is via a outer join, columns would be added. To avoid such behavior we are considering solely the column in the keys for other. Does this help?

LiamConnors commented 1 month ago

Thanks @FBruzzesi! This has the tests passing now.

Thanks! That's awesome 🚀

I was wondering about this here: https://github.com/narwhals-dev/narwhals/pull/1041/files#diff-1bb9576cb79de69b2e45bcbc828669f54a6049dbdb76407986b3439ba309369dR481-R487 What situation would cause us to create extra columns in the join? Thanks for your feedback!

In polars anti join does not add columns in other that match the keys (as it would not know where to add those). However, since the way we replicate that in pandas is via a outer join, columns would be added. To avoid such behavior we are considering solely the column in the keys for other. Does this help?

I think so. Does that mean it might technically be possible to rework the cuDF implementation to not use it? Just passing the original other data frame to the cuDF constructor didn't work, so I was curious.

FBruzzesi commented 1 month ago

I think so. Does that mean it might technically be possible to rework the cuDF implementation to not use it? Just passing the original other data frame to the cuDF constructor didn't work, so I was curious.

From the cudf docs:

leftsemi : similar to inner join, but only returns columns from the left dataframe and ignores all columns from the right dataframe.
leftanti : returns only rows columns from the left dataframe for non-matched records. This is exact opposite to leftsemi join.

I would deduce that it is taken care of within the method itself, so possibly the step of slicing the columns and dropping duplicates is relevant for non CUDF implementation only

LiamConnors commented 1 month ago

I think so. Does that mean it might technically be possible to rework the cuDF implementation to not use it? Just passing the original other data frame to the cuDF constructor didn't work, so I was curious.

From the cudf docs:


leftsemi : similar to inner join, but only returns columns from the left dataframe and ignores all columns from the right dataframe.

leftanti : returns only rows columns from the left dataframe for non-matched records. This is exact opposite to leftsemi join.

I would deduce that it is taken care of within the method itself, so possibly the step of slicing the columns and dropping duplicates is relevant for non CUDF implementation only

Ah ok I see. Thanks for your help. I'll take another look at that. Sorry. That was my misunderstanding @FBruzzesi