Closed waitingkuo closed 2 years ago
@wenleix some of the test cases failed. The reason:
https://github.com/pytorch/torcharrow/blob/main/torcharrow/velox_rt/dataframe_cpu.py#L2106
def groupby(
    self,
    by: List[str],
    sort=False,
    drop_null=True,
):
The `by` parameter is typed as `List[str]`, but some test cases pass a str instead, e.g.
https://github.com/pytorch/torcharrow/blob/main/torcharrow/test/test_dataframe.py#L920
Should we (1) change `df.groupby("a")` to `df.groupby(["a"])` in the tests, or (2) change the signature to `by: Union[str, List[str]]` and wrap a str as [str] if needed (like what we discussed in #404)?

Ah I see. For `groupby`, it makes sense to support a single string, similar to Pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html. So I guess it's option 2 in your proposal :)
Fixing `_check_columns` triggered some failing test cases for `drop`, `groupby`, and `drop_duplicates`. I extended them to accept a single str as column/subset.
@wenleix squashed
Merged #419 as https://github.com/pytorch/torcharrow/commit/a700ae1da664c82e1abdca63dfa7eeae05952ae9 . Thanks for the contribution!
Several functions use `_check_columns` to check whether every name in the input sequence of str is one of the dataframe's columns.
If we pass the columns as a single str, this function iterates over the string and checks whether each character is in the columns.
e.g. some str inputs raise, but `df.groupby('bb')` didn't.
This PR raises an exception if the input itself is a string.
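To illustrate the problem, here is a simplified stand-in for the check. The real `_check_columns` in torcharrow may differ; the function names, the column names `a` and `b`, and the choice of `TypeError` are all assumptions made for this sketch:

```python
from typing import Iterable, List


def check_columns_buggy(columns: Iterable[str], available: List[str]) -> None:
    # Iterating over a str yields its characters, so "bb" is checked
    # as "b" and "b" rather than as one column name.
    for c in columns:
        if c not in available:
            raise TypeError(f"column {c} not in dataframe's columns")


def check_columns_strict(columns: Iterable[str], available: List[str]) -> None:
    # Reject a bare str up front, before iterating over it.
    if isinstance(columns, str):
        raise TypeError("columns must be a sequence of str, not a single str")
    check_columns_buggy(columns, available)


available = ["a", "b"]
check_columns_buggy("bb", available)  # passes silently: "b" is a column
try:
    check_columns_strict("bb", available)
except TypeError as e:
    print("rejected:", e)
```

The buggy version accepts `"bb"` because each character happens to be a valid column name, which matches the behaviour described above.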