pytorch / torcharrow

High performance model preprocessing library on PyTorch
https://pytorch.org/torcharrow/beta/index.html
BSD 3-Clause "New" or "Revised" License
645 stars 79 forks source link

add proper check _check_columns #419

Closed waitingkuo closed 2 years ago

waitingkuo commented 2 years ago

There're several functions that use _check_columns to check whether the input sequence of str are in dataframe's columns

    def _check_columns(self, columns: Iterable[str]):
          valid_names = {f.name for f in self.dtype.fields}
          for n in columns:
              if n not in valid_names:
                  raise TypeError(f"column {n} not among existing dataframe columns")

If we input the columns as a str, this function will loop the string and verify whether each character is in the columns or not.

e.g.

     import torcharrow as ta

     df = ta.dataframe(
        {"aa": list(range(7)), 
          "b": list(reversed(range(7))), 
           "c": list(range(7))
         })

     df.groupby('aa')

this raise

  TypeError: column a not among existing dataframe columns

But df.groupby('bb') didn't

this pr is to raise exception is the input itself is a string

waitingkuo commented 2 years ago

@wenleix some of the test cases failed reason:

https://github.com/pytorch/torcharrow/blob/main/torcharrow/velox_rt/dataframe_cpu.py#L2106

    def groupby(
        self,
        by: List[str],
        sort=False,
        drop_null=True,
    ):

the by parameters is List[str]

but in some test cases it input str instead, e.g. https://github.com/pytorch/torcharrow/blob/main/torcharrow/test/test_dataframe.py#L920

should we

  1. fix test cases (e.g. change df.groupby("a") to df.groupby(["a"])
  2. accept str as well (e.g. by: Union[str, List[str]) and wrap str as [str] if need (like what we discussed in #404 )
wenleix commented 2 years ago

Ah I see. For groupby, it makes sense to support single string, similar to Pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html. So I guess it's the option 2 in your proposal :)

waitingkuo commented 2 years ago

fixing _check_columns trigger some failed test cases for drop, `groupby, and drop_duplicates. i extended them to accept single str as column/subset

404 is enabled in this pr

waitingkuo commented 2 years ago

@wenleix squashed

wenleix commented 2 years ago

Merged #419 as https://github.com/pytorch/torcharrow/commit/a700ae1da664c82e1abdca63dfa7eeae05952ae9 . Thanks for the contribution!