snap-stanford / relbench

RelBench: Relational Deep Learning Benchmark
https://relbench.stanford.edu
MIT License

Remove Pandas dependency #10

Closed rishabh-ranjan closed 10 months ago

rishabh-ranjan commented 10 months ago

Currently Pandas is an unnecessary middleman. It provides no functionality other than storing tables as pd.DataFrame.

It is also very slow for processing and manipulating data, and it adds unnecessary complexity: we have to maintain pandas dtypes (like datetime64[ns] for time_col) in addition to PyF stypes, Python datatypes, and pyarrow/parquet schema datatypes.

PyF works with dataframes, but we can do that conversion in the to_pyf util itself.

I suggest using PyArrow Tables directly everywhere. Alternatively, we could keep simple Python objects (dict of lists) in the class attributes and use PyArrow to manipulate/store data inside the function implementations.

@rusty1s @kexinhuang12345 thoughts?

I am thinking about this now because I am preparing the Amazon dataset for further use. It's a big dataset (>200M reviews), and Pandas is proving to be a huge bottleneck.

rishabh-ranjan commented 10 months ago

Having talked to @kexinhuang12345, the approach I am taking is to use a smaller subset of the Amazon dataset (limited to the Amazon Fashion category, with ~900k reviews) and to leave Pandas as it is in the code, to avoid having to rewrite a lot.

rusty1s commented 10 months ago

2 cents: I would avoid storing this in a custom dict of lists and keep the pandas dependency. Note that pandas>=2 also supports a pyarrow backend. Oftentimes, if you feel that pandas is slow, there usually exists an alternative and more efficient way to write things :)