m3dev / gokart

Gokart solves reproducibility, task dependencies, constraints of good code, and ease of use for Machine Learning Pipeline.
https://gokart.readthedocs.io/en/latest/
MIT License

Load DataFrame cache with backward compatibility #381

Closed by ujiuji1259 4 months ago

ujiuji1259 commented 4 months ago

I have updated PickleFileProcessor so that it can load DataFrames dumped with older pandas versions.

Background

DataFrames pickled by one pandas version are not guaranteed to be loadable by a later one, so TargetOnKart cannot load a DataFrame cache that was dumped with an older pandas version. For example, TargetOnKart running on pandas 2.1.0 cannot load a large DataFrame dumped with pandas 1.5.3 (error: ModuleNotFoundError: No module named 'pandas.core.indexes.numeric').
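To illustrate why the load fails, here is a minimal standard-library sketch of the mechanism. It does not use pandas: `legacy_lib_indexes` and `NumericIndex` are hypothetical stand-ins for a module path that exists when the cache is written and is gone when it is read. The sketch relies on pickle storing a class as its module path plus its name, so the load has to import that path again.

```python
import pickle
import sys
import types

# Hypothetical module standing in for one that a later library release removes.
old_module = types.ModuleType("legacy_lib_indexes")


class NumericIndex:
    """Hypothetical class that lives in the soon-to-be-removed module."""

    def __init__(self, values):
        self.values = values


# Make pickle record the class as legacy_lib_indexes.NumericIndex.
NumericIndex.__module__ = "legacy_lib_indexes"
NumericIndex.__qualname__ = "NumericIndex"
old_module.NumericIndex = NumericIndex
sys.modules["legacy_lib_indexes"] = old_module

# "Old version": the cache is written while the module still exists.
payload = pickle.dumps(NumericIndex([1, 2, 3]))

# "New version": the module path no longer exists.
del sys.modules["legacy_lib_indexes"]

try:
    pickle.loads(payload)
    error_name, error_message = None, ""
except ModuleNotFoundError as exc:
    error_name, error_message = type(exc).__name__, str(exc)

print(error_name, error_message)
```

The payload itself is intact. Only the import of the recorded module path fails, which matches the shape of the error quoted above.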

Solution

pd.read_pickle can load DataFrames dumped by older pandas versions because pandas patches pickle.load internally, so we can avoid the problem by using pd.read_pickle instead of dill.load. pd.read_pickle can also load objects other than DataFrames, including objects dumped by dill, so it is a natural fit for PickleFileProcessor.
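For readers curious how a compatibility loader can survive a renamed module, here is a generic standard-library sketch of the idea: an `Unpickler` subclass whose `find_class` maps old locations to new ones. This is not pandas' code, and I am assuming rather than asserting that pandas' compatibility layer works along similar lines. All names here (`old_layout`, `new_layout`, `NumericIndex`, `Index`, `RENAMED`, `CompatUnpickler`, `publish`) are hypothetical.

```python
import io
import pickle
import sys
import types


class Index:
    """Hypothetical class that survives the upgrade under a new location."""

    def __init__(self, values):
        self.values = values


def publish(module_name, attr_name, cls):
    """Expose cls as module_name.attr_name (simulates a library layout)."""
    module = types.ModuleType(module_name)
    setattr(module, attr_name, cls)
    cls.__module__ = module_name
    cls.__qualname__ = cls.__name__ = attr_name
    sys.modules[module_name] = module


# "Old version": the class is importable as old_layout.NumericIndex.
publish("old_layout", "NumericIndex", Index)
payload = pickle.dumps(Index([1, 2, 3]))

# "New version": the old module is gone, the class now lives elsewhere.
del sys.modules["old_layout"]
publish("new_layout", "Index", Index)

# Hypothetical rename table: (old module, old name) -> (new module, new name).
RENAMED = {("old_layout", "NumericIndex"): ("new_layout", "Index")}


class CompatUnpickler(pickle.Unpickler):
    """Unpickler that redirects known old locations before importing them."""

    def find_class(self, module, name):
        module, name = RENAMED.get((module, name), (module, name))
        return super().find_class(module, name)


restored = CompatUnpickler(io.BytesIO(payload)).load()
print(type(restored).__name__, restored.values)
```

A plain `pickle.loads(payload)` would fail here because `old_layout` can no longer be imported. The remapping loader succeeds because it never tries to import the old path.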

Hi-king commented 4 months ago

@ujiuji1259 Nice API survey & documentation :) LGTM