Currently we load datasets with `pd.read_csv` from gzipped CSV format. Loading should be much faster if we converted the data to Parquet and used `pd.read_parquet` (this might also reduce download sizes when using e.g. snappy compression).
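Roughly, the one-time conversion and the faster load could look like this (file names are placeholders, and pyarrow would need to be installed):

```python
import pandas as pd

# Placeholder paths; the real dataset locations would differ.
csv_path = "dataset.csv.gz"
parquet_path = "dataset.parquet"

# One-time conversion: read the gzipped CSV, write Parquet with
# snappy compression (requires pyarrow).
df = pd.read_csv(csv_path, compression="gzip")
df.to_parquet(parquet_path, engine="pyarrow", compression="snappy", index=False)

# Subsequent loads go through the (typically much faster) Parquet reader.
df = pd.read_parquet(parquet_path, engine="pyarrow")
```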
The limitation of this approach is that the converted datasets would need to be hosted somewhere and a new dependency (pyarrow) would need to be added. I'm not sure it would be worth it.