[Feature Request] Using parquet files instead/alongside torch splits

snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning

https://ogb.stanford.edu

MIT License


[Feature Request] Using parquet files instead/alongside torch splits #377

Closed Dsantra92 closed 1 year ago

Dsantra92 commented 1 year ago

Hello devs. I am working on support for OGB datasets in MLDatasets.jl. One of the bottlenecks we are facing is loading the .pt files: our current implementation, a hack built on Pickle.jl, uses substantially more memory than Python does. With the new TorchArrow support, could you provide Parquet files for loading the splits?

weihua916 commented 1 year ago

Hi! Are the split files really that large? They only store the split indices, no?

Dsantra92 commented 1 year ago

I was asking whether it is possible, or planned, to use a language-independent format to store the computed splits.

weihua916 commented 1 year ago

I see. That'd require all zipped files to be re-created. I do not think we will support this in the immediate future. You can probably consider some workaround on your side.
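One possible shape for such a workaround is a one-time export, run on the Python side, that rewrites the split indices into a format any language can read. The sketch below is only an illustration under stated assumptions: it assumes the splits are already available in Python as plain integer sequences keyed by split name, and it uses gzip-compressed CSV from the standard library as a stand-in for Parquet, which would need a third-party library such as pyarrow. The names `export_splits`, `load_split`, and `example_splits` are hypothetical and not part of OGB.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def export_splits(split_idx, out_dir):
    """Write each split's indices to <out_dir>/<name>.csv.gz, one index per row."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, indices in split_idx.items():
        with gzip.open(out_dir / f"{name}.csv.gz", "wt", newline="") as f:
            writer = csv.writer(f)
            writer.writerows([int(i)] for i in indices)


def load_split(path):
    """Read one exported split back into a list of ints."""
    with gzip.open(path, "rt", newline="") as f:
        return [int(row[0]) for row in csv.reader(f)]


# Hypothetical split indices. Real ones would come from the dataset,
# for example tensors converted with .tolist() after loading in Python.
example_splits = {"train": [0, 1, 2, 5], "valid": [3, 6], "test": [4, 7]}

with tempfile.TemporaryDirectory() as tmp:
    export_splits(example_splits, tmp)
    round_trip = {
        name: load_split(Path(tmp) / f"{name}.csv.gz") for name in example_splits
    }

# The exported files reproduce the original indices exactly.
assert round_trip == example_splits
```

The exported files contain nothing but integers, so a reader in another language only needs gzip and CSV support and never has to unpickle a .pt file.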

Dsantra92 commented 1 year ago

Makes sense!🙁