tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Please support prefetch with python datasets #5323

Open bionicles opened 3 months ago

bionicles commented 3 months ago

Is your feature request related to a problem? Please describe.
There is a tremendous performance difference between datasets that are tensor-based end to end and datasets where some of the data wrangling happens in Python.

I was hoping to use prefetch to prepare data on the CPU while the GPU does other work, but unfortunately this only works if the data preparation is expressed entirely in TensorFlow ops (not sure of the right term here).
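For reference, a minimal sketch of the pattern in question (the generator and its contents are toy stand-ins, not the actual closed-source pipeline): wrapping a Python generator with `tf.data.Dataset.from_generator` and adding `prefetch` is the documented route, but the generator body itself still executes in the driver's Python interpreter, under its GIL.

```python
import tensorflow as tf

def slow_gen():
    # stand-in for python-side data wrangling (IO, parsing, etc.)
    for i in range(8):
        yield i

ds = tf.data.Dataset.from_generator(
    slow_gen,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int64),
).prefetch(tf.data.AUTOTUNE)

# prefetch overlaps the downstream pipeline with the consumer,
# but slow_gen still runs in the main python process.
first = [int(x) for x in ds.take(4)]
```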

Python I/O operations are often orders of magnitude slower; they act as a bottleneck and prevent the machine from keeping accelerators working at capacity.

Describe the solution you'd like
I wish tf.data.Dataset.prefetch were more broadly compatible with non-TensorFlow, vanilla-Python data preparation.

Would it be possible for prefetch to use performant C++ to sidestep Python GIL issues and juggle Python data-wrangling CPU processes alongside GPU training/inference, without depending on that CPU work happening in the main Python driver process? I just want to be able to prefetch custom Python datasets. There is often some preparation involved; not every dataset is TensorFlow end to end.

i.e. instead of (python)->(gpu), what if it were

(python)->(cpp) (cpp)->(python_prefetch) (cpp)->(gpu/tpu accelerator)

Since C++ has no GIL, it could run the Python generator in a process isolated from the main Python driver process's GIL. You'd still have a GIL per generator, but that's an easy fix: just run more Python processes with different random seeds, etc.
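The buffering structure described above can be sketched in plain Python. This sketch uses a background thread for portability; the actual proposal would use an isolated worker process or a C++ runtime to escape the driver's GIL, and all names here are illustrative, not an existing API.

```python
import queue
import threading
import time

_SENTINEL = object()

def prefetched(gen, buffer_size=4):
    """Yield items from `gen`, preparing up to buffer_size ahead of time.

    A background worker (stand-in for the isolated process the proposal
    describes) fills a bounded queue while the consumer drains it.
    """
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in gen:
            q.put(item)
        q.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item

def slow_items():
    for i in range(5):
        time.sleep(0.01)  # simulate python-side wrangling / IO
        yield i * i
```

With `list(prefetched(slow_items()))`, the worker prepares items while the consumer is busy; the bounded `maxsize` keeps memory use flat when the producer outpaces the consumer.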

Describe alternatives you've considered
Torch's DataLoader could be an option, but it also seems to be a Python-driven solution and therefore not especially performant. I tried threading, but pickling and unpickling overhead in Python can be pretty bad. I think C++ could run a background Python process to prefetch Python data more efficiently than Python itself could.

Additional context
Broader support for prefetching custom datasets could enable new use cases for tf.data, especially in prototyping or in infinite search spaces where it may not make sense to convert an entire dataset to tensors in advance.

Apologies if I misunderstand the intricacies involved. I just want to prefetch datasets built from generators. I tried doing this last week and it didn't work, so hopefully I'm not raising an issue that has already been fixed, or missing an existing way to pull it off. It's hard to provide an example since the code in question is closed-source and quite extensive anyway. For a good example of when this would be handy, consider RL gym environments, or datasets that involve making GET requests.

tomvdw commented 3 months ago

Do I assume correctly that you're using tfds.data_source to load the data? If so, one option is to use Grain to load your data. IIUC, Grain does prefetch the data. If you're doing random access, prefetching is hard because you don't know which record will be loaded next.

bionicles commented 3 months ago

Thank you @tomvdw, I will check that out! Just eyeballing it, one way to make Grain more accessible would be to add usage examples to the README so folks who look at the repo can get the gestalt. I tried clicking the docs link in the GitHub iOS app and it took me to a folder of code, so I'll take a look in there.

I meekly suggest that code examples above the fold are great advertising for any repo. Thank you for sharing!