huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

[📄 Docs] Create a `datasets` performance guide. #3829

Open dynamicwebpaige opened 2 years ago

dynamicwebpaige commented 2 years ago

Brief Overview

Downloading, saving, and preprocessing large datasets with the `datasets` library can often run into performance bottlenecks. These bottlenecks can be challenging to identify and debug, especially for users who are less experienced with building deep learning experiments.
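For instance (a minimal sketch; the dataset and transform are illustrative, not taken from this issue), a row-by-row `map` is often much slower than its batched, multi-process equivalent, and this is easy to miss without guidance:

```python
from datasets import load_dataset

# Arbitrary public dataset used purely for illustration.
ds = load_dataset("imdb", split="train")

# Row-by-row processing: one Python call per example (slow for large datasets).
ds_slow = ds.map(lambda ex: {"text": ex["text"].lower()})

# Batched processing: one call per batch of 1000 examples, parallelized
# across 4 worker processes -- usually much faster.
ds_fast = ds.map(
    lambda batch: {"text": [t.lower() for t in batch["text"]]},
    batched=True,
    batch_size=1000,
    num_proc=4,
)
```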

Feature Request

Could we create a performance guide for using `datasets`, similar to the data-pipeline performance guides available for other frameworks (such as TensorFlow's `tf.data` performance guide)?

This performance guide should detail practical options for improving performance with `datasets` and enumerate common best practices. It should also show how to use tools like the PyTorch Profiler or the TensorFlow Profiler to identify performance bottlenecks (example below).

*(screenshot: profiler trace example)*
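As a rough illustration of the kind of profiling example the guide could include, here is a minimal sketch that uses `torch.profiler` to attribute time to data-loading steps over a `datasets` dataset (the dataset, batch size, and iteration count are arbitrary placeholders):

```python
from torch.profiler import ProfilerActivity, profile, record_function
from torch.utils.data import DataLoader

from datasets import load_dataset

# Arbitrary example dataset; any Dataset with a torch format would do.
ds = load_dataset("imdb", split="train").with_format("torch")
loader = DataLoader(ds, batch_size=32)

# Profile a handful of data-loading steps to see where time is spent.
it = iter(loader)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(10):
        with record_function("dataloader_next"):
            batch = next(it)  # a real pipeline would also run a model step here

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```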


lhoestq commented 2 years ago

Hi! Yes, this is definitely something we'll explore. Optimizing processing pipelines can be challenging, and performance is key here: we want anyone to be able to play with large-scale datasets more easily.

I think we'll start by documenting the performance of the dataset transforms we provide, and then we can add some tools to help with debugging and optimizing them.
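As one possible starting point for that documentation, here is a minimal sketch of how a transform's performance could be measured (the dataset, transform, and `timed` helper are hypothetical; `load_from_cache_file=False` is passed so caching does not skew the comparison):

```python
import time

from datasets import load_dataset

ds = load_dataset("imdb", split="train")

def timed(label, fn):
    # Hypothetical helper: run fn once and report wall-clock time.
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# Compare a row-wise transform against its batched equivalent.
timed("map (row-wise)", lambda: ds.map(
    lambda ex: {"n_chars": len(ex["text"])},
    load_from_cache_file=False))
timed("map (batched)", lambda: ds.map(
    lambda batch: {"n_chars": [len(t) for t in batch["text"]]},
    batched=True, load_from_cache_file=False))
```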