huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.31k stars 2.7k forks source link

wiki_dpr pre-processing performance #1670

Open dbarnhart opened 3 years ago

dbarnhart commented 3 years ago

I've been working with wiki_dpr and noticed that the dataset processing is seriously impaired in performance [1]. It takes about 12h to process the entire dataset. Most of this time is simply loading and processing the data, but the actual indexing is also quite slow (3h).

I won't repeat the concerns around multiprocessing as they are addressed in other issues (#786), but this is the first obvious thing to do. Using cython to speed up the text manipulation may be also help. Loading and processing a dataset of this size in under 15 minutes does not seem unreasonable on a modern multi-core machine. I have hit such targets myself on similar tasks. Would love to see this improve.

The other issue is that it takes 3h to construct the FAISS index. If only we could use GPUs with HNSW, but we can't. My sharded GPU indexing code can build an IVF + PQ index in 10 minutes on 20 million vectors. Still, 3h seems slow even for the CPU.

It looks like HF is adding only 1000 vectors at a time by default [2], whereas the faiss benchmarks adds 1 million vectors at a time (effectively) [3]. It's possible the runtime could be reduced with a larger batch. Also, it looks like project dependencies ultimately use OpenBLAS, but this is known to have issues when combined with OpenMP, which HNSW does [3]. A workaround is to set the environment variable OMP_WAIT_POLICY=PASSIVE via os.environ or similar.

References: [1] https://github.com/huggingface/datasets/blob/master/datasets/wiki_dpr/wiki_dpr.py [2] https://github.com/huggingface/datasets/blob/master/src/datasets/search.py [3] https://github.com/facebookresearch/faiss/blob/master/benchs/bench_hnsw.py [4] https://github.com/facebookresearch/faiss/issues/422

lhoestq commented 3 years ago

Hi ! And thanks for the tips :)

Indeed currently wiki_dpr takes some time to be processed. Multiprocessing for dataset generation is definitely going to speed up things.

Regarding the index note that for the default configurations, the index is downloaded instead of being built, which avoid spending time on constructing the index. However in other cases it would be awesome to make the construction faster.

Any contribution that can help things faster are welcome. In particular in you have some code that can build a wiki_dpr IVF PQ index in a sharded GPU setup and would like to share it, we can add it to an examples folder. In particular since faiss is becoming the library of reference for dataset indexing for tasks like Open Domain Question Answering.

dbarnhart commented 3 years ago

I'd be happy to contribute something when I get the time, probably adding multiprocessing and / or cython support to wiki_dpr. I've written cythonized apache beam code before as well.

For sharded index building, I used the FAISS example code for indexing 1 billion vectors as a start. I'm sure you're aware that the documentation isn't great, but the source code is fairly easy to follow.

lhoestq commented 3 years ago

Nice thanks ! That would be awesome to make its construction faster :)