huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

benchmarking against MMapIndexedDataset #1894

Open sshleifer opened 3 years ago

sshleifer commented 3 years ago

I am trying to benchmark my datasets-based implementation against fairseq's MMapIndexedDataset and finding that, according to psrecord, my datasets implementation uses about 3% more CPU memory and runs 1% slower on wikitext-103 (~1 GB of tokens).

Questions:

1) Is this (basically identical) performance expected?
2) Is there a scenario where this library will outperform MMapIndexedDataset? (maybe more examples/larger examples?)
3) Should I be using different benchmarking tools than psrecord? How do you guys do benchmarks?

Thanks in advance! Sam
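(For question 3: alongside an external sampler like psrecord, CPU time and peak RSS can also be grabbed in-process with the standard library. A rough sketch, POSIX-only; psrecord itself samples an external PID over time, which is closer to what a dataloader benchmark wants.)

```python
import resource
import time

import numpy as np

def measure(fn):
    """Run fn and return (CPU seconds, peak RSS).

    Rough in-process sketch only. Note ru_maxrss is reported
    in KiB on Linux but in bytes on macOS.
    """
    t0 = time.process_time()
    fn()
    cpu = time.process_time() - t0
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return cpu, peak

# Example workload standing in for a dataset read loop.
cpu_s, peak_rss = measure(lambda: np.sort(np.random.rand(1_000_000)))
```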

lhoestq commented 3 years ago

Hi Sam! Indeed we can expect the performance to be very close, since both MMapIndexedDataset and the datasets implementation use memory mapping. With memory mapping, what determines I/O performance is the speed of your hard drive/SSD.
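Since both readers boil down to memory mapping, here is a minimal sketch of the mechanism using numpy's memmap (the file name and dtype are illustrative; this is neither library's actual on-disk format):

```python
import os
import tempfile

import numpy as np

# Write ~8 MB of token ids to a scratch file, then map it back.
path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
np.arange(1_000_000, dtype=np.int64).tofile(path)

# np.memmap maps the file into virtual memory: nothing is read from
# disk until a page is touched, so resident memory stays low and
# throughput is bounded by the drive and the OS page cache, not by
# the Python layer on top.
mm = np.memmap(path, dtype=np.int64, mode="r")
token = int(mm[123_456])  # touching this page triggers the actual read
```

Both libraries ultimately lean on the same demand-paging behavior, which is why the numbers come out nearly identical.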

In terms of performance we're pretty close to the optimal speed for reading text, even though I found recently that we could still slightly improve speed for big datasets (see here).

In terms of number of examples and example sizes, the only limit is the available disk space you have.

I haven't used psrecord yet, but it seems like a very interesting tool for benchmarking. Currently our only benchmarks are GitHub Actions jobs that guard against speed regressions. But it would be cool to have benchmarks with comparisons against other dataset tools! This would be useful to many people.

lhoestq commented 3 years ago

Also I would be interested to know what data types MMapIndexedDataset supports. Is there some documentation somewhere ?

sshleifer commented 3 years ago

No docs, haha — it's written to support integer numpy arrays.

You can build one in fairseq with, roughly:


wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
export dd=$HOME/fairseq-py/wikitext-103-raw

export mm_dir=$HOME/mmap_wikitext2
mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
for SPLIT in train valid; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json gpt2_bpe/encoder.json \
        --vocab-bpe gpt2_bpe/vocab.bpe \
        --inputs $dd/wiki.${SPLIT}.raw \
        --outputs $dd/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done

mkdir -p $mm_dir
fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref $dd/wiki.train.bpe \
    --validpref $dd/wiki.valid.bpe \
    --destdir $mm_dir \
    --workers 60 \
    --dataset-impl mmap
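For reference, that recipe produces a flat .bin of token ids plus an .idx holding per-example sizes and offsets. The exact fairseq index layout isn't documented anywhere, but the O(1) lookup pattern it enables can be sketched like this (names and layout here are illustrative, not the real format):

```python
import os
import tempfile

import numpy as np

# Flat token storage plus an offsets array, in the spirit of the
# .bin/.idx pair. The real fairseq index layout differs; this only
# illustrates the lookup pattern.
d = tempfile.mkdtemp()
examples = [np.array([1, 2, 3], dtype=np.int32),
            np.array([4, 5], dtype=np.int32),
            np.array([6, 7, 8, 9], dtype=np.int32)]
np.concatenate(examples).tofile(os.path.join(d, "data.bin"))
sizes = np.array([len(e) for e in examples], dtype=np.int64)
offsets = np.concatenate([[0], np.cumsum(sizes)])

data = np.memmap(os.path.join(d, "data.bin"), dtype=np.int32, mode="r")

def get(i):
    # Slicing a memmap touches only the pages holding example i.
    return data[offsets[i]:offsets[i + 1]]
```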

I'm noticing in my benchmarking that it's much smaller on disk than Arrow (200 MB vs 900 MB), and that both incur significant cost when increasing the number of data loader workers. This somewhat old post suggests there are some gains to be had from using pyarrow.serialize(array).to_buffer(). I haven't yet figured out how much of this pa.Table does under the hood.

The MMapIndexedDataset bottlenecks we are working on improving (by using Arrow) are:

1) MMapIndexedDataset's index, which stores offsets, basically gets read in its entirety by each dataloading process.
2) We have separate, identical MMapIndexedDatasets in each dataloading worker, so there's redundancy there; we wonder whether Arrow can somehow dedupe these in shared memory.
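On the redundancy point, the OS page cache already dedupes read-only mappings across processes: every worker that memory-maps the same file is backed by the same physical pages. A small stdlib sketch of two handles over one file, which is effectively what N dataloader workers do:

```python
import mmap
import os
import tempfile

# One 4 KiB file, mapped twice, as two dataloader workers would map
# the same dataset file. Both read-only mappings are backed by the
# same physical pages in the OS page cache, so resident memory does
# not scale with the number of workers.
path = os.path.join(tempfile.mkdtemp(), "shared.bin")
with open(path, "wb") as f:
    f.write(b"\x01" * 4096)

f1, f2 = open(path, "rb"), open(path, "rb")
m1 = mmap.mmap(f1.fileno(), 0, access=mmap.ACCESS_READ)
m2 = mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ)
```

What is not shared is any per-worker Python-side copy of the index; that is the part a shared Arrow buffer could plausibly dedupe.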

It will take me a few hours to get MMapIndexedDataset benchmarks out of fairseq/onto a branch in this repo, but I'm happy to invest the time if you're interested in collaborating on some performance hacking.