sshleifer opened this issue 3 years ago
Hi Sam!

Indeed, we can expect the performance to be very close, since both MMapIndexedDataset and the `datasets` implementation use memory mapping. With memory mapping, what determines I/O performance is the speed of your hard drive/SSD.

In terms of read performance we're pretty close to optimal, although I recently found that we could still slightly improve speed for big datasets (see here).

In terms of number of examples and example sizes, the only limit is the available disk space.
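To make the memory-mapping point concrete, here is a minimal sketch (my illustration, not the actual `datasets` internals) of why mapped reads are bounded by the drive rather than by RAM: the OS pages data in lazily, so touching one element does not load the whole file.

```python
import tempfile
import numpy as np

# Write a flat file of token ids (hypothetical example data).
path = tempfile.mktemp(suffix=".bin")
tokens = np.arange(1_000_000, dtype=np.int32)
tokens.tofile(path)

# np.memmap maps the file into virtual memory; only the pages actually
# touched are read from disk, and repeat reads hit the OS page cache.
mm = np.memmap(path, dtype=np.int32, mode="r")
print(int(mm[123_456]))  # reads only the page(s) containing this element
```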
I haven't used psrecord yet, but it seems like a very interesting tool for benchmarking. Currently, the only benchmarks we have are GitHub Actions runs that guard against speed regressions. It would be cool to have benchmarks comparing against other dataset tools! That would be useful to many people.
Also, I would be interested to know what data types MMapIndexedDataset supports. Is there documentation somewhere?
No docs, haha; it's written to support integer numpy arrays.
You can build one in fairseq with, roughly:
```bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

export dd=$HOME/fairseq-py/wikitext-103-raw
export mm_dir=$HOME/mmap_wikitext2

# Download the GPT-2 BPE vocab files
mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt

# BPE-encode the raw text (paths adjusted to the wikitext files above)
for SPLIT in train valid; do
  python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json gpt2_bpe/encoder.json \
    --vocab-bpe gpt2_bpe/vocab.bpe \
    --inputs $dd/wiki.${SPLIT}.raw \
    --outputs $dd/wiki.${SPLIT}.bpe \
    --keep-empty \
    --workers 60
done

# Binarize into the mmap format (.bin + .idx)
mkdir -p $mm_dir
fairseq-preprocess \
  --only-source \
  --srcdict gpt2_bpe/dict.txt \
  --trainpref $dd/wiki.train.bpe \
  --validpref $dd/wiki.valid.bpe \
  --destdir $mm_dir \
  --workers 60 \
  --dataset-impl mmap
```
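The idea behind the resulting .bin/.idx pair can be sketched like this (a deliberate simplification, not fairseq's exact on-disk format): a flat file of concatenated token ids plus an index of per-example offsets.

```python
import os
import tempfile
import numpy as np

# Two hypothetical tokenized examples.
examples = [np.array([5, 6, 7], dtype=np.int64),
            np.array([8, 9], dtype=np.int64)]

flat = np.concatenate(examples)
sizes = np.array([len(e) for e in examples], dtype=np.int64)
# Example i spans flat[offsets[i]:offsets[i + 1]].
offsets = np.concatenate([[0], np.cumsum(sizes)])

bin_path = os.path.join(tempfile.mkdtemp(), "data.bin")
flat.tofile(bin_path)

# Fetching an example just slices the memory-mapped buffer; only the
# index (and the touched pages) needs to be resident in memory.
mm = np.memmap(bin_path, dtype=np.int64, mode="r")
example_1 = np.asarray(mm[offsets[1]:offsets[2]])
```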
I'm noticing in my benchmarking that it's much smaller on disk than arrow (200MB vs 900MB), and that both incur significant cost when increasing the number of data loader workers.

This somewhat old post suggests there are some gains to be had from using pyarrow.serialize(array).to_buffer(). I haven't yet figured out how much of this pa.Table does under the hood.
The MMapIndexedDataset bottlenecks we are working on improving (by using arrow) are:

1) MMapIndexedDataset's index, which stores the offsets, basically gets read in its entirety by each dataloading process.
2) we have separate, identical MMapIndexedDatasets on each dataloading worker, so there's redundancy there; we wonder if there is a way that arrow can somehow dedupe these in shared memory.
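For (2), one direction worth exploring (just a sketch of mine, not something either library does today as far as I know) is to put the offsets index in POSIX shared memory once, so workers attach to a single copy instead of each holding their own:

```python
import numpy as np
from multiprocessing import shared_memory

# Hypothetical offsets index for the dataset.
offsets = np.arange(0, 1000, 10, dtype=np.int64)

# Parent process: create the shared block and copy the index in once.
shm = shared_memory.SharedMemory(create=True, size=offsets.nbytes)
shared = np.ndarray(offsets.shape, dtype=offsets.dtype, buffer=shm.buf)
shared[:] = offsets  # the one and only copy of the index

# A dataloader worker would attach by name (shm.name gets passed to it)
# and see the same memory, with no per-worker duplicate.
worker = shared_memory.SharedMemory(name=shm.name)
idx = np.ndarray(offsets.shape, dtype=offsets.dtype, buffer=worker.buf)
fifth = int(idx[5])

# Drop the numpy views before closing, or close() raises BufferError.
del idx, shared
worker.close()
shm.close()
shm.unlink()
```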
It will take me a few hours to get MMapIndexedDataset benchmarks out of fairseq and onto a branch in this repo, but I'm happy to invest the time if you're interested in collaborating on some performance hacking.
I am trying to benchmark my datasets-based implementation against fairseq's MMapIndexedDataset and finding that, according to psrecord, my datasets implementation uses about 3% more CPU memory and runs 1% slower for wikitext103 (~1GB of tokens).

Questions:
1) Is this (basically identical) performance expected?
2) Is there a scenario where this library will outperform MMapIndexedDataset? (maybe more examples/larger examples?)
3) Should I be using different benchmarking tools than psrecord / how do you guys do benchmarks?

Thanks in advance!
Sam