huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Slow dataloading with big datasets issue persists #2252

Closed: hwijeen closed this issue 8 months ago

hwijeen commented 3 years ago

Hi,

I reported slow data fetching with large data (#2210) a couple of weeks ago, and @lhoestq referred me to the fix (#2122). However, the problem seems to persist. Here are the profiling results:

1) Running with 60GB

Action                              |  Mean duration (s)    |Num calls          |  Total time (s)   |  Percentage %     |
------------------------------------------------------------------------------------------------------------------------------------
Total                               |  -                |_                  |  517.96           |  100 %            |
------------------------------------------------------------------------------------------------------------------------------------
model_backward                      |  0.26144          |100                |  26.144           |  5.0475           |
model_forward                       |  0.11123          |100                |  11.123           |  2.1474           |
get_train_batch                     |  0.097121         |100                |  9.7121           |  1.8751           |

3) Running with 600GB, datasets==1.6.0

Action                              |  Mean duration (s)    |Num calls          |  Total time (s)   |  Percentage %     |
------------------------------------------------------------------------------------------------------------------------------------
Total                               |  -                |_                  |  4563.2           |  100 %            |
------------------------------------------------------------------------------------------------------------------------------------
get_train_batch                     |  5.1279           |100                |  512.79           |  11.237           |
model_backward                      |  4.8394           |100                |  483.94           |  10.605           |
model_forward                       |  0.12162          |100                |  12.162           |  0.26652          |

I see that get_train_batch lags when the data is large. Could this be related to a different issue? I would be happy to provide any information needed to investigate.
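
For reference, these tables look like the output of PyTorch Lightning's simple profiler; assuming that setup, a profile like this can be produced with:

import pytorch_lightning as pl

# minimal sketch, assuming a PyTorch Lightning training setup
trainer = pl.Trainer(profiler="simple", max_steps=100)
# trainer.fit(model)  # prints the Action / Mean duration / Num calls table at the end of training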

lhoestq commented 3 years ago

Hi ! Sorry to hear that. This may come from another issue then.

First, can we check whether this latency comes from the dataset itself? Could you try loading your dataset and benchmarking the speed of querying random examples from it?

import time
import numpy as np

from datasets import load_from_disk

dataset = load_from_disk(...) # or from load_dataset...

_start = time.time()
n = 100
for i in np.random.default_rng(42).integers(0, len(dataset), size=n):
    _ = dataset[i]
print(time.time() - _start)

If we see a significant speed difference between your two datasets, then it would mean that there's an issue somewhere.

hwijeen commented 3 years ago

Hi @lhoestq, here is the result. I additionally measured time to load_from_disk:

Hmm.. I double checked that it's version 1.6.0. The difference seems quite big, could it be related to the running environment?

lhoestq commented 3 years ago

I'm surprised by the speed change. Can you give more details about your dataset ? The speed depends on the number of batches in the Arrow tables and the distribution of the batch lengths. You can access the batches by doing dataset.data.to_batches() (for debugging only; it doesn't bring the data into memory). See the sketch at the end of this comment.

Also can you explain what parameters you used if you used map calls ? Also if you have some code that reproduces the issue I'd be happy to investigate it.
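
For example, a minimal sketch to check that batch distribution (assuming a dataset saved with save_to_disk; the path is a placeholder):

from collections import Counter

from datasets import load_from_disk

dataset = load_from_disk("path/to/dataset")  # placeholder path

batches = dataset.data.to_batches()  # record batch views, nothing is loaded into memory
print(f"number of record batches: {len(batches)}")
print(Counter(b.num_rows for b in batches).most_common(10))  # most common batch lengths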

lhoestq commented 3 years ago

Also, could you give us more info about your environment: your OS, your version of pyarrow, and whether you're using an HDD or an SSD?

hwijeen commented 3 years ago

Here are some details about my 600GB dataset. This is the dataset AFTER the map function, and once I load it I no longer use map during training. Regarding the distribution of lengths, it is almost uniform (90% of examples are 512 tokens and 10% are randomly shorter than that -- a typical setting for language modeling).

len(batches):
492763

batches[0]: 
pyarrow.RecordBatch
attention_mask: list<item: uint8>
  child 0, item: uint8
input_ids: list<item: int16>
  child 0, item: int16
special_tokens_mask: list<item: uint8>
  child 0, item: uint8
token_type_ids: list<item: uint8>
  child 0, item: uint8

Here are some parameters to the map function, in case they are relevant:

num_proc=1    # as multiprocessing is slower in my case
load_from_cache_file=False

hwijeen commented 3 years ago

Regarding the environment, I am running the code on a cloud server. Here is some info:

Ubuntu 18.04.5 LTS   # cat /etc/issue
pyarrow                 3.0.0  # pip list | grep pyarrow

The data is stored on an SSD that is mounted to the machine via NFS (Network File System).

If you could point me to some commands to check the details of the environment, I would be happy to provide the relevant information, @lhoestq!

hwijeen commented 3 years ago

I am not sure how to provide reproducible code, since the problem only arises when the data is big. For now, I will share the part that I think is relevant. Feel free to ask me for more info.

import datasets
import pytorch_lightning
import torch
import transformers

class MyModel(pytorch_lightning.LightningModule):
    def setup(self, stage):
        # path and tok_path are defined elsewhere
        self.dataset = datasets.load_from_disk(path)
        self.dataset.set_format("torch")

    def train_dataloader(self):
        collate_fn = transformers.DataCollatorForLanguageModeling(
            tokenizer=transformers.ElectraTokenizerFast.from_pretrained(tok_path)
        )
        dataloader = torch.utils.data.DataLoader(
            self.dataset,
            batch_size=32,
            collate_fn=collate_fn,
            num_workers=8,
            pin_memory=True,
        )
        return dataloader

lhoestq commented 3 years ago

Hi ! Sorry for the delay, I haven't had a chance to take a look at this yet. Are you still experiencing this issue ? I'm asking because the latest patch release, 1.6.2, fixed a few memory issues that could have led to slowdowns.

hwijeen commented 3 years ago

Hi! I just ran the same code with two different datasets (one 60 GB and the other 600 GB), and the latter runs much slower: the ETA differs by 10x.

BenoitDalFerro commented 3 years ago

@lhoestq and @hwijeen

Despite upgrading to datasets 1.6.2, I am still experiencing extremely slow loading (about 2 hours) for a 300 GB local dataset (shard size 1.1 GB) on a local HDD (40 MB/s read speed). This corresponds almost exactly to the total data size divided by the read speed (300 GB ÷ 40 MB/s ≈ 2 hours), implying that the entire dataset is read at each load.

Stack details:

GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce GTX 1050
Nvidia driver version: 457.63
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn64_7.dll
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] datasets==1.6.2
[pip3] transformers==4.5.1
[pip3] numpy==1.19.1
[pip3] numpydoc==1.1.0
[pip3] pytorch-metric-learning==0.9.98
[pip3] torch==1.8.1
[pip3] torchaudio==0.8.1
[pip3] torchvision==0.2.2
[conda] blas 2.16 mkl conda-forge
[conda] cudatoolkit 10.2.89 hb195166_8 conda-forge
[conda] libblas 3.8.0 16_mkl conda-forge
[conda] libcblas 3.8.0 16_mkl conda-forge
[conda] liblapack 3.8.0 16_mkl conda-forge
[conda] liblapacke 3.8.0 16_mkl conda-forge
[conda] mkl 2020.1 216
[conda] numpy 1.19.1 py37hae9e721_0 conda-forge
[conda] numpydoc 1.1.0 py_1 conda-forge
[conda] pytorch 1.8.1 py3.7_cuda10.2_cudnn7_0 pytorch
[conda] pytorch-metric-learning 0.9.98 pyh39e3cac_0 metric-learning
[conda] torchaudio 0.8.1 py37 pytorch
[conda] torchvision 0.2.2 py_3 pytorch

lhoestq commented 3 years ago

Hi @BenoitDalFerro, how do you load your dataset ?

BenoitDalFerro commented 3 years ago

Hi @lhoestq, thanks for the quick turn-around. Actually, I load it the plain vanilla way, without any particular knack or fashion. I tried to look into the documentation for some alternative but couldn't find any:

dataset = load_from_disk(dataset_path=os.path.join(datasets_dir,dataset_dir))

tsproisl commented 3 years ago

I’m facing the same issue when loading a 900GB dataset (stored via save_to_disk): load_from_disk(path_to_dir) takes 1.5 hours and htop consistently shows high IO rates > 120 MB/s.

BenoitDalFerro commented 3 years ago

@tsproisl same here. Smells like teen spirit: an intended generator inadvertently ending up as an iterator.

@lhoestq perhaps a solution to locate the bug in the code is to track its signature via HDD read-usage monitoring. One option is to add a tracking decorator on top of each function and sequentially close all hatches from top to bottom. I suggest pySMART (https://pypi.org/project/pySMART/), a smartmontools implementation.
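
A rough sketch of that tracking-decorator idea using psutil rather than pySMART (the disk name and the decorated function are placeholders):

import functools

import psutil

def track_reads(disk="PhysicalDrive0"):  # placeholder disk name
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            before = psutil.disk_io_counters(perdisk=True)[disk].read_bytes
            result = fn(*args, **kwargs)
            after = psutil.disk_io_counters(perdisk=True)[disk].read_bytes
            print(f"{fn.__name__}: read {(after - before) / 1e9:.2f} GB from {disk}")
            return result
        return wrapper
    return decorator

# usage (hypothetical):
# @track_reads(disk="PhysicalDrive0")
# def load_my_dataset(path):
#     return datasets.load_from_disk(path)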

lhoestq commented 3 years ago

I wasn't able to reproduce this on a toy dataset of around 300GB:

import datasets as ds

s = ds.load_dataset("squad", split="train")
s4000 = ds.concatenate_datasets([s] * 4000)
print(ds.utils.size_str(s4000.data.nbytes))  # '295.48 GiB'

s4000.save_to_disk("tmp/squad_4000")
import psutil
import time
from datasets import load_from_disk

disk = "disk0"  # You may have to change your disk here
iocnt1 = psutil.disk_io_counters(perdisk=True)[disk]
time1 = time.time()

s4000_reloaded = load_from_disk("tmp/squad_4000")

time2 = time.time()
iocnt2 = psutil.disk_io_counters(perdisk=True)[disk]

print(f"Blocks read {iocnt2.read_count - iocnt1.read_count}")  # Blocks read 18
print(f"Elapsed time: {time2 - time1:.02f}s")  # Elapsed time: 14.60s

Could you run this on your side and tell me how much time it takes ? Please run it when your machine is idle so that other processes don't interfere.

I got these results on my MacBook Pro with datasets 1.6.2.

BenoitDalFerro commented 3 years ago

@lhoestq thanks, test running as we speak, bear with me

lhoestq commented 3 years ago

Just tried on Google Colab and got ~1min for a 15GB dataset (only 200 times SQuAD), while it should be instantaneous. The time is spent reading the Apache Arrow table from the memory-mapped file. This might come from a virtual disk management issue. I'm trying to see if I can still speed it up on Colab.

BenoitDalFerro commented 3 years ago

@lhoestq what is Google Colab's disk read speed, and is it possible to introspect it, including whether it is an SSD or HDD ?

tsproisl commented 3 years ago

@lhoestq Thank you! The issue is getting more interesting. The second script is still running, but it's definitely taking much longer than 15 seconds.

tsproisl commented 3 years ago

Okay, here’s the output:

Blocks read 158396
Elapsed time: 529.10s

Also using datasets 1.6.2. Do you have any ideas how to pinpoint the problem?

BenoitDalFerro commented 3 years ago

@lhoestq, @tsproisl mmmh, still writing on my side, about 1h to go. Thinking about it: are your large datasets all monoblock/unsharded? Mine is 335 shards of 1.18 GB each.

tsproisl commented 3 years ago

The 529.10s was a bit too optimistic. I had cancelled the reading process once before running it completely, so the hard drive cache probably did its work.

Here are three consecutive runs:
First run (freshly written to disk): Blocks read 309702, Elapsed time: 1267.74s
Second run (immediately after): Blocks read 113944, Elapsed time: 417.55s
Third run (immediately after): Blocks read 42518, Elapsed time: 199.19s

BenoitDalFerro commented 3 years ago

@lhoestq First test

elapsed time: 11219.05s

Second test running, bear with me. For Windows users, here is a slight trick to modify the original "disk0" string:

First, find the relevant physical-drive key in the dictionary:

import psutil
psutil.disk_io_counters(perdisk=True)

{'PhysicalDrive0': sdiskio(read_count=18453286, write_count=4075333, read_bytes=479546467840, write_bytes=161590275072, read_time=20659, write_time=2464), 'PhysicalDrive1': sdiskio(read_count=1495778, write_count=388781, read_bytes=548628622336, write_bytes=318234849280, read_time=426066, write_time=19085)}

In my case it's PhysicalDrive1

Then insert the relevant key's string as the disk variable:

import time

import psutil
from datasets import load_from_disk

disk = 'PhysicalDrive1'  # You may have to change your disk here
iocnt1 = psutil.disk_io_counters(perdisk=True)[disk]
time1 = time.time()
s4000_reloaded = load_from_disk("your path here")
time2 = time.time()
iocnt2 = psutil.disk_io_counters(perdisk=True)[disk]
print(f"Blocks read {iocnt2.read_count - iocnt1.read_count}")
print(f"Elapsed time: {time2 - time1:.02f}s")

BenoitDalFerro commented 3 years ago

@lhoestq Second test

Blocks read 1265609
Elapsed time: 11216.55s

BenoitDalFerro commented 3 years ago

@lhoestq any luck ?

lhoestq commented 3 years ago

Unfortunately no. Thanks for running the benchmark though: it shows that your machine does a lot of read operations. This is not expected: on other machines it does almost no read operations, which enables very fast loading.

I did some tests on Google Colab and have the same issue. The first time the dataset Arrow file is memory mapped, it always takes a lot of time (the time seems linear with respect to the dataset size). Reloading the dataset is then instantaneous, since the Arrow file has already been memory mapped.

I also tried using the Arrow IPC file format (see #1933) instead of the current streaming format that we use but it didn't help.

Memory mapping is handled by the OS and depends on the disk you're using, so I'm not sure we can do much about it. I'll continue to investigate anyway, because I still don't know why in some cases it would go through the entire file (high Blocks read as in your tests) and in other cases it would do almost no reading.

BenoitDalFerro commented 3 years ago

@lhoestq thanks for the effort, let's stay in touch

gurvindersingh commented 3 years ago

Just want to say that I am seeing the same issue. The dataset size is 268GB and it takes 3 hours to load with load_from_disk, using datasets version 1.9.0. The filesystem underneath is Lustre.

BenoitDalFerro commented 3 years ago

Hi @lhoestq, confirmed this is a Windows issue: the exact same code running on Linux has a total loading time of about 3 minutes.

hwijeen commented 3 years ago

Hmm that's different from what I got. I was on Ubuntu when reporting the initial issue.

arsarabi commented 3 years ago

Hi,

I'm experiencing the same issue: it is taking 40 minutes (with 500GB read from disk) to load a 220GB dataset stored using save_to_disk. I'm loading from an SSD on Ubuntu 18.04, datasets 1.11.0, and pyarrow 5.0.0. After profiling load_from_disk and digging a bit into pyarrow, I've found the following.

I have limited knowledge about memory mapping and the page cache, so I'm not sure if I'm on the right track here. But at least for me, modifying update_metadata_with_features to use pyarrow.Table.replace_schema_metadata reduces the load time to 16 minutes with 220GB read from disk. Modifying/recompiling Arrow to suppress the posix_madvise system calls further reduces the load time to under 4 minutes with 40GB read from disk. I'm not sure if there is an easy fix for the latter without manually modifying the Arrow C++ libraries.
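
For illustration, a minimal pyarrow sketch of the difference (not the actual datasets code; the table and metadata are placeholders):

import pyarrow as pa

table = pa.table({"input_ids": [[1, 2, 3], [4, 5]]})
new_metadata = {b"huggingface": b"{}"}  # placeholder metadata

# cheap: only swaps the schema metadata, the underlying column buffers are shared
table_fast = table.replace_schema_metadata(new_metadata)

# heavier path: build a new schema and cast the table to it, which walks every column/chunk
schema_with_metadata = table.schema.with_metadata(new_metadata)
table_slow = table.cast(schema_with_metadata)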

I've also tried loading the dataset on a MacBook, and while it loads faster with fewer reads (meaning part of the issue is OS-specific), the problem of update_metadata_with_features being slow remains.

lhoestq commented 3 years ago

Hi @arsarabi ! Thanks for investigating, these are useful insights. Good catch about Table.cast, feel free to open a PR to use Table.replace_schema_metadata instead (plus it looks like it's not considered "experimental" anymore since pyarrow 5.0.0)

Regarding the posix_madvise system calls with MADV_WILLNEED, it's probably worth discussing it on the Apache Arrow mailing list.

moinnadeem commented 2 years ago

Hey all,

I've been plagued by this issue when training large language models on C4 (a 2.1TB Arrow file after tokenization), so I wanted to spend some time doing a deep dive.

I took @lhoestq's test script and modified it to 1000 times SQuAD so it could fit on the variety of machines I tested:

import datasets as ds

s = ds.load_dataset("squad", split="train")
s4000 = ds.concatenate_datasets([s] * 1000)
print(ds.utils.size_str(s4000.data.nbytes))  # '73.90 GiB'

s4000.save_to_disk("tmp/squad_4000")

Finally, loading this dataset on my MacBook Pro yielded a similar time to @lhoestq's, confirming his results:

In [3]: import psutil
   ...: import time
   ...: from datasets import load_from_disk
   ...:
   ...: disk = "disk0"  # You may have to change your disk here
   ...: iocnt1 = psutil.disk_io_counters(perdisk=True)[disk]
   ...: time1 = time.time()
   ...:
   ...: s4000_reloaded = load_from_disk("tmp/squad_4000")
   ...:
   ...: time2 = time.time()
   ...: iocnt2 = psutil.disk_io_counters(perdisk=True)[disk]
   ...:
   ...: print(f"Blocks read {iocnt2.read_count - iocnt1.read_count}")  # Blocks read 18
   ...: print(f"Elapsed time: {time2 - time1:.02f}s")  # Elapsed time: 14.60s

Blocks read 9242
Elapsed time: 4.62s

Next, I tried the same experiment on my desktop running Ubuntu 20.04 with an SSD and a 7200 RPM HDD:

Ubuntu SSD results:

In [10]: iocnt1 = psutil.disk_io_counters(perdisk=True)[disk]
    ...: time1 = time.time()
    ...:
    ...: s4000_reloaded = load_from_disk("tmp/squad_4000")
    ...:
    ...: time2 = time.time()
    ...: iocnt2 = psutil.disk_io_counters(perdisk=True)[disk]
    ...:
    ...: print(f"Blocks read {iocnt2.read_count - iocnt1.read_count}")  # Blocks read 18
    ...: print(f"Elapsed time: {time2 - time1:.02f}s")  # Elapsed time: 14.60s
Blocks read 26718
Elapsed time: 29.23s

Ubuntu HDD results:

In [6]: iocnt1 = psutil.disk_io_counters(perdisk=True)[disk]
   ...: time1 = time.time()
   ...:
   ...: s4000_reloaded = load_from_disk("tmp/squad_4000")
   ...:
   ...: time2 = time.time()
   ...: iocnt2 = psutil.disk_io_counters(perdisk=True)[disk]
   ...:
   ...: print(f"Blocks read {iocnt2.read_count - iocnt1.read_count}")  # Blocks read 18
   ...: print(f"Elapsed time: {time2 - time1:.02f}s")  # Elapsed time: 14.60s
Blocks read 24433
Elapsed time: 116.97s

I also tracked the read speed of my HDD and SSD while the file was being read, using iostat. The SSD read the Arrow file at 429 MB/s and the HDD read it at 92 MB/s. This factor correlates with the difference in elapsed time.

Finally, I checked the block size on all platforms, and it is 4096 bytes everywhere. Some questions that I have:

  1. Why does the Ubuntu machine read many more blocks than the OSX machine? It isn't a virtualized disk for this experiment, so memory mapping should be doing something similar.
  2. I thought Arrow files were supposed to be memory mapped by the operating system, which implied to me that the time to load a dataset shouldn't correlate with the number of tokens in the dataset. Is this true? As I increase the size of the dataset (the factor we multiply SQuAD by), the time for load_from_disk increases. This is at the point where it takes ~4 hours to initialize a job that loads a copy of C4, which is very cumbersome to experiment with.

Any thoughts would be appreciated! This is currently a blocker in my workload.

hwijeen commented 2 years ago

Although I hope this issue gets resolved fundamentally, the following could be a workaround.

  1. split your corpus into many small files, say 10GB each.
  2. create one arrow file for each small file
  3. use PyTorch's ConcatDataset to load the resulting datasets:

import datasets
import torch

dataset_list = [datasets.load_from_disk(p) for p in dataset_paths]
for ds in dataset_list:
    ds.set_format("torch")
dataset = torch.utils.data.ConcatDataset(dataset_list)  # use this for training

I did not measure the exact time difference, but in my case this approach made training with 600GB of data feasible.

moinnadeem commented 2 years ago

@hwijeen Ooh, thank you! Although I am curious: if you are loading shards serially (as opposed to in parallel), shouldn't the loading time be the same as loading the entire dataset? I would understand if one were loading the dataset with multiprocessing, but I wanted to check on the code snippet you gave.

hwijeen commented 2 years ago

@moinnadeem It seems the loading time does not increase linearly with the dataset size, as shown in the comment above.

lhoestq commented 2 years ago

Hi !

I thought Arrow files were supposed to be memory mapped by the operating system, which implied to me that the time to load a dataset shouldn't correlate with the number of tokens in the dataset. Is this true? As I increase the size of the dataset (the factor we multiply SQuAD by), the time for load_from_disk increases. This is at the point where it takes ~4 hours to initialize a job that loads a copy of C4, which is very cumbersome to experiment with.

Yes, in theory it shouldn't be correlated with the size of the dataset, in my understanding of pyarrow. Here is the function that is called to load the Arrow table of a dataset. It takes the path to an Arrow file (in the Arrow streaming format) as input:

import pyarrow as pa

def _memory_mapped_arrow_table_from_file(filename: str) -> pa.Table:
    memory_mapped_stream = pa.memory_map(filename)
    opened_stream = pa.ipc.open_stream(memory_mapped_stream)
    pa_table = opened_stream.read_all()
    return pa_table

The part that can sometimes take a long time is the call to opened_stream.read_all(). I'm not sure what pyarrow does at that point to cause this, though. In my opinion we should discuss with the Arrow team to figure out what's happening.

Although we could probably work around this in datasets by systematically sharding/partitioning the datasets on disk, if that helps (see the sketch below).
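
For reference, a minimal sketch of what such on-disk sharding could look like with the existing shard and save_to_disk APIs (the paths and shard count are placeholders):

from datasets import load_from_disk

dataset = load_from_disk("path/to/big_dataset")  # placeholder path

num_shards = 100  # placeholder value
for index in range(num_shards):
    shard = dataset.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.save_to_disk(f"path/to/big_dataset_sharded/shard-{index:05d}")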

alexcoca commented 2 years ago

This has been a very interesting discussion to read. Are there any updates on it? I take it that the best option we have now is to shard our data into multiple datasets and concatenate them as shown above by @hwijeen.

On a more basic note, the huggingface Dataset should play really well with the DistributedSampler object used by the Trainer when doing e.g. DDP, right? I'm hoping that by using the huggingface Dataset, the data loader will just index into the pyarrow table and the dataset won't be loaded in full in each process (but we presumably have to pay the cost of loading the dataset in each process so that the data loader can index into the table on that process)? Then I can use the data collators and other tools available in the library to do multi-node training. Is my intuition correct?

lhoestq commented 2 years ago

This has been a very interesting discussion to read. Are there any updates on it? I take it that the best option we have now is to shard our data into multiple datasets and concatenate them as shown above by @hwijeen.

If this solution proves to help, we can add Arrow file sharding for all big datasets, directly integrated into load_dataset.

I'm hoping that by using the huggingface Dataset, the data loader will just index into the pyarrow table and the dataset won't be loaded in full in each process (but we presumably have to pay the cost of loading the dataset in each process so that the data loader can index into the table on that process)?

Yes your intuition is right :)

VictorSanh commented 2 years ago

By any chance, do we have a better understanding of what's happening?

I am encountering a similar problem: I have an Arrow file produced by HF datasets (shard + save_to_disk) and I am trying to load this dataset/arrow file with datasets.load_from_disk(the_dataset_folder). I noticed that the first time I load it, it is significantly slower than the subsequent times. Two days later, I retry loading it, and it is slow again...

After diving in a little bit, the gap happens in the _memory_mapped_arrow_table_from_file function, and in particular in the call to RecordBatchStreamReader.read_all: https://github.com/huggingface/datasets/blob/158917e24128afbbe0f03ce36ea8cd9f850ea853/src/datasets/table.py#L51. read_all is slow the first time (probably because of some operations that only happen once and are cached for a few hours?), but not the subsequent times.

>>> def _memory_mapped_arrow_table_from_file(filename):
...     memory_mapped_stream = pa.memory_map(filename)
...     opened_stream = pa.ipc.open_stream(memory_mapped_stream)
...     start_time = time.time()
...     _ = opened_stream.read_all()
...     print(f"{time.time()-start_time}")
...
>>> filename_slow = "train/00248-00249/cache-3d25861de64b93b5.arrow"
>>> _memory_mapped_arrow_table_from_file(filename_slow) # First time
0.24040865898132324
>>> _memory_mapped_arrow_table_from_file(filename_slow) # subsequent times
0.0006551742553710938
>>> _memory_mapped_arrow_table_from_file(filename_slow)
0.0006804466247558594
>>> _memory_mapped_arrow_table_from_file(filename_slow)
0.0009818077087402344

My setup:

I realize this might be an Apache Arrow question, so I asked them, but I wanted to leave a message here too.

lhoestq commented 2 years ago

This is a good question to ask Arrow, feel free to share the thread link or the response here :)

VictorSanh commented 2 years ago

For posterity, the answer from the apache arrow folks:

The short answer is no, you cannot remove that discrepancy.

For a memory mapped file, when data is first accessed it is brought into memory. Subsequent reads to that data doesn't require having to go back to disk, because it's already in memory. In your example, you haven't restarted your process so the file data is still in memory for the subsequent reads.

If you want more details about memory mapped files, I think this SO post seems to have some pretty good info.

pauli31 commented 1 year ago

I still have the same issue with the OSCAR dataset... any solution? I was thinking about streaming the dataset, but that's not an option for me right now...

lhoestq commented 1 year ago

In release 2.6 we significantly improved the speed of iterating over contiguous data (from https://github.com/huggingface/datasets/pull/5030). Just make sure to use a for loop or load the dataset as an iterable dataset.

We also opened https://github.com/huggingface/datasets/issues/5265 to be able to switch from a map-style dataset to an iterable dataset and achieve the best throughput.

pauli31 commented 1 year ago

Hey, thank you. I'm using the default Trainer for MLM pre-training. I can load the dataset as an IterableDataset (with the streaming option), but then I cannot use IterableDataset.take() and IterableDataset.skip() to split the dataset (I would like to have some validation data), because the dataset is divided into shards; once I run the training, I get an exception that the take and skip methods cannot be used in that configuration.

lhoestq commented 1 year ago

If you have the dataset locally you can load it as an iterable dataset to get the best performance. Switching from a Dataset to an IterableDataset isn't implemented in datasets yet (see https://github.com/huggingface/datasets/issues/5265 mentioned above), but is pretty easy to do using IterableDataset.from_generator.

You just need to get shards of the dataset and choose some of them for training, and some of them for validation. Here is an example where you shard the dataset in 100 parts and choose the last one to be your validation set:

from datasets import load_dataset, IterableDataset

oscar = load_dataset("oscar", split="train")

# to get the best speed we don't shuffle the dataset before sharding, and we load shards of contiguous data
num_shards = 100
shards = [oscar.shard(num_shards=num_shards, index=index, contiguous=True) for index in range(num_shards)]

def gen_from_shards(shards):
    for shard in shards:
        for example in shard:
            yield example

train_ds = IterableDataset.from_generator(gen_from_shards, gen_kwargs={"shards": shards[:99]})
val_ds = IterableDataset.from_generator(gen_from_shards, gen_kwargs={"shards": shards[99:]})

pauli31 commented 1 year ago

Thank you, I'll try it!

lhoestq commented 1 year ago

Please also make sure to use the latest version of pyarrow to benefit from the best speed, or at least pyarrow 8.0.0 :)

pauli31 commented 1 year ago

Thanks, it worked; splitting the dataset into shards as you suggested helped. But I'm curious why the dataset was so slow in the first place: I do some preprocessing with the Dataset.map() function and it is quite fast (for 34GB of text it takes around 1.5 hours to process the data -- tokenization, chunk merging, etc. -- with 6 workers), but once I use this preprocessed dataset, iteration is significantly slower, about 10x, e.g. 2.2 it/s vs 5 s/batch.

lhoestq commented 1 year ago

Cool !

But I'm curious why the dataset was so slow in the first place: I do some preprocessing with the Dataset.map() function and it is quite fast (for 34GB of text it takes around 1.5 hours to process the data -- tokenization, chunk merging, etc. -- with 6 workers), but once I use this preprocessed dataset, iteration is significantly slower, about 10x, e.g. 2.2 it/s vs 5 s/batch.

When you process an unshuffled dataset with map, you iterate over contiguous chunks of data, which is very fast. You also get the best speed with an iterable dataset, as long as it's based on shards of contiguous data.

This is fast because internally Arrow simply iterates over the record batches.

On the other hand, if you use a map-style dataset in PyTorch, then PyTorch samples uniformly from the files on your disk. This is slower for your disk, and also requires an extra step to get the location of the examples from an index.
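
To make the difference concrete, here is a small sketch contrasting the two access patterns (the path and sample count are placeholders):

import itertools
import random
import time

from datasets import load_from_disk

dataset = load_from_disk("path/to/preprocessed_dataset")  # placeholder path
n = 10_000  # assumes the dataset has at least this many examples

# contiguous access: iterate in order, Arrow walks the record batches sequentially
start = time.time()
for example in itertools.islice(dataset, n):
    pass
print(f"contiguous iteration: {time.time() - start:.2f}s")

# random access: each lookup can land in a different record batch / part of the file
start = time.time()
for i in random.sample(range(len(dataset)), n):
    _ = dataset[i]
print(f"random access: {time.time() - start:.2f}s")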