Closed eugeneware closed 3 years ago
Facing a similar issue here. My model using the SQuAD dataset takes about 1h to process with in-memory data and more than 2h going through datasets directly.
And if you use in-memory data with datasets, via load_dataset(..., keep_in_memory=True)?
Thanks for the tip @thomwolf ! I did not see that flag in the docs. I'll try with that.
We should add it indeed and also maybe a specific section with all the tips for maximal speed. What do you think @lhoestq @SBrandeis @yjernite ?
By default the datasets loaded with load_dataset live on disk. It's possible to load them in memory by using some transforms like .map(..., keep_in_memory=True).
Small correction to @thomwolf 's comment above: currently we don't have the keep_in_memory parameter for load_dataset AFAIK, but it would be nice to add it indeed :)
Yes indeed we should add it!
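(For reference, once such a parameter exists, usage might look like the sketch below; the keep_in_memory name here simply mirrors the existing map argument and is not part of load_dataset at the time of writing.)
from datasets import load_dataset

# Hypothetical: keep_in_memory on load_dataset, mirroring map(..., keep_in_memory=True)
dataset = load_dataset('squad', keep_in_memory=True)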
Great! Thanks a lot.
I did a test using map(..., keep_in_memory=True) and also a test using in-memory-only data.
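(The snippets below assume a dataset and a tokenize function defined beforehand; a minimal sketch of that setup, with an assumed tokenizer choice and column handling, would look like:)
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset('squad')  # assumption: the SQuAD dataset mentioned earlier in the thread
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # assumed tokenizer choice

def tokenize(batch):
    # Tokenize question/context pairs with fixed-length padding so the tensors stack cleanly
    return tokenizer(batch['question'], batch['context'],
                     truncation=True, padding='max_length', max_length=384)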
features = dataset.map(tokenize, batched=True, remove_columns=dataset['train'].column_names)
features.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])
features_in_memory = dataset.map(tokenize, batched=True, keep_in_memory=True, remove_columns=dataset['train'].column_names)
features_in_memory.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])
in_memory = [features['train'][i] for i in range(len(features['train']))]
Using the features without any tweak, it took 1min17s to iterate the entire DataLoader and copy every batch to CUDA:
%%time
for i, batch in enumerate(DataLoader(features['train'], batch_size=16, num_workers=4)):
    batch['input_ids'].to(device)
Using the features mapped with keep_in_memory=True, it also took 1min17s to copy the entire DataLoader to CUDA:
%%time
for i, batch in enumerate(DataLoader(features_in_memory['train'], batch_size=16, num_workers=4)):
    batch['input_ids'].to(device)
And for the case where every element is held in memory, converted from the original dataset, it took 12.5s:
%%time
for i, batch in enumerate(DataLoader(in_memory, batch_size=16, num_workers=4)):
    batch['input_ids'].to(device)
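(As a note, %%time is IPython-specific; outside a notebook, an equivalent timing of the same loop, reusing features and device from above, could be a simple sketch like:)
import time
from torch.utils.data import DataLoader

start = time.perf_counter()
for i, batch in enumerate(DataLoader(features['train'], batch_size=16, num_workers=4)):
    batch['input_ids'].to(device)  # same work as the cells above
print(f"elapsed: {time.perf_counter() - start:.1f}s")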
Taking a closer look at my SQuAD code with a profiler, I see a lot of calls to the posix read API. It seems that it is really relying on disk, which results in a very long training time.
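(One rough way to see how much reading is going on, not the profiler run described above but a sketch using psutil on Linux, would be:)
import psutil
from torch.utils.data import DataLoader

proc = psutil.Process()
before = proc.io_counters()
# num_workers=0 so the reads happen in this process and show up in io_counters
for batch in DataLoader(features['train'], batch_size=16, num_workers=0):
    batch['input_ids'].to(device)
after = proc.io_counters()
print(f"read syscalls: {after.read_count - before.read_count}, "
      f"bytes read: {after.read_bytes - before.read_bytes}")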
I am having the same issue here. When loading from memory I can get the GPU up to 70% util but when loading after mapping I can only get 40%.
On disk:
book_corpus = load_dataset('bookcorpus', 'plain_text', cache_dir='/home/ad/Desktop/bookcorpus', split='train[:20%]')
book_corpus = book_corpus.map(encode, batched=True, num_proc=20, load_from_cache_file=True, batch_size=2500)
book_corpus.set_format(type='torch', columns=['text', "input_ids", "attention_mask", "token_type_ids"])
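(The encode function used above is not shown in the comment; a plausible sketch, assuming a standard tokenizer inferred from the MobileBERT output_dir below, would be:)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('google/mobilebert-uncased')  # assumption, inferred from the output_dir below

def encode(batch):
    # Tokenize raw text with fixed-length padding so set_format can return stackable tensors
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=128)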
training_args = TrainingArguments(
    output_dir="./mobile_bert_big",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    save_steps=50,
    save_total_limit=2,
    logging_first_step=True,
    warmup_steps=100,
    logging_steps=50,
    eval_steps=100,
    no_cuda=False,
    gradient_accumulation_steps=16,
    fp16=True)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=book_corpus,
    tokenizer=tokenizer)
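(model and data_collator are likewise not shown in the comment; a sketch of what they might be, guessing MobileBERT trained from scratch with masked-language modelling and reusing the tokenizer from the sketch above, is:)
from transformers import MobileBertConfig, MobileBertForMaskedLM, DataCollatorForLanguageModeling

model = MobileBertForMaskedLM(MobileBertConfig())  # assumption: a MobileBERT model trained from scratch
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)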
On disk I can only get 0.17 it/s:
[ 13/28907 01:03 < 46:03:27, 0.17 it/s, Epoch 0.00/1]
If instead I load it with a custom torch.utils.data.Dataset:
class BCorpusDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = [torch.tensor(val[idx]) for key, val in self.encodings.items()][0]
        return item

    def __len__(self):
        length = [len(val) for key, val in self.encodings.items()][0]
        return length
book_corpus = book_corpus.select([i for i in range(16*2000)])  # filtering so as not to have 20% of BC in memory...
book_corpus = BCorpusDataset(book_corpus)
I can get:
[ 5/62 00:09 < 03:03, 0.31 it/s, Epoch 0.06/1]
But obviously I cannot fit BookCorpus in memory xD
EDIT: something weird is going on. If I load 1% of bookcorpus from disk:
book_corpus = load_dataset('bookcorpus', 'plain_text', cache_dir='/home/ad/Desktop/bookcorpus', split='train[:1%]')
I can get 0.28 it/s (the same as in memory), but if I load 20% of bookcorpus:
book_corpus = load_dataset('bookcorpus', 'plain_text', cache_dir='/home/ad/Desktop/bookcorpus', split='train[:20%]')
I get 0.17 it/s again.
Am I missing something? I think it is related to the dataset size, not to disk vs. in-memory.
Is there a way to increase the number of batches read from memory, or to read them with multiple processes? I think one of two things is happening: either it is reading with just one core, or it is reading very small chunks from disk, leaving my GPU at 0% utilization between batches.
My fault! I had not seen the dataloader_num_workers argument in TrainingArguments! Now I can parallelize and go fast! Sorry, and thanks.
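(For anyone who hits the same thing, the change amounts to passing dataloader_num_workers to TrainingArguments; the values below are illustrative:)
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./mobile_bert_big",
    per_device_train_batch_size=32,
    dataloader_num_workers=8,  # number of subprocesses the Trainer's DataLoader uses to fetch batches
    fp16=True)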
I've been very excited about this amazing datasets project. However, I've noticed that the performance can be substantially slower than using an in-memory dataset.
Now, this is expected I guess, due to memory mapping data using arrow files, and you don't get anything for free. But I was surprised by how much slower it is.
For example, on the yelp_polarity dataset (560,000 datapoints, or 17,500 batches of 32), it was taking me 3:31 just to process the data and get it onto the GPU (no model involved), whereas the equivalent in-memory dataset would finish in just 0:33. Is this expected? Given that one of the goals of this project is also to accelerate dataset processing, this seems slower than I would expect. I understand the advantages of being able to work on datasets that exceed memory, and that's very exciting to me, but I thought I'd open this issue to discuss.
For reference, I'm running an AMD Ryzen Threadripper 1900X 8-core CPU with 128 GB of RAM and a Samsung 960 EVO NVMe SSD, along with an RTX Titan 24 GB GPU.
I can see with iotop that the dataset gets quickly loaded into the system read buffers, and thus doesn't incur any additional IO reads. So in theory all the data should be in RAM, but in my benchmark code below it's still 6.4 times slower. What am I doing wrong? And is there a way to force the datasets to completely load into memory instead of being memory mapped, in cases where you want maximum performance?
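(By an in-memory copy I mean something along the lines of materializing the whole split as ordinary Python objects; a rough sketch rather than the exact benchmark code:)
from datasets import load_dataset

dataset = load_dataset('yelp_polarity', split='train')
in_memory = dataset[:]  # slicing copies the split out of Arrow into plain Python lists, fully resident in RAM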
At 3:31 for 17,500 batches, that's 12ms per batch. Does this 12ms just become insignificant as a proportion of the forward and backward passes in practice, and thus not worth worrying about?
In any case, here's my code in benchmark.py. If you run it with an argument of memory, it will copy the data into memory before executing the same test. For reference, my conda environment is here.
Once again, I'm very excited about this library, and how easy it is to load datasets, and to do so without worrying about system memory constraints.
Thanks for all your great work.