huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

multiprocessing in dataset map "can only test a child process" #847

Closed timothyjlaurent closed 1 year ago

timothyjlaurent commented 3 years ago

Using a dataset with a single 'text' field and a fast tokenizer in a Jupyter notebook:

def tokenizer_fn(example):
    return tokenizer.batch_encode_plus(example['text'])

ds_tokenized = text_dataset.map(tokenizer_fn, batched=True, num_proc=6, remove_columns=['text'])
---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/multiprocess/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 156, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/datasets/fingerprint.py", line 163, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1510, in _map_single
    for i in pbar:
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/notebook.py", line 228, in __iter__
    for obj in super(tqdm_notebook, self).__iter__(*args, **kwargs):
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/std.py", line 1186, in __iter__
    self.close()
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/notebook.py", line 251, in close
    super(tqdm_notebook, self).close(*args, **kwargs)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/std.py", line 1291, in close
    fp_write('')
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/std.py", line 1288, in fp_write
    self.fp.write(_unicode(s))
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/lib/redirect.py", line 91, in new_write
    cb(name, data)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/wandb_run.py", line 598, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 146, in publish_output
    self._publish_output(o)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 151, in _publish_output
    self._publish(rec)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 431, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
"""
lhoestq commented 3 years ago

It looks like an issue with wandb/tqdm here. We're using the multiprocess library instead of Python's built-in multiprocessing package to support a wider range of mapping functions. Maybe there's some sort of incompatibility.

Could you make a minimal script to reproduce the issue, or a Google Colab?
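
Something along these lines should hit the same code path (just a sketch: the model name and dataset content are placeholders, and it assumes a wandb run is active so that console output is redirected):

import wandb
from datasets import Dataset
from transformers import AutoTokenizer

wandb.init(project="repro")  # wandb redirects stdout/stderr, including tqdm output

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
text_dataset = Dataset.from_dict({"text": ["some text"] * 1000})

def tokenizer_fn(example):
    return tokenizer.batch_encode_plus(example["text"])

# num_proc > 1 runs the map in worker processes spawned via `multiprocess`
ds_tokenized = text_dataset.map(tokenizer_fn, batched=True, num_proc=2, remove_columns=["text"])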

aiswaryasankar commented 3 years ago

Hi, I'm facing the same issue here:

AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/lib/python3.6/logging/__init__.py", line 996, in emit
    stream.write(msg)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 723, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/interface.py", line 153, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/interface.py", line 158, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/interface.py", line 456, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "", line 20, in __getitem__
    return_token_type_ids=True
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py", line 2405, in encode_plus
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py", line 2125, in _get_padding_truncation_strategies
    "Truncation was not explicitly activated but max_length is provided a specific value, "
  File "/usr/lib/python3.6/logging/__init__.py", line 1320, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
    self.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1516, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 865, in handle
    self.emit(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1000, in emit
    self.handleError(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 917, in handleError
    sys.stderr.write('--- Logging error ---\n')
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 723, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/interface.py", line 153, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/interface.py", line 158, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/interface.py", line 456, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

lhoestq commented 3 years ago

It looks like this warning: "Truncation was not explicitly activated but max_length is provided a specific value, " is not handled well by wandb.

The error occurs when calling the tokenizer. Maybe you can try specifying truncation=True when calling the tokenizer to remove the warning? Otherwise I don't know why wandb would fail on a warning. Maybe one of its logging handlers has issues with the tokenizers' logging. Maybe @n1t0 knows more about this?
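
For example, something along these lines (just a sketch; the max_length value and the `text` variable are placeholders, and `tokenizer` is assumed to be the tokenizer used in your __getitem__):

text = "an example sentence"
encoded = tokenizer.encode_plus(
    text,
    max_length=128,
    truncation=True,  # explicitly enable truncation so the warning is never emitted
    return_token_type_ids=True,
)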

gaceladri commented 3 years ago

I'm having a similar issue, but when I use multiprocessing via the DataLoader.

Code to reproduce:

from datasets import load_dataset

book_corpus = load_dataset('bookcorpus', 'plain_text', cache_dir='/home/ad/Desktop/bookcorpus', split='train[:1%]')
book_corpus = book_corpus.map(encode, batched=True, num_proc=20, load_from_cache_file=True, batch_size=5000)
book_corpus.set_format(type='torch', columns=['text', "input_ids", "attention_mask", "token_type_ids"])

from transformers import DataCollatorForWholeWordMask
from transformers import Trainer, TrainingArguments

data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./mobile_linear_att_8L_128_128_03layerdrop_shared",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=50,
    save_total_limit=2,
    logging_first_step=True,
    warmup_steps=100,
    logging_steps=50,
    gradient_accumulation_steps=1,
    fp16=True,
    dataloader_num_workers=10,  # <- the DataLoader multiprocessing that triggers the error
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=book_corpus,
    tokenizer=tokenizer)

trainer.train()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<timed eval> in <module>

~/anaconda3/envs/tfm/lib/python3.6/site-packages/transformers/trainer.py in train(self, model_path, trial)
    869             self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
    870 
--> 871             for step, inputs in enumerate(epoch_iterator):
    872 
    873                 # Skip past any already trained steps if resuming training

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _next_data(self)
   1083             else:
   1084                 del self._task_info[idx]
-> 1085                 return self._process_data(data)
   1086 
   1087     def _try_put_index(self):

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_data(self, data)
   1109         self._try_put_index()
   1110         if isinstance(data, ExceptionWrapper):
-> 1111             data.reraise()
   1112         return data
   1113 

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/_utils.py in reraise(self)
    426             # have message field
    427             raise self.exc_type(message=msg)
--> 428         raise self.exc_type(msg)
    429 
    430 

AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1087, in __getitem__
    format_kwargs=self._format_kwargs,
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1074, in _getitem
    format_kwargs=format_kwargs,
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 890, in _convert_outputs
    v = map_nested(command, v, **map_nested_kwargs)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/utils/py_utils.py", line 225, in map_nested
    return function(data_struct)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 851, in command
    return torch.tensor(x, **format_kwargs)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/warnings.py", line 101, in _showwarnmsg
    _showwarnmsg_impl(msg)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/warnings.py", line 30, in _showwarnmsg_impl
    file.write(text)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/wandb_run.py", line 723, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 153, in publish_output
    self._publish_output(o)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 158, in _publish_output
    self._publish(rec)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 456, in _publish
    if self._process and not self._process.is_alive():
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

As a workaround I have commented out lines 456 and 457 in /home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py.

thomwolf commented 3 years ago

Isn't it rather the PyTorch warning about the use of non-writable memory for tensors that triggers this here, @lhoestq? (It seems to be a warning triggered in torch.tensor().)

lhoestq commented 3 years ago

Yep, this time it's a warning from PyTorch that causes wandb to not work properly. Could this be a wandb issue?
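
If that's the cause, keeping the warning out of the redirected stderr should avoid wandb's console hook in the workers; a rough sketch (the exact warning text is an assumption on my side and may differ between torch versions):

import warnings

# Hypothetical workaround: silence the PyTorch warning about non-writable NumPy
# arrays before building the DataLoader/Trainer so the worker processes never
# write it to the wandb-redirected stderr.
warnings.filterwarnings("ignore", message=r"The given NumPy array is not writ")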

lhoestq commented 3 years ago

Hi @timothyjlaurent @gaceladri, if you're running transformers from master, can you try setting the env var WANDB_DISABLED=true (from https://github.com/huggingface/transformers/pull/9896) and trying again? This issue might be related to https://github.com/huggingface/transformers/issues/9623.
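
For example, at the top of the notebook, before the Trainer is created:

import os

os.environ["WANDB_DISABLED"] = "true"  # tells transformers to skip the wandb callback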

gaceladri commented 3 years ago

I have commented out the lines that caused my code to break. I'm now seeing my reports on wandb and my code does not break. I am training now, so I will probably check in about 6 hours. I suppose that disabling wandb will work as well.

mariosasko commented 1 year ago

This seems to be a bug in wandb (see https://github.com/wandb/wandb/issues/1994).