huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Apache License 2.0

multiprocessing in dataset map "can only test a child process" #847

Closed timothyjlaurent closed 1 year ago

timothyjlaurent commented 3 years ago

I'm using a dataset with a single 'text' field and a fast tokenizer in a Jupyter notebook.

def tokenizer_fn(example):
    return tokenizer.batch_encode_plus(example['text'])

ds_tokenized =, batched=True, num_proc=6, remove_columns=['text'])
RemoteTraceback                           Traceback (most recent call last)
Traceback (most recent call last):
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/multiprocess/", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/datasets/", line 156, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/datasets/", line 163, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/datasets/", line 1510, in _map_single
    for i in pbar:
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/", line 228, in __iter__
    for obj in super(tqdm_notebook, self).__iter__(*args, **kwargs):
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/", line 1186, in __iter__
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/", line 251, in close
    super(tqdm_notebook, self).close(*args, **kwargs)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/", line 1291, in close
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/tqdm/", line 1288, in fp_write
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/lib/", line 91, in new_write
    cb(name, data)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/", line 598, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/interface/", line 146, in publish_output
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/interface/", line 151, in _publish_output
  File "/home/jovyan/share/users/tlaurent/invitae-bert/ve/lib/python3.6/site-packages/wandb/sdk/interface/", line 431, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.6/multiprocessing/", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

lhoestq commented 3 years ago

It looks like an issue with wandb/tqdm here. We're using the multiprocess library instead of Python's built-in multiprocessing package to support various types of mapping functions. Maybe there's some sort of incompatibility.

Could you share a minimal script or a Google Colab that reproduces this?
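
For reference, the assertion at the bottom of the traceback comes from the standard library: Process.is_alive() checks that it is being called from the process that created the Process object. The sketch below triggers the same message without wandb or datasets. It assumes a POSIX system where the "fork" start method is available, and the names sleeper, probe and reproduce are made up for illustration; it is not how wandb itself is structured.

```python
import multiprocessing as mp
import time


def sleeper():
    # Stand-in for a long-lived helper process owned by the parent.
    time.sleep(0.5)


def probe(handle, conn):
    # Runs in a *different* process than the one that created `handle`,
    # so handle.is_alive() fails its parent-pid assertion.
    try:
        handle.is_alive()
        conn.send("no error")
    except AssertionError as exc:
        conn.send(str(exc))
    finally:
        conn.close()


def reproduce():
    ctx = mp.get_context("fork")  # assumption: POSIX only
    owned = ctx.Process(target=sleeper)
    owned.start()

    parent_conn, child_conn = ctx.Pipe()
    # The forked worker inherits the `owned` handle, much like a map() or
    # DataLoader worker inherits objects created in the notebook process.
    prober = ctx.Process(target=probe, args=(owned, child_conn))
    prober.start()
    child_conn.close()  # parent keeps only its own end of the pipe

    message = parent_conn.recv()
    prober.join()
    owned.join()
    return message


if __name__ == "__main__":
    print(reproduce())
```

On a standard CPython build without the -O flag, reproduce() should return the same text as the error above. This suggests that any object holding a Process handle and checking it from inside a worker could hit the same assertion.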

aiswaryasankar commented 3 years ago

Hi, I'm facing the same issue here:

AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/lib/python3.6/logging/", line 996, in emit
    stream.write(msg)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/lib/", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/", line 723, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/", line 153, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/", line 158, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/", line 456, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.6/multiprocessing/", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "", line 20, in __getitem__
    return_token_type_ids=True
  File "/usr/local/lib/python3.6/dist-packages/transformers/", line 2405, in encode_plus
    kwargs,
  File "/usr/local/lib/python3.6/dist-packages/transformers/", line 2125, in _get_padding_truncation_strategies
    "Truncation was not explicitly activated but max_length is provided a specific value, "
  File "/usr/lib/python3.6/logging/", line 1320, in warning
    self._log(WARNING, msg, args, kwargs)
  File "/usr/lib/python3.6/logging/", line 1444, in _log
    self.handle(record)
  File "/usr/lib/python3.6/logging/", line 1454, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.6/logging/", line 1516, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.6/logging/", line 865, in handle
    self.emit(record)
  File "/usr/lib/python3.6/logging/", line 1000, in emit
    self.handleError(record)
  File "/usr/lib/python3.6/logging/", line 917, in handleError
    sys.stderr.write('--- Logging error ---\n')
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/lib/", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/", line 723, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/", line 153, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/", line 158, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/", line 456, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.6/multiprocessing/", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

lhoestq commented 3 years ago

It looks like this warning: "Truncation was not explicitly activated but max_length is provided a specific value, " is not handled well by wandb.

The error occurs when calling the tokenizer. Maybe you can try specifying truncation=True when calling the tokenizer to remove the warning? Otherwise, I don't know why wandb would fail on a warning. Maybe one of its logging handlers has an issue with the logging done by tokenizers. Maybe @n1t0 knows more about this?

gaceladri commented 3 years ago

I'm having a similar issue, but it happens when I try to do multiprocessing with the DataLoader.

Code to reproduce:

from datasets import load_dataset

book_corpus = load_dataset('bookcorpus', 'plain_text', cache_dir='/home/ad/Desktop/bookcorpus', split='train[:1%]')
book_corpus =, batched=True, num_proc=20, load_from_cache_file=True, batch_size=5000)
book_corpus.set_format(type='torch', columns=['text', "input_ids", "attention_mask", "token_type_ids"])

from transformers import DataCollatorForWholeWordMask
from transformers import Trainer, TrainingArguments

data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(

trainer = Trainer(

AssertionError                            Traceback (most recent call last)
<timed eval> in <module>

~/anaconda3/envs/tfm/lib/python3.6/site-packages/transformers/ in train(self, model_path, trial)
    869             self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
--> 871             for step, inputs in enumerate(epoch_iterator):
    873                 # Skip past any already trained steps if resuming training

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/ in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/ in _next_data(self)
   1083             else:
   1084                 del self._task_info[idx]
-> 1085                 return self._process_data(data)
   1087     def _try_put_index(self):

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/ in _process_data(self, data)
   1109         self._try_put_index()
   1110         if isinstance(data, ExceptionWrapper):
-> 1111             data.reraise()
   1112         return data

~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/ in reraise(self)
    426             # have message field
    427             raise self.exc_type(message=msg)
--> 428         raise self.exc_type(msg)

AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/", line 1087, in __getitem__
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/", line 1074, in _getitem
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/", line 890, in _convert_outputs
    v = map_nested(command, v, **map_nested_kwargs)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/utils/", line 225, in map_nested
    return function(data_struct)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/", line 851, in command
    return torch.tensor(x, **format_kwargs)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/", line 101, in _showwarnmsg
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/", line 30, in _showwarnmsg_impl
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/lib/", line 100, in new_write
    cb(name, data)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/", line 723, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/", line 153, in publish_output
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/", line 158, in _publish_output
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/", line 456, in _publish
    if self._process and not self._process.is_alive():
  File "/home/ad/anaconda3/envs/tfm/lib/python3.6/multiprocessing/", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

As a workaround, I have commented out lines 456 and 457 in /home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/

thomwolf commented 3 years ago

Isn't it rather the PyTorch warning about using non-writable memory for a tensor that triggers this here, @lhoestq? (It seems to be a warning triggered in torch.tensor().)

lhoestq commented 3 years ago

Yep, this time it's a warning from PyTorch that causes wandb to not work properly. Could this be a wandb issue?
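
Since both tracebacks start with a warning being written from inside a worker process, one possible stopgap is to silence warnings in the workers so nothing reaches the patched stderr. The sketch below is only an illustration: quiet_worker_init is a hypothetical helper, and the "transformers" logger name is an assumption about how the library names its loggers. It hides the trigger and does not fix the underlying problem.

```python
import logging
import warnings


def quiet_worker_init(worker_id: int) -> None:
    """Silence warning output inside a worker process.

    Hypothetical helper, meant to be passed as `worker_init_fn` to
    torch.utils.data.DataLoader, which calls it once per worker with
    the worker id.
    """
    # Drop Python-level warnings (such as the one raised in torch.tensor()).
    warnings.filterwarnings("ignore")
    # Raise the threshold for library log messages.
    # Assumption: the library logs under the "transformers" logger name.
    logging.getLogger("transformers").setLevel(logging.ERROR)


if __name__ == "__main__":
    quiet_worker_init(0)
    warnings.warn("this warning is filtered out")
```

With a plain DataLoader this would be used as DataLoader(dataset, num_workers=4, worker_init_fn=quiet_worker_init). Errors and tqdm output written from a worker would still go through the same path.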

lhoestq commented 3 years ago

Hi @timothyjlaurent @gaceladri, if you're running transformers from master, you can try setting the env var WANDB_DISABLED=true (from and try again? This issue might be related to
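
In a notebook, the variable can be set from Python instead of the shell. The name that transformers checks is WANDB_DISABLED. This is a minimal sketch, assuming the flag is set before transformers is imported and before the Trainer is created, so that the wandb integration sees it.

```python
import os

# Set the flag first, so the wandb integration is skipped when the
# Trainer sets up its reporting callbacks.
os.environ["WANDB_DISABLED"] = "true"

# Import transformers only after the flag is set, for example:
# from transformers import Trainer, TrainingArguments
```

The equivalent in a shell is to export WANDB_DISABLED=true before launching the notebook server or script.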

gaceladri commented 3 years ago

I have commented out the lines that caused my code to break. I'm now seeing my reports on wandb and my code no longer breaks. I am training now, so I will probably check in 6 hours. I suppose that disabling wandb will work as well.

mariosasko commented 1 year ago

This seems to be a bug in wandb (see