Closed timothyjlaurent closed 1 year ago
It looks like an issue with wandb/tqdm here.
We're using the multiprocess library instead of Python's built-in multiprocessing package to support various types of mapping functions. Maybe there's some sort of incompatibility.
Could you share a minimal script or a Google Colab that reproduces the issue?
Hi, I'm facing the same issue here:
`AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/lib/python3.6/logging/__init__.py", line 996, in emit
stream.write(msg)
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
cb(name, data)
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 723, in _console_callback
self._backend.interface.publish_output(name, data)
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/interface.py", line 153, in publish_output
self._publish_output(o)
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/interface.py", line 158, in _publish_output
self._publish(rec)
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/interface.py", line 456, in _publish
if self._process and not self._process.is_alive():
File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in max_length
is provided a specific value, "
File "/usr/lib/python3.6/logging/__init__.py", line 1320, in warning
self._log(WARNING, msg, args, kwargs)
File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
self.handle(record)
File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
self.callHandlers(record)
File "/usr/lib/python3.6/logging/__init__.py", line 1516, in callHandlers
hdlr.handle(record)
File "/usr/lib/python3.6/logging/__init__.py", line 865, in handle
self.emit(record)
File "/usr/lib/python3.6/logging/__init__.py", line 1000, in emit
self.handleError(record)
File "/usr/lib/python3.6/logging/__init__.py", line 917, in handleError
sys.stderr.write('--- Logging error ---\n')
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
cb(name, data)
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 723, in _console_callback
self._backend.interface.publish_output(name, data)
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/interface.py", line 153, in publish_output
self._publish_output(o)
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/interface.py", line 158, in _publish_output
self._publish(rec)
File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/interface/interface.py", line 456, in _publish
if self._process and not self._process.is_alive():
File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process`
It looks like this warning, "Truncation was not explicitly activated but max_length is provided a specific value, ", is not handled well by wandb.
The error occurs when calling the tokenizer.
Maybe you can try specifying truncation=True when calling the tokenizer, so that the warning is no longer emitted?
Otherwise I don't know why wandb would fail on a warning. Maybe one of its logging handlers has an issue with the tokenizers' logging. Maybe @n1t0 knows more about this?
I'm having a similar issue, but it happens when I try to use multiprocessing with the DataLoader.
Code to reproduce:
# Note: `encode`, `tokenizer`, and `model` are defined elsewhere (not shown).
from datasets import load_dataset

book_corpus = load_dataset('bookcorpus', 'plain_text', cache_dir='/home/ad/Desktop/bookcorpus', split='train[:1%]')
book_corpus = book_corpus.map(encode, batched=True, num_proc=20, load_from_cache_file=True, batch_size=5000)
book_corpus.set_format(type='torch', columns=['text', "input_ids", "attention_mask", "token_type_ids"])

from transformers import DataCollatorForWholeWordMask
from transformers import Trainer, TrainingArguments

data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./mobile_linear_att_8L_128_128_03layerdrop_shared",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=50,
    save_total_limit=2,
    logging_first_step=True,
    warmup_steps=100,
    logging_steps=50,
    gradient_accumulation_steps=1,
    fp16=True,
    dataloader_num_workers=10,  # the multi-worker DataLoader setting that triggers the error
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=book_corpus,
    tokenizer=tokenizer)

trainer.train()
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<timed eval> in <module>
~/anaconda3/envs/tfm/lib/python3.6/site-packages/transformers/trainer.py in train(self, model_path, trial)
869 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
870
--> 871 for step, inputs in enumerate(epoch_iterator):
872
873 # Skip past any already trained steps if resuming training
~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
433 if self._sampler_iter is None:
434 self._reset()
--> 435 data = self._next_data()
436 self._num_yielded += 1
437 if self._dataset_kind == _DatasetKind.Iterable and \
~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _next_data(self)
1083 else:
1084 del self._task_info[idx]
-> 1085 return self._process_data(data)
1086
1087 def _try_put_index(self):
~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_data(self, data)
1109 self._try_put_index()
1110 if isinstance(data, ExceptionWrapper):
-> 1111 data.reraise()
1112 return data
1113
~/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/_utils.py in reraise(self)
426 # have message field
427 raise self.exc_type(message=msg)
--> 428 raise self.exc_type(msg)
429
430
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1087, in __getitem__
format_kwargs=self._format_kwargs,
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1074, in _getitem
format_kwargs=format_kwargs,
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 890, in _convert_outputs
v = map_nested(command, v, **map_nested_kwargs)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/utils/py_utils.py", line 225, in map_nested
return function(data_struct)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 851, in command
return torch.tensor(x, **format_kwargs)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/warnings.py", line 101, in _showwarnmsg
_showwarnmsg_impl(msg)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/warnings.py", line 30, in _showwarnmsg_impl
file.write(text)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
cb(name, data)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/wandb_run.py", line 723, in _console_callback
self._backend.interface.publish_output(name, data)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 153, in publish_output
self._publish_output(o)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 158, in _publish_output
self._publish(rec)
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py", line 456, in _publish
if self._process and not self._process.is_alive():
File "/home/ad/anaconda3/envs/tfm/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
As a workaround, I have commented out lines 456 and 457 in /home/ad/anaconda3/envs/tfm/lib/python3.6/site-packages/wandb/sdk/interface/interface.py.
Isn't it rather the pytorch warning about using non-writable memory for a tensor that triggers this here, @lhoestq? (It seems to be a warning raised in torch.tensor().)
Yep, this time it is a warning from pytorch that causes wandb to not work properly. Could this be a wandb issue?
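One possible stopgap, sketched here under assumptions, is to filter that warning out so nothing is written to the redirected stderr in the first place. The message pattern below assumes the pytorch warning text contains "NumPy array is not writeable"; check the exact wording in your pytorch version. It also assumes DataLoader workers are forked, so they inherit the filter installed in the main process.

```python
import warnings

# Assumed fragment of the pytorch warning text; adjust to match your version.
NON_WRITEABLE_PATTERN = r".*NumPy array is not writeable.*"


def silence_non_writeable_warning():
    """Ignore UserWarnings whose message matches NON_WRITEABLE_PATTERN."""
    warnings.filterwarnings(
        "ignore", message=NON_WRITEABLE_PATTERN, category=UserWarning
    )


if __name__ == "__main__":
    silence_non_writeable_warning()
    # Synthetic warning with the assumed text: this one is filtered out.
    warnings.warn("The given NumPy array is not writeable, ...", UserWarning)
    # Unrelated warnings are still shown.
    warnings.warn("some other warning", UserWarning)
```

This only hides the trigger; any other output written from a worker process would presumably hit the same code path in wandb.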
Hi @timothyjlaurent @gaceladri
If you're running transformers from master, could you set the env var WANDB_DISABLED=true (from https://github.com/huggingface/transformers/pull/9896) and try again?
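A small sketch of setting that variable from inside a notebook or script rather than in the shell. It assumes the variable transformers reads is spelled WANDB_DISABLED, and that it is enough to set it before the Trainer is created.

```python
import os

# Assumption: transformers checks WANDB_DISABLED when it sets up its
# wandb integration, so this has to run before the Trainer is built.
os.environ["WANDB_DISABLED"] = "true"

# ... only after this point: build TrainingArguments / Trainer and call train().
```

The shell equivalent is to export the variable before launching Python or Jupyter.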
This issue might be related to https://github.com/huggingface/transformers/issues/9623
I have commented out the lines that caused my code to break. I'm now seeing my reports on wandb and my code no longer breaks. I'm training now, so I will probably check again in 6 hours. I suppose that disabling wandb would work as well.
This seems to be a bug in wandb (see https://github.com/wandb/wandb/issues/1994).
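The assertion at the bottom of both tracebacks comes from the standard library: Process.is_alive() may only be called from the process that started the Process object. The sketch below reproduces that with plain multiprocessing, without wandb or pytorch. The function names are hypothetical, "backend" merely plays the role of wandb's background process and "worker" that of a DataLoader worker, and the "fork" start method is assumed to be available (POSIX only).

```python
import multiprocessing as mp


def _idle():
    """Stand-in for a long-running background process (hypothetical)."""


def _probe(process, results):
    """Runs in a sibling process and asks whether `process` is alive."""
    try:
        results.put(f"alive={process.is_alive()}")
    except AssertionError as exc:
        # is_alive() asserts that the caller is the parent of `process`.
        results.put(str(exc))


def reproduce():
    ctx = mp.get_context("fork")  # the forked worker inherits `backend` as-is
    backend = ctx.Process(target=_idle)
    backend.start()

    results = ctx.Queue()
    worker = ctx.Process(target=_probe, args=(backend, results))
    worker.start()

    message = results.get(timeout=30)
    worker.join()
    backend.join()
    return message


if __name__ == "__main__":
    print(reproduce())
```

In the tracebacks above the same thing happens indirectly: a warning is written to the stderr that wandb has redirected, the write callback runs inside the DataLoader worker, and that callback calls is_alive() on a process the worker did not start.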
Using a dataset with a single 'text' field and a fast tokenizer in a jupyter notebook.