jrzaurin / pytorch-widedeep

A flexible package for multimodal-deep-learning to combine tabular data with text and images using Wide and Deep models in Pytorch
Apache License 2.0

AssertionError: daemonic processes are not allowed to have children #230

Closed: davidfstein closed this issue 2 months ago

davidfstein commented 2 months ago

When training with TrainerFromFolder as follows:

from pytorch_widedeep.models import WideDeep
from pytorch_widedeep.training import TrainerFromFolder

# deepdense, basic_rnn, train_loader and val_loader are defined earlier
model = WideDeep(
    deeptabular=deepdense,
    deeptext=basic_rnn,
)

trainer = TrainerFromFolder(
    model,
    objective="binary",
)

trainer.fit(
    train_loader=train_loader,
    eval_loader=val_loader,
    device='gpu'
)

I'm running into:

 0%|                                                  | 0/23572 [00:00<?, ?it/s]/home/david/micromamba/envs/v2p2_train/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=103313) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
  0%|                                                  | 0/23572 [04:11<?, ?it/s]

AssertionError                            Traceback (most recent call last)
Cell In[17], line 1
----> 1 trainer.fit(
      2     train_loader=train_loader,
      3     eval_loader=val_loader,
      4     device='gpu'
      5     # finetune=True,
      6     # finetune_epochs=1,
      7 )

File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/utils/general_utils.py:12, in alias.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
     10         kwargs[original_name] = kwargs.pop(alt_name)
     11         break
---> 12 return func(*args, **kwargs)

File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/training/trainer_from_folder.py:260, in TrainerFromFolder.fit(self, train_loader, eval_loader, n_epochs, validation_freq, finetune, **kwargs)
    258 self.train_running_loss = 0.0
    259 with trange(train_steps, disable=self.verbose != 1) as t:
--> 260     for batch_idx, (data, targett) in zip(t, train_loader):
    261         t.set_description("epoch %i" % (epoch + 1))
    262         train_score, train_loss = self._train_step(
    263             data, targett, batch_idx, epoch
    264         )

File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/dataloader.py:630, in _BaseDataLoaderIter.__next__(self)
    627 if self._sampler_iter is None:
    628     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    629     self._reset()  # type: ignore[call-arg]
--> 630 data = self._next_data()
    631 self._num_yielded += 1
    632 if self._dataset_kind == _DatasetKind.Iterable and \
    633         self._IterableDataset_len_called is not None and \
    634         self._num_yielded > self._IterableDataset_len_called:

File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/dataloader.py:1344, in _MultiProcessingDataLoaderIter._next_data(self)
   1342 else:
   1343     del self._task_info[idx]
-> 1344     return self._process_data(data)

File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/dataloader.py:1370, in _MultiProcessingDataLoaderIter._process_data(self, data)
   1368 self._try_put_index()
   1369 if isinstance(data, ExceptionWrapper):
-> 1370     data.reraise()
   1371 return data

File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/_utils.py:706, in ExceptionWrapper.reraise(self)
    702 except TypeError:
    703     # If the exception takes multiple arguments, don't try to
    704     # instantiate since we don't know how to
    705     raise RuntimeError(msg) from None
--> 706 raise exception

AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/load_from_folder/wd_dataset_from_folder.py", line 108, in __getitem__
    X_text = self.text_from_folder.get_item(text_fname_or_text)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/load_from_folder/text/text_from_folder.py", line 67, in get_item
    processed_sample = self._preprocess_one_sample(text, self.preprocessor)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/load_from_folder/text/text_from_folder.py", line 92, in _preprocess_one_sample
    processed_sample = preprocessor.transform_sample(sample)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/preprocessing/text_preprocessor.py", line 176, in transform_sample
    tokens = get_texts([text], self.already_processed, self.n_cpus)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/utils/text_utils.py", line 109, in get_texts
    tok = Tokenizer(n_cpus=num_cpus).process_all(processed_texts)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/utils/fastai_transforms.py", line 338, in process_all
    e.map(self._process_all_1, partition_by_cores(texts, self.n_cpus)), []
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/concurrent/futures/process.py", line 859, in map
    results = super().map(partial(_process_chunk, fn),
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/concurrent/futures/_base.py", line 608, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/concurrent/futures/process.py", line 831, in submit
    self._start_executor_manager_thread()
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/concurrent/futures/process.py", line 770, in _start_executor_manager_thread
    self._launch_processes()
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/concurrent/futures/process.py", line 797, in _launch_processes
    self._spawn_process()
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/concurrent/futures/process.py", line 807, in _spawn_process
    p.start()
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/multiprocessing/process.py", line 118, in start
    assert not _current_process._config.get('daemon'), \
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: daemonic processes are not allowed to have children
jrzaurin commented 2 months ago

This looks like a more serious one 😄

Ok, so can you try setting the number of CPUs to 1 in the tokenizer?

I am traveling at the moment; I can look into this tomorrow.

We normally see some issues with conda environments (and the like), but I will check tomorrow. I remember I also had problems with the parallelization of the tokenizer, especially when multiprocessing defaults to fork.

For now, try setting the tokenizer's n_cpus to 1 and let's see.
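
For reference, a minimal sketch of what that looks like (assuming the TextPreprocessor setup from the load-from-folder examples; the column name and vocab settings here are placeholders). PyTorch DataLoader workers are daemonic processes and are not allowed to spawn children of their own, so n_cpus=1 keeps the tokenizer single-process and avoids the assertion above.

from pytorch_widedeep.preprocessing import TextPreprocessor

# n_cpus=1 disables the tokenizer's process pool, so the (daemonic)
# DataLoader workers never try to fork children of their own
text_preprocessor = TextPreprocessor(
    text_col="text", max_vocab=5000, min_freq=5, maxlen=90, n_cpus=1
)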

davidfstein commented 2 months ago

Ok, I set the CPUs to 1 but ran into a new error. I removed the text datasets entirely to see if I could get things working, and now I'm running into this error:

   0%|                                                      | 0/1 [00:00<?, ?it/s]/home/david/micromamba/envs/v2p2_train/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=116542) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
  0%|                                                      | 0/1 [00:17<?, ?it/s]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[19], line 1
----> 1 trainer.fit(
      2     train_loader=train_loader,
      3     eval_loader=val_loader,
      4     device='gpu'
      5     # finetune=True,
      6     # finetune_epochs=1,
      7 )

File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/utils/general_utils.py:12, in alias.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
     10         kwargs[original_name] = kwargs.pop(alt_name)
     11         break
---> 12 return func(*args, **kwargs)

File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/training/trainer_from_folder.py:261, in TrainerFromFolder.fit(self, train_loader, eval_loader, n_epochs, validation_freq, finetune, **kwargs)
    259 self.train_running_loss = 0.0
    260 with trange(train_steps, disable=self.verbose != 1) as t:
--> 261     for batch_idx, (data, targett) in zip(t, train_loader):
    262         t.set_description("epoch %i" % (epoch + 1))
    263         print('Weve started fitting')

File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/dataloader.py:630, in _BaseDataLoaderIter.__next__(self)
    627 if self._sampler_iter is None:
    628     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    629     self._reset()  # type: ignore[call-arg]
--> 630 data = self._next_data()
    631 self._num_yielded += 1
    632 if self._dataset_kind == _DatasetKind.Iterable and \
    633         self._IterableDataset_len_called is not None and \
    634         self._num_yielded > self._IterableDataset_len_called:

File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/dataloader.py:1344, in _MultiProcessingDataLoaderIter._next_data(self)
   1342 else:
   1343     del self._task_info[idx]
-> 1344     return self._process_data(data)

File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/dataloader.py:1370, in _MultiProcessingDataLoaderIter._process_data(self, data)
   1368 self._try_put_index()
   1369 if isinstance(data, ExceptionWrapper):
-> 1370     data.reraise()
   1371 return data

File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/_utils.py:706, in ExceptionWrapper.reraise(self)
    702 except TypeError:
    703     # If the exception takes multiple arguments, don't try to
    704     # instantiate since we don't know how to
    705     raise RuntimeError(msg) from None
--> 706 raise exception

ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/load_from_folder/tabular/tabular_from_folder.py", line 121, in get_item
    _sample = pd.read_csv(
              ^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1898, in _make_engine
    return mapping[engine](f, **self.options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 581, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/load_from_folder/wd_dataset_from_folder.py", line 86, in __getitem__
    X_tab, text_fname_or_text, img_fname, y = self.tab_from_folder.get_item(
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/load_from_folder/tabular/tabular_from_folder.py", line 126, in get_item
    raise ValueError("Currently only csv format is supported.")
ValueError: Currently only csv format is supported.

This is coming from TabFromFolder; it looks like an issue with how the CSV is indexed with pandas.

Specifically, _sample = pd.read_csv(path, skiprows=lambda x: x != idx + 1, header=None).values is accessing out of bounds. Changing idx + 1 to idx fixes this. There appear to be other issues once this is resolved, but I'll add a separate comment for those.
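
For context, a self-contained sketch of that skiprows indexing with a toy CSV (the values are placeholders): with a header row, idx + 1 selects the physical line for dataset index idx, and an index past the last data row leaves pandas nothing to parse, which surfaces as the EmptyDataError above.

import io

import pandas as pd

csv = "a,b\n1,2\n3,4\n5,6\n"  # header + 3 data rows

# dataset index 0 -> keep only physical line 1 (the first data row)
idx = 0
row = pd.read_csv(io.StringIO(csv), skiprows=lambda x: x != idx + 1, header=None).values
print(row)  # [[1 2]]

# dataset index 3 is out of range: physical line 4 does not exist, so
# pandas raises EmptyDataError("No columns to parse from file")
idx = 3
pd.read_csv(io.StringIO(csv), skiprows=lambda x: x != idx + 1, header=None)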

davidfstein commented 2 months ago

Ok, so the above issue turned out to be my specification of train_size. I created a sample file with 1000 rows including the header and set train_size to 1000, when it should actually have been 999. When I set this correctly, training proceeds as expected. So the original issue can be temporarily resolved by setting n_cpus to 1.
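
A quick way to avoid that off-by-one is to count only the data rows (the filename here is a placeholder, assuming a single header row):

# count data rows, excluding the header, so the size matches what the
# idx + 1 lookup in TabFromFolder can actually index
with open("train_sample.csv") as f:
    train_size = sum(1 for _ in f) - 1  # 1000 physical lines -> 999 samples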

jrzaurin commented 2 months ago

ok, perfect!

jrzaurin commented 2 months ago

Hi @davidfstein

Regarding the parallel processing, could you do me a favour: whatever code you run that makes the call into multiprocessing (the Tokenizer in this case), can you please run it within an

if __name__ == '__main__':

block?

So, for example:

...
from pytorch_widedeep.preprocessing import TextPreprocessor

if __name__ == "__main__":

    # keep the tokenizer single-process while the fork/daemon issue is open
    text_preprocessor = TextPreprocessor(
        text_col="text", max_vocab=5000, min_freq=5, maxlen=90, n_cpus=1
    )

    X_text_tr = text_preprocessor.fit_transform(train)
    X_text_te = text_preprocessor.transform(test)

    ...
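
(With the spawn start method, each child process re-imports the main module, so without this guard the preprocessing and training code would be re-executed in every child.)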