This looks like a more serious one 😄
Ok, so can you try setting the number of CPUs to 1 in the tokenizer?
I am traveling at the moment; I can look into this tomorrow.
We normally see some issues with conda envs (or similar), but I will check tomorrow. I remember I also had problems with the parallelization of the tokenizer, especially when multiprocessing defaults to fork.
For now, try setting the tokenizer's n_cpus to 1 and let's see.
Ok, I set the cpus to 1, but then I ran into a new error. I removed the text datasets entirely to see if I could get it working. I'm running into this error now:
0%| | 0/1 [00:00<?, ?it/s]/home/david/micromamba/envs/v2p2_train/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=116542) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
0%| | 0/1 [00:17<?, ?it/s]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[19], line 1
----> 1 trainer.fit(
2 train_loader=train_loader,
3 eval_loader=val_loader,
4 device='gpu'
5 # finetune=True,
6 # finetune_epochs=1,
7 )
File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/utils/general_utils.py:12, in alias.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
10 kwargs[original_name] = kwargs.pop(alt_name)
11 break
---> 12 return func(*args, **kwargs)
File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/training/trainer_from_folder.py:261, in TrainerFromFolder.fit(self, train_loader, eval_loader, n_epochs, validation_freq, finetune, **kwargs)
259 self.train_running_loss = 0.0
260 with trange(train_steps, disable=self.verbose != 1) as t:
--> 261 for batch_idx, (data, targett) in zip(t, train_loader):
262 t.set_description("epoch %i" % (epoch + 1))
263 print('Weve started fitting')
File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/dataloader.py:630, in _BaseDataLoaderIter.__next__(self)
627 if self._sampler_iter is None:
628 # TODO(https://github.com/pytorch/pytorch/issues/76750)
629 self._reset() # type: ignore[call-arg]
--> 630 data = self._next_data()
631 self._num_yielded += 1
632 if self._dataset_kind == _DatasetKind.Iterable and \
633 self._IterableDataset_len_called is not None and \
634 self._num_yielded > self._IterableDataset_len_called:
File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/dataloader.py:1344, in _MultiProcessingDataLoaderIter._next_data(self)
1342 else:
1343 del self._task_info[idx]
-> 1344 return self._process_data(data)
File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/dataloader.py:1370, in _MultiProcessingDataLoaderIter._process_data(self, data)
1368 self._try_put_index()
1369 if isinstance(data, ExceptionWrapper):
-> 1370 data.reraise()
1371 return data
File ~/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/_utils.py:706, in ExceptionWrapper.reraise(self)
702 except TypeError:
703 # If the exception takes multiple arguments, don't try to
704 # instantiate since we don't know how to
705 raise RuntimeError(msg) from None
--> 706 raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/load_from_folder/tabular/tabular_from_folder.py", line 121, in get_item
_sample = pd.read_csv(
^^^^^^^^^^^^
File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
return _read(filepath_or_buffer, kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
self._engine = self._make_engine(f, self.engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1898, in _make_engine
return mapping[engine](f, **self.options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
self._reader = parsers.TextReader(src, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "parsers.pyx", line 581, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
^^^^^^^^^^^^^^^^^^^^
File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
~~~~~~~~~~~~^^^^^
File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/load_from_folder/wd_dataset_from_folder.py", line 86, in __getitem__
X_tab, text_fname_or_text, img_fname, y = self.tab_from_folder.get_item(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/david/micromamba/envs/v2p2_train/lib/python3.12/site-packages/pytorch_widedeep/load_from_folder/tabular/tabular_from_folder.py", line 126, in get_item
raise ValueError("Currently only csv format is supported.")
ValueError: Currently only csv format is supported.
This is coming from TabFromFolder. It looks like this is an issue with the indexing of the CSV with pandas. Specifically, _sample = pd.read_csv(path, skiprows=lambda x: x != idx + 1, header=None).values is accessing out of bounds. Changing idx + 1 to idx fixes this. There appear to be other issues once this is resolved, but I'll add a separate comment for those.
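For context, here is a minimal sketch of what that read is doing (the path and idx are hypothetical, and the skiprows logic is paraphrased from tabular_from_folder.py). With header=None the file's header sits at physical line 0, so sample idx is expected at line idx + 1; if idx + 1 points past the last line, every line gets skipped and pandas raises the EmptyDataError shown above:

import pandas as pd

# Hedged sketch of the single-row read inside TabFromFolder.get_item.
idx = 0  # hypothetical sample index
_sample = pd.read_csv(
    "train_sample.csv",               # hypothetical path
    skiprows=lambda x: x != idx + 1,  # keep only physical line idx + 1
    header=None,
).values
# If idx + 1 is past the end of the file, nothing is kept and pandas raises
# EmptyDataError ("No columns to parse from file"), which TabFromFolder then
# reports as "Currently only csv format is supported."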
Ok, so the above issue appears to be with my specification of train_size. I created a sample file with 1000 rows including the header and set train_size to 1000, when it should actually be 999. When I set this correctly, training proceeds as expected. So the original issue can be temporarily resolved by setting n_cpus to 1.
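For reference, a quick way to avoid this off-by-one (the path is hypothetical) is to derive train_size from the file itself so the header line is not counted:

# Hedged sketch: count data rows, excluding the single header line.
with open("train_sample.csv") as f:      # hypothetical path
    train_size = sum(1 for _ in f) - 1   # 1000 physical lines -> 999 samples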
ok, perfect!
Hi @davidfstein
Regarding the parallel processing, could you do me a favour: whatever code you run that contains the call to the multiprocessing process (the Tokenizer in this case), can you please run it within an if __name__ == '__main__' block?
So, for example:

...
from pytorch_widedeep.preprocessing import TextPreprocessor

if __name__ == "__main__":
    text_preprocessor = TextPreprocessor(
        text_col="text", max_vocab=5000, min_freq=5, maxlen=90, n_cpus=1
    )
    X_text_tr = text_preprocessor.fit_transform(train)
    X_text_te = text_preprocessor.transform(test)
....
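The guard matters because, when multiprocessing uses the spawn or forkserver start methods, worker processes re-import your main module, and without the guard the preprocessing (and the spawning itself) would run again in every child. As for the fork() DeprecationWarning in your log, one option, and this is only a suggestion rather than something pytorch_widedeep requires, is to select the spawn start method explicitly inside the same guard:

import multiprocessing as mp

if __name__ == "__main__":
    # Suggestion only: force the spawn start method so that Python 3.12's
    # "multi-threaded fork()" warning no longer applies.
    mp.set_start_method("spawn", force=True)
    ...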
When training with TrainerFromFolder as follows
I'm running into