tkipf / c-swm

Contrastive Learning of Structured World Models
MIT License

Data Loading Error #10

Open aadharna opened 2 years ago

aadharna commented 2 years ago

When I run the code provided here, I hit a RuntimeError at https://github.com/tkipf/c-swm/blob/e944b24bcaa42d9ee847f30163437a50f0237aa0/train.py#L104 while following your Shapes-2D build/train steps: python train.py --dataset data/shapes_train.h5 --encoder small --name shapes

Specifically, the error says: RuntimeError: DataLoader worker is killed by signal: Killed. It seems that the DataLoader's worker processes are failing while loading the training set. But when I look through your utils files, I don't see why this error would occur.

This same error has occurred on both a Windows machine and a Linux machine.

Here's the full trace:

Traceback (most recent call last):
  File "/home/aadharna/miniconda3/envs/cswm/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
    data = self.data_queue.get(timeout=timeout)
  File "/home/aadharna/miniconda3/envs/cswm/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/aadharna/miniconda3/envs/cswm/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/aadharna/miniconda3/envs/cswm/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/aadharna/miniconda3/envs/cswm/lib/python3.7/multiprocessing/connection.py", line 921, in wait
    ready = selector.select(timeout)
  File "/home/aadharna/miniconda3/envs/cswm/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/aadharna/miniconda3/envs/cswm/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 19014) is killed by signal: Killed. 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 104, in <module>
    obs = train_loader.__iter__().next()[0]
  File "/home/aadharna/miniconda3/envs/cswm/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 804, in __next__
    idx, data = self._get_data()
  File "/home/aadharna/miniconda3/envs/cswm/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 771, in _get_data
    success, data = self._try_get_data()
  File "/home/aadharna/miniconda3/envs/cswm/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 737, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 19012, 19014) exited unexpectedly
(cswm) aadharna@penguin:~/PycharmProjects/c-swm$ 

I'll continue poking around and update when I find the root cause.
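
For readers following the trace: "killed by signal: Killed" is the wording for SIGKILL. On Linux this signal is often, though not only, sent by the kernel's out-of-memory killer. The sketch below is not part of the c-swm code. It is a minimal, POSIX-only illustration (the child command and function name are made up for this example) of how a parent process sees a child that was terminated this way:

```python
import signal
import subprocess
import sys

# Hypothetical child: a Python process that sends SIGKILL to itself,
# standing in for a worker terminated from outside (e.g. by the OOM killer).
CHILD_CODE = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"


def run_and_describe_exit():
    """Run the child and return (returncode, signal name or None)."""
    proc = subprocess.run([sys.executable, "-c", CHILD_CODE])
    if proc.returncode < 0:
        # On POSIX, a return code of -N means "terminated by signal N".
        return proc.returncode, signal.Signals(-proc.returncode).name
    return proc.returncode, None


if __name__ == "__main__":
    code, name = run_and_describe_exit()
    print(f"child exited with return code {code} (signal: {name})")
```

The child gets no chance to raise a Python exception of its own, which would explain why the trace only shows the parent noticing that workers "exited unexpectedly".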

mmdjiji commented 7 months ago

Same situation here. Maybe there is not enough RAM to run the training?

Yes, it is a RAM problem.
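
A rough way to sanity-check the RAM explanation is a back-of-the-envelope estimate. The sketch below assumes the dataset is held fully in memory and that, in the worst case, each DataLoader worker ends up with its own copy in addition to the main process. This is an assumption, not a measurement from this repository, and the function name and any sizes you plug in are illustrative:

```python
def max_workers_for_memory(dataset_bytes, available_bytes, requested_workers):
    """Largest worker count (<= requested_workers) whose worst-case footprint
    fits in available_bytes, counting one dataset copy for the main process
    plus one copy per worker.

    Returns 0 when only the main process (or not even that) fits; in that
    case the data would have to be loaded without worker processes.
    """
    if dataset_bytes <= 0:
        raise ValueError("dataset_bytes must be positive")
    if requested_workers < 0:
        raise ValueError("requested_workers must be non-negative")
    # How many full copies of the dataset fit in the available memory.
    copies_that_fit = available_bytes // dataset_bytes
    # Reserve one copy for the main process; the rest can go to workers.
    return max(0, min(requested_workers, copies_that_fit - 1))


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Illustrative numbers only: a 1.5 GiB dataset on a machine with 4 GiB free.
    workers = max_workers_for_memory(int(1.5 * GIB), 4 * GIB, requested_workers=4)
    print(f"suggested num_workers: {workers}")
```

The result could be passed as `num_workers` when constructing a `torch.utils.data.DataLoader`; `num_workers=0` makes PyTorch load batches in the main process, which avoids extra worker processes at the cost of slower loading.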