shunsukesaito / PIFu

This repository contains the code for the paper "PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization"
https://shunsukesaito.github.io/PIFu/

Training Errors #41

Closed alitokur closed 4 years ago

alitokur commented 4 years ago

Sir, I've created ./bounce/bounce0.txt and ./bounce/face.npy under {path_to_rp_dennis_posed_004_OBJ}. Now I am trying to train, but I get a RuntimeError:


(tokurEnv) hamit@hamit-MS-7B49:~/Softwares/environments/PIFu$ python -m apps.train_shape --dataroot /home/hamit/Softwares/environments/PIFu/tempImages --random_flip --random_scale --random_trans
/home/hamit/Softwares/environments/PIFu/lib/data/TrainDataset.py:102: UserWarning: loadtxt: Empty input file: "/home/hamit/Softwares/environments/PIFu/tempImages/val.txt"
  var_subjects = np.loadtxt(os.path.join(self.root, 'val.txt'), dtype=str)
train data size:  180
test data size:  360
initialize network with normal
Using Network:  hgpifu
Traceback (most recent call last):
  File "/home/hamit/Softwares/environments/tokurEnv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/home/hamit/Softwares/environments/tokurEnv/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 12928) is killed by signal: Killed. 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hamit/Softwares/environments/PIFu/apps/train_shape.py", line 183, in <module>
    train(opt)
  File "/home/hamit/Softwares/environments/PIFu/apps/train_shape.py", line 90, in train
    for train_idx, train_data in enumerate(train_data_loader):
  File "/home/hamit/Softwares/environments/tokurEnv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/hamit/Softwares/environments/tokurEnv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/hamit/Softwares/environments/tokurEnv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/hamit/Softwares/environments/tokurEnv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 12928) exited unexpectedly

PyTorch version : 1.4.0

shunsukesaito commented 4 years ago

If you use num_workers=0, you should be able to see where this error occurs.

alitokur commented 4 years ago

I set it to 0 but I still get an error, Sir.

(tokurEnv) hamit@hamit-MS-7B49:~/Softwares/environments/PIFu$ python -m apps.train_shape --dataroot /home/hamit/Softwares/environments/PIFu/tempImages --random_flip --random_scale --random_trans
/home/hamit/Softwares/environments/PIFu/lib/data/TrainDataset.py:102: UserWarning: loadtxt: Empty input file: "/home/hamit/Softwares/environments/PIFu/tempImages/val.txt"
  var_subjects = np.loadtxt(os.path.join(self.root, 'val.txt'), dtype=str)
train data size:  180
test data size:  360
initialize network with normal
Using Network:  hgpifu
Killed

alitokur commented 4 years ago

I tried again and got this error:

python -m apps.train_shape --dataroot /home/hamit/Softwares/environments/PIFu/tempImages --random_flip --random_scale --random_trans --num_threads=0
/home/hamit/Softwares/environments/PIFu/lib/data/TrainDataset.py:102: UserWarning: loadtxt: Empty input file: "/home/hamit/Softwares/environments/PIFu/tempImages/val.txt"
  var_subjects = np.loadtxt(os.path.join(self.root, 'val.txt'), dtype=str)
train data size:  180
test data size:  360
initialize network with normal
Using Network:  hgpifu
Traceback (most recent call last):
  File "/home/hamit/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/hamit/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hamit/Softwares/environments/PIFu/apps/train_shape.py", line 183, in <module>
    train(opt)
  File "/home/hamit/Softwares/environments/PIFu/apps/train_shape.py", line 105, in train
    res, error = netG.forward(image_tensor, sample_tensor, calib_tensor, labels=label_tensor)
  File "/home/hamit/Softwares/environments/PIFu/lib/model/HGPIFuNet.py", line 131, in forward
    self.filter(images)
  File "/home/hamit/Softwares/environments/PIFu/lib/model/HGPIFuNet.py", line 63, in filter
    self.im_feat_list, self.tmpx, self.normx = self.image_filter(images)
  File "/home/hamit/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hamit/Softwares/environments/PIFu/lib/model/HGFilters.py", line 129, in forward
    hg = self._modules['m' + str(i)](previous)
  File "/home/hamit/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hamit/Softwares/environments/PIFu/lib/model/HGFilters.py", line 56, in forward
    return self._forward(self.depth, x)
  File "/home/hamit/Softwares/environments/PIFu/lib/model/HGFilters.py", line 39, in _forward
    low2 = self._forward(level - 1, low1)
  File "/home/hamit/Softwares/environments/PIFu/lib/model/HGFilters.py", line 50, in _forward
    up2 = F.interpolate(low3, scale_factor=2, mode='bicubic', align_corners=True)
  File "/home/hamit/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2459, in interpolate
    " (got {})".format(input.dim(), mode))
NotImplementedError: Input Error: Only 3D, 4D and 5D input Tensors supported (got 4D) for the modes: nearest | linear | bilinear | trilinear (got bicubic)

alitokur commented 4 years ago

I guess the problem is right here:

        # NOTE: for newer PyTorch (1.3~), it seems that training results are degraded due to implementation diff in F.grid_sample
        # if the pretrained model behaves weirdly, switch with the commented line.
        # NOTE: I also found that "bicubic" works better.
        up2 = F.interpolate(low3, scale_factor=2, mode='bicubic', align_corners=True)
        # up2 = F.interpolate(low3, scale_factor=2, mode='nearest')

I tried the other up2 line, but then I get a TypeError: grid_sample() got an unexpected keyword argument 'align_corners'

shunsukesaito commented 4 years ago

Are you sure you are using PyTorch 1.4? Newer PyTorch versions should have this 'align_corners' argument. Please try the latest PyTorch if you keep getting this error.
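
One detail in the logs above may explain the version doubt: the first traceback runs under .../tokurEnv/lib/python3.6, while the one with the bicubic error runs under /home/hamit/anaconda3/lib/python3.7, so the two runs may not have used the same PyTorch install. A minimal sketch for checking which interpreter is active and where a package would be imported from is below. The helper name `describe_environment` is hypothetical and not part of PIFu; it only uses the standard library and does not import the package itself.

```python
import importlib.util
import sys


def describe_environment(package="torch"):
    """Report the running interpreter and where `package` would be imported from.

    Hypothetical helper for illustration; it locates the package without importing it.
    """
    spec = importlib.util.find_spec(package)
    return {
        "python": sys.executable,
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        # None means the package is not installed for this interpreter.
        "package_location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("{}: {}".format(key, value))
```

Running this with the same `python` that launches `apps.train_shape` shows whether the virtual environment or the Anaconda base install is being picked up.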

alitokur commented 4 years ago

I upgraded PyTorch. Now:

(tokurEnv) hamit@hamit-MS-7B49:~/Softwares/environments/PIFu$ python -c "import torch; print(torch.__version__)"
1.5.1

and nvidia-smi: issue

If I set num_threads=0:

(tokurEnv) hamit@hamit-MS-7B49:~/Softwares/environments/PIFu$ python -m apps.train_shape --dataroot /home/hamit/Softwares/environments/PIFu/tempImages --random_flip --random_scale --random_trans --num_threads=0
/home/hamit/Softwares/environments/PIFu/lib/data/TrainDataset.py:102: UserWarning: loadtxt: Empty input file: "/home/hamit/Softwares/environments/PIFu/tempImages/val.txt"
  var_subjects = np.loadtxt(os.path.join(self.root, 'val.txt'), dtype=str)
train data size:  180
test data size:  360
initialize network with normal
Using Network:  hgpifu
Killed

Otherwise, if I just run the following command:

python -m apps.train_shape --dataroot /home/hamit/Softwares/environments/PIFu/tempImages --random_flip --random_scale --random_trans 
/home/hamit/Softwares/environments/PIFu/lib/data/TrainDataset.py:102: UserWarning: loadtxt: Empty input file: "/home/hamit/Softwares/environments/PIFu/tempImages/val.txt"
  var_subjects = np.loadtxt(os.path.join(self.root, 'val.txt'), dtype=str)
train data size:  180
test data size:  360
initialize network with normal
Using Network:  hgpifu
Traceback (most recent call last):
  File "/home/hamit/Softwares/environments/tokurEnv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/home/hamit/Softwares/environments/tokurEnv/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 5784) is killed by signal: Killed. 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hamit/Softwares/environments/PIFu/apps/train_shape.py", line 183, in <module>
    train(opt)
  File "/home/hamit/Softwares/environments/PIFu/apps/train_shape.py", line 90, in train
    for train_idx, train_data in enumerate(train_data_loader):
  File "/home/hamit/Softwares/environments/tokurEnv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/hamit/Softwares/environments/tokurEnv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/hamit/Softwares/environments/tokurEnv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/hamit/Softwares/environments/tokurEnv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 5784) exited unexpectedly

I'm a little bit stuck, Sir.

shunsukesaito commented 4 years ago

Okay. Can you spot where the program gets stuck inside TrainDataset.py? What if you simply call elements from the dataset without wrapping it with DataLoader?
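
A rough sketch of that suggestion, assuming only that the dataset supports `len()` and integer indexing: fetch items one by one in the main process, print progress as you go, and record time, peak Python-level memory, and any exception. `probe_dataset` and `ToyDataset` are hypothetical names for illustration; substitute the project's own dataset object. Note that `tracemalloc` only sees allocations made through Python's allocator, so native buffers may be undercounted.

```python
import time
import tracemalloc


def probe_dataset(dataset, limit=None):
    """Fetch items one at a time and report time, peak traced memory, and errors."""
    count = len(dataset) if limit is None else min(limit, len(dataset))
    report = []
    for idx in range(count):
        # Print before fetching: if the process is killed, the last line names the culprit.
        print("fetching item", idx, flush=True)
        tracemalloc.start()
        started = time.perf_counter()
        error = None
        try:
            dataset[idx]
        except Exception as exc:  # keep going so every failing index is listed
            error = repr(exc)
        elapsed = time.perf_counter() - started
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        report.append(
            {"index": idx, "seconds": elapsed, "peak_bytes": peak, "error": error}
        )
    return report


class ToyDataset:
    """Hypothetical stand-in for a real dataset; item 2 fails on purpose."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        if idx == 2:
            raise ValueError("bad sample")
        return [0.0] * (1000 * (idx + 1))


if __name__ == "__main__":
    for row in probe_dataset(ToyDataset()):
        print(row)
```

Because nothing runs in a worker process here, an ordinary exception surfaces with its real message, and a hard kill at least leaves the index of the item being loaded on the screen.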

alitokur commented 4 years ago

It took me all weekend, but I got it.

hani1994a commented 2 years ago

I solved this problem by reducing the number of samples in the option.py file.
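
A bare "Killed" with no Python traceback, as in the logs earlier in this thread, usually means the process received SIGKILL. On Linux that is often the kernel's out-of-memory killer, which would be consistent with the fix of reducing the sample count, although the thread never confirms the cause. A minimal sketch for checking how a command ended is below; `describe_exit` and `run_and_describe` are hypothetical helper names, and the signal decoding assumes a POSIX system.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a short description."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was ended by that signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status {}".format(returncode)


def run_and_describe(argv):
    """Run a command, wait for it to finish, and describe how it ended."""
    completed = subprocess.run(argv)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Demonstration: a child process that sends SIGKILL to itself.
    child = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGKILL)",
    ]
    print(run_and_describe(child))
```

Wrapping the training command in `run_and_describe` (as an argument list) would show whether the run was ended by a signal rather than by a Python error; checking the kernel log afterwards can then tell whether memory was the reason.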