tiangexiang / OccNeRF

[ICCV 2023] Rendering Humans from Object-Occluded Monocular Videos
https://cs.stanford.edu/~xtiange/projects/occnerf/
MIT License
42 stars · 1 fork

Training problem #5

Open DavidTu21 opened 8 months ago

DavidTu21 commented 8 months ago

Dear authors,

Thank you for your amazing work. I just encountered a sudden issue while training on the ZJU-MoCap 387 example and would like to seek some advice if possible. Specifically, training hits the following error during epoch 36, and resuming training raises the same error. My current output folder contains logs.txt, neural_points_019500.jpg, prog_019000.jpg, neural_points_019000.jpg, prog_018500.jpg, etc. Thank you in advance for your help.

I am running on Ubuntu 18.04 with an RTX 3090.

Epoch: 35 [Iter 19000, 459/540 (85%), 23.11 sec] Loss: 0.2378 [lpips: 0.2348 mse: 0.0027 comp_loss: 0.0003 ]
Saving neural points with visibility attention ...
Neural points changes: 29.220615
Evaluate Progress Images ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [01:05<00:00,  4.10s/it]
Epoch: 35 [Iter 19020, 479/540 (89%), 89.87 sec] Loss: 0.2287 [lpips: 0.2241 mse: 0.0042 comp_loss: 0.0004 ]
Epoch: 35 [Iter 19040, 499/540 (92%), 22.76 sec] Loss: 0.2969 [lpips: 0.2889 mse: 0.0076 comp_loss: 0.0004 ]
Epoch: 35 [Iter 19060, 519/540 (96%), 21.54 sec] Loss: 0.1975 [lpips: 0.1935 mse: 0.0037 comp_loss: 0.0004 ]
Epoch: 35 [Iter 19080, 539/540 (100%), 23.05 sec] Loss: 0.1769 [lpips: 0.1732 mse: 0.0034 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19100, 19/540 (4%), 23.69 sec] Loss: 0.4381 [lpips: 0.4301 mse: 0.0077 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19120, 39/540 (7%), 22.53 sec] Loss: 0.2930 [lpips: 0.2856 mse: 0.0070 comp_loss: 0.0004 ]
Epoch: 36 [Iter 19140, 59/540 (11%), 23.18 sec] Loss: 0.2508 [lpips: 0.2475 mse: 0.0028 comp_loss: 0.0004 ]
Epoch: 36 [Iter 19160, 79/540 (15%), 22.85 sec] Loss: 0.1608 [lpips: 0.1567 mse: 0.0037 comp_loss: 0.0004 ]
Epoch: 36 [Iter 19180, 99/540 (18%), 23.89 sec] Loss: 0.2364 [lpips: 0.2264 mse: 0.0097 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19200, 119/540 (22%), 23.53 sec] Loss: 0.1317 [lpips: 0.1308 mse: 0.0005 comp_loss: 0.0004 ]
Epoch: 36 [Iter 19220, 139/540 (26%), 22.52 sec] Loss: 0.3992 [lpips: 0.3952 mse: 0.0037 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19240, 159/540 (29%), 22.78 sec] Loss: 0.1107 [lpips: 0.1085 mse: 0.0018 comp_loss: 0.0004 ]
Epoch: 36 [Iter 19260, 179/540 (33%), 22.63 sec] Loss: 0.3159 [lpips: 0.3017 mse: 0.0139 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19280, 199/540 (37%), 22.28 sec] Loss: 0.3333 [lpips: 0.3263 mse: 0.0067 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19300, 219/540 (41%), 22.68 sec] Loss: 0.2897 [lpips: 0.2823 mse: 0.0070 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19320, 239/540 (44%), 23.01 sec] Loss: 0.3260 [lpips: 0.3091 mse: 0.0166 comp_loss: 0.0004 ]
Epoch: 36 [Iter 19340, 259/540 (48%), 23.22 sec] Loss: 0.2215 [lpips: 0.2185 mse: 0.0026 comp_loss: 0.0004 ]
Epoch: 36 [Iter 19360, 279/540 (52%), 22.68 sec] Loss: 0.3639 [lpips: 0.3486 mse: 0.0150 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19380, 299/540 (55%), 22.86 sec] Loss: 0.2936 [lpips: 0.2865 mse: 0.0069 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19400, 319/540 (59%), 22.07 sec] Loss: 0.2340 [lpips: 0.2296 mse: 0.0041 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19420, 339/540 (63%), 22.56 sec] Loss: 0.4111 [lpips: 0.3986 mse: 0.0122 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19440, 359/540 (66%), 22.63 sec] Loss: 0.2444 [lpips: 0.2367 mse: 0.0074 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19460, 379/540 (70%), 22.57 sec] Loss: 0.3595 [lpips: 0.3451 mse: 0.0141 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19480, 399/540 (74%), 22.58 sec] Loss: 0.1769 [lpips: 0.1751 mse: 0.0015 comp_loss: 0.0003 ]
Epoch: 36 [Iter 19500, 419/540 (78%), 22.56 sec] Loss: 0.2978 [lpips: 0.2876 mse: 0.0099 comp_loss: 0.0003 ]
Saving neural points with visibility attention ...
Neural points changes: 27.522547
Evaluate Progress Images ...
  0%|                                                                                                                               | 0/16 [00:00<?, ?it/s]Exception ignored in: <function Image.__del__ at 0x7f3e890a34d0>
Traceback (most recent call last):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/tkinter/__init__.py", line 3507, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
 12%|██████████████▉                                                                                                        | 2/16 [00:07<00:54,  3.87s/it]Exception ignored in: <function Variable.__del__ at 0x7f3e89434a70>
Traceback (most recent call last):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/tkinter/__init__.py", line 332, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f3e89434a70>
Traceback (most recent call last):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/tkinter/__init__.py", line 332, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
 19%|██████████████████████▎                                                                                                | 3/16 [00:11<00:51,  3.96s/it]Exception ignored in: <function Variable.__del__ at 0x7f3e89434a70>
Traceback (most recent call last):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/tkinter/__init__.py", line 332, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f3e89434a70>
Traceback (most recent call last):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/tkinter/__init__.py", line 332, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Image.__del__ at 0x7f3e890a34d0>
Traceback (most recent call last):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/tkinter/__init__.py", line 3507, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f3e89434a70>
Traceback (most recent call last):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/tkinter/__init__.py", line 332, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f3e89434a70>
Traceback (most recent call last):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/tkinter/__init__.py", line 332, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f3e89434a70>
Traceback (most recent call last):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/tkinter/__init__.py", line 332, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f3e89434a70>
Traceback (most recent call last):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/tkinter/__init__.py", line 332, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
 19%|██████████████████████▎                                                                                                | 3/16 [00:26<01:56,  9.00s/it]
Traceback (most recent call last):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 986, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/multiprocessing/connection.py", line 921, in wait
    ready = selector.select(timeout)
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1049826) is killed by signal: Aborted.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 63, in <module>
    main()
  File "train.py", line 57, in main
    train_dataloader=train_loader)
  File "core/train/trainers/occnerf/trainer.py", line 271, in train
    is_reload_model = self.progress()
  File "core/train/trainers/occnerf/trainer.py", line 340, in progress
    for _, batch in enumerate(tqdm(self.prog_dataloader)):
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1182, in _next_data
    idx, data = self._get_data()
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1148, in _get_data
    success, data = self._try_get_data()
  File "/mnt/HDD4/mitu8956/anaconda/envs/occnerf2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 999, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1049826) exited unexpectedly
tiangexiang commented 8 months ago

Thanks for your interest in our work! This error seems strange and may not be related to the code itself. I have never encountered this problem, but I guess the dataloader error comes either from multiprocessing/multiple workers or from a memory issue. I would suggest lowering the patch size from 32 to a smaller value, or reducing the number of workers. Here is a related thread: https://github.com/pytorch/pytorch/issues/8976

P.S. I think ~20000 steps should be sufficient for training. Have you checked the validation results to see whether the network has converged?
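Coming back to the dataloader suggestion: below is a minimal, illustrative sketch of the worker-side workaround (this is not the exact code in this repo, and `build_prog_dataloader` is a made-up helper name). The tkinter "main thread is not in main loop" messages in the log are also typical of a GUI plotting backend being used outside the main thread, so if matplotlib is what saves the progress images, forcing the non-interactive Agg backend may help as well:

```python
# Illustrative sketch only, not the actual OccNeRF code.
import matplotlib
matplotlib.use('Agg')  # non-interactive backend: avoids tkinter/Tcl errors when
                       # figures are created outside the main thread

from torch.utils.data import DataLoader

def build_prog_dataloader(dataset):
    # Hypothetical helper: num_workers=0 keeps loading in the main process,
    # which sidesteps the "DataLoader worker killed by signal" crash at the
    # cost of slower progress evaluation.
    return DataLoader(dataset, batch_size=1, shuffle=False,
                      num_workers=0, pin_memory=False)
```

The patch-size change itself only requires editing the experiment's yaml config.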

DavidTu21 commented 8 months ago

Hi Tiange,

Thank you very much for your reply and the input. I have changed the patch size in configs/occnerf/zju_mocap/387/occnerf.yaml from 32 to 16 first and will see how it goes.

Additionally, I didn't see an output of the loss over epochs, so I have attached the output prog_019000.png. May I ask whether you think further training is needed? Thank you again for your time.

[attached image: prog_019000]

Kind regards, David

tiangexiang commented 8 months ago

Hi David, the results look good to me! I think you can stop there and report results :)

DavidTu21 commented 7 months ago

Hi Tiange, thank you very much for your input! I will close this issue then :)

DavidTu21 commented 7 months ago

Sorry Tiange, just one last question: if I'd like to further test the code on customized data (for example, a 3D body scan with real occlusions, in a similar format to the ZJU-MoCap data), will the current code be able to handle that?

tiangexiang commented 7 months ago

Yes, it can handle real-world occlusions in other datasets. But remember to remove the masking scheme that is specified in the config file and executed in the data loader: since the binary human masks for real-world occlusions already show only the visible body parts, no extra masking is needed.
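As a rough, purely hypothetical sketch of that idea (the names below, e.g. `use_synthetic_mask` and `prepare_sample`, are not the actual keys or functions in this repo), the data loading for real-world captures simply skips the synthetic occlusion step and uses the provided human mask as-is:

```python
# Hypothetical illustration only; names do not come from the OccNeRF codebase.
import numpy as np

def prepare_sample(image: np.ndarray, human_mask: np.ndarray,
                   use_synthetic_mask: bool = False):
    if use_synthetic_mask:
        # ZJU-MoCap-style training: simulate an occluder by zeroing out part
        # of the visible body in the mask (placeholder logic).
        h, w = human_mask.shape
        human_mask = human_mask.copy()
        human_mask[h // 3: 2 * h // 3, w // 3: 2 * w // 3] = 0
    # For real-world captures the binary human mask already reflects the
    # occlusion, so the image and mask are returned unchanged.
    return image, human_mask
```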