JosseVanDelm opened this issue 5 years ago
Not sure what the reason is. Maybe @SsnL @taesungp have a clue.
Hi there, after running the train.py script through the debugger (and being lucky enough that it got stuck again), I noticed that the program is not getting past this line in the train.py script.
It gets stuck in these lines of PyTorch code (comments added by me):
```python
while True:  # This loop takes forever
    try:
        r = index_queue.get(timeout=MANAGER_STATUS_CHECK_INTERVAL)  # r: <class 'tuple'>: (709, [538])
    except queue.Empty:
        if watchdog.is_alive():  # and for some reason watchdog is always alive
            continue             # so this loop keeps going forever :(
        else:
            break
    if r is None:
        break
    idx, batch_indices = r
    try:
        samples = collate_fn([dataset[i] for i in batch_indices])
    except Exception:
        data_queue.put((idx, ExceptionWrapper(sys.exc_info())))
    else:
        data_queue.put((idx, samples))
        del samples
```
This is the stack trace I get:

```
_worker_loop, dataloader.py:97
run, process.py:93
_bootstrap, process.py:258
_launch, popen_fork.py:73
__init__, popen_fork.py:19
_Popen, context.py:277
_Popen, context.py:223
start, process.py:105
__init__, dataloader.py:289
__iter__, dataloader.py:501
__iter__, __init__.py:90
<module>, train.py:43
```

followed by the debugger that calls the train script:

```
execfile, _pydev_execfile.py:18
run, pydevd.py:1135
main, pydevd.py:1735
<module>, pydevd.py:1741
```
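For reference, a stack like the one above can also be captured without attaching a debugger by registering Python's faulthandler; a minimal sketch, assuming a POSIX system (signal-based registration is not available on Windows):

```python
# Sketch: add near the top of train.py. While the process appears hung,
# run `kill -USR1 <pid>` from another shell; the tracebacks of all
# threads in the main process are then written to stderr (worker
# processes would each need the same hook).
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```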
I still have no clue as to what makes this happen. Any thoughts? Is it possible that this has something to do with the fact that I did not explicitly start the visdom server myself? I'll keep looking whilst debugging, but this "low-level" code is way out of my comfort zone, so the help of anyone who knows more about this kind of issue is very much appreciated. Thanks!
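On that visdom question: the server can be started manually with `python -m visdom.server`, and its reachability can be checked from Python before a long training run; a minimal sketch, assuming visdom's default port 8097:

```python
# Sketch: verify that a visdom server is actually listening.
import visdom

vis = visdom.Visdom(port=8097)
print("visdom reachable:", vis.check_connection())
```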
I have the same issue, always stopping at epoch 15. Likewise, I was not starting visdom manually. So I disabled visdom and it is now working. I use TensorBoard instead.
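A minimal sketch of that workaround, assuming PyTorch >= 1.1 (which ships torch.utils.tensorboard); in this repository the per-iteration losses would come from model.get_current_losses(), but dummy values are used here so the sketch stands alone:

```python
# Sketch: log named losses to TensorBoard instead of visdom.
from torch.utils.tensorboard import SummaryWriter  # PyTorch >= 1.1

writer = SummaryWriter(log_dir="runs/cyclegan")  # illustrative log dir

for step in range(3):
    losses = {"G_A": 0.5, "D_A": 0.3}  # stand-in for model.get_current_losses()
    for name, value in losses.items():
        writer.add_scalar("loss/" + name, value, step)

writer.close()
```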
@JosseVanDelm The pytorch code you linked is running in the worker process, and it is supposed to be an infinite loop until the main process sends a signal or dies. The hang could very well be in the main process.
That said, a lot of improvements were made to the data loader between 0.4.1 and 1.0.0. If the hang is indeed related to the dataloader, upgrading may resolve it.
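One cheap way to test that hypothesis: with num_workers=0 all loading happens in the main process, so a hang caused by worker multiprocessing should disappear (the repository exposes the worker count as an option in options/base_options.py). A self-contained sketch with a dummy dataset:

```python
# Sketch: run the loader single-process to rule the workers in or out.
import torch
from torch.utils.data import DataLoader, TensorDataset

print(torch.__version__)  # confirm which dataloader implementation runs

dataset = TensorDataset(torch.randn(16, 3, 64, 64))
for (batch,) in DataLoader(dataset, batch_size=4, num_workers=0):
    pass  # no worker processes are spawned with num_workers=0
```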
Thank you for your comments, @SsnL and @olivier-gillet.
This weekend I ran 26 consecutive trainings on the DGX server without a hitch. This time I used a slightly newer version of the Docker container and passed the display_id=0 option to every training run to disable visdom.
I still have no clue what causes the issue. Maybe the reason for the hang is indeed that I didn't start visdom manually beforehand, as @olivier-gillet pointed out as well? I doubt it is the PyTorch version, because the last time I trained on the DGX I used this container, which runs PyTorch commit 81e025d (which is past version 1.0.0, if I am correct), and I had the same issue there.
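For context on why display_id=0 sidesteps visdom entirely: the repository's visualizer only opens a visdom connection when the display id is positive. An illustrative paraphrase (not the exact code in util/visualizer.py):

```python
# Paraphrase: with --display_id 0 this branch is never taken, so no
# visdom client is created and no connection is ever attempted.
class Visualizer:
    def __init__(self, opt):
        self.display_id = opt.display_id
        if self.display_id > 0:
            import visdom
            self.vis = visdom.Visdom(server=opt.display_server,
                                     port=opt.display_port)
```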
Hi there,
I have been using both CycleGAN and Pix2Pix for my thesis, and they have been very helpful. I have already been able to train quite a few models with both. Sometimes, however, the training process stops without giving any error or warning: my terminal just states `end of epoch 77 ...` and then nothing happens anymore. If I check the state of my graphics card (using nvidia-smi), I can see that the GPU memory is still allocated by Python, but the GPU usage is 0% (during normal training it stays between roughly 95 and 100%). If I then stop the script and restart training with the --continue_train option, it does work and finishes after 200 epochs.
For training I only used the train.py script as provided, with all the default settings. trainA and trainB each contained exactly 1000 images (the pix2pix training also used 1000 pairs).
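Since the symptom is exactly "memory still allocated, GPU utilization stuck at 0%", a crude external watchdog can at least flag the hang early instead of losing days to it; a minimal sketch, assuming a single GPU, nvidia-smi on the PATH, and arbitrary thresholds:

```python
# Sketch: poll nvidia-smi and report when GPU utilization flatlines.
import subprocess
import time

def gpu_util():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"])
    return int(out.decode().split()[0])  # first GPU only

idle_polls = 0
while True:
    idle_polls = idle_polls + 1 if gpu_util() == 0 else 0
    if idle_polls >= 30:  # ~5 minutes of zero utilization at 10 s polls
        print("training appears to be hung")
        break
    time.sleep(10)
```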
I have had this issue on my machine.
I have also had this exact issue on our university's Nvidia DGX-1 server, inside a Docker container built from Nvidia's own Docker image (I just pip-installed dominate and visdom). On the server the issue was very annoying: when I checked today, it turned out the server had not done any training past epoch 77 of my first experiment (I had 4 more planned) for 3 days.
I am very sorry for the vague description. I really have no clue what causes this issue, and I have not been able to reproduce it intentionally; it seems to come up randomly. It has happened at least 3 times now (twice on my machine, once on the DGX), in both CycleGAN and Pix2Pix trainings, always around epoch 7x. I hoped it would have something to do with my machine and its chaotic configuration, but apparently it also happens on state-of-the-art machines inside a Docker container.
Any thoughts or comments? I would be happy to provide more information about my setup if this post is not clear enough. I am also going to run this through PyCharm's debugger and see if I can get any wiser about it myself.
Thanks in advance,
Josse