junyanz / pytorch-CycleGAN-and-pix2pix

Image-to-Image Translation in PyTorch

Training stops randomly around epoch 77 #619

Open JosseVanDelm opened 5 years ago

JosseVanDelm commented 5 years ago

Hi there,

I have been using both CycleGAN and Pix2Pix for my thesis, and it has been very helpful for me. I have already been able to train quite a few models with both. Sometimes, however, the training process stops without giving any error or warning (my terminal just states: end of epoch 77 ..., and then nothing happens anymore). If I check the state of my graphics card (using nvidia-smi), I can see that the GPU memory is still allocated by Python, but the GPU usage is 0% (during normal training it stays between roughly 95% and 100%). If I then stop the script and restart training with the --continue_train option, it does work and finishes after 200 epochs.

For training I only used the train.py script as provided, with all the default settings. trainA and trainB each contained exactly 1000 images (the pix2pix training also used 1000 pairs).

I have had this issue on my own machine.

I have also had this exact issue on our university's Nvidia DGX-1 server, inside a Docker container built from Nvidia's own Docker image, on top of which I only pip-installed dominate and visdom. On the server the issue was especially annoying: when I checked today, it turned out the server had not done any training past epoch 77 of my first experiment (I had 4 more planned) for 3 days.

I am very sorry for the vague description. I really have no clue what causes this issue, and I have not been able to reproduce it intentionally; it seems to come up randomly. It has happened at least 3 times now (twice on my machine, once on the DGX), in both CycleGAN and Pix2Pix trainings, always around epoch 7x. I had hoped it was down to my machine and its chaotic configuration, but apparently it also happens on state-of-the-art machines inside a Docker container.

Any thoughts? Comments? I would be very happy to provide more information about my setup/situation if this post is not clear enough. I am also going to try running this through PyCharm's debugger and see if I can get any wiser about it myself.

Thanks in advance,

Josse

junyanz commented 5 years ago

Not sure what the reason is. Maybe @SsnL @taesungp have a clue.

JosseVanDelm commented 5 years ago

Hi there, after running the train.py script through the debugger (and being lucky enough that it got stuck again), I noticed that the program does not get past this line in the train.py script.

It gets stuck in these lines of PyTorch code (comments added by myself):

    while True:  # This loop takes forever
        try:
            r = index_queue.get(timeout=MANAGER_STATUS_CHECK_INTERVAL) # r: <class 'tuple'>: (709,[538])
        except queue.Empty:
            if watchdog.is_alive(): # and for some reason watchdog is always alive
                continue                  # so this loop keeps going forever :(
            else:
                break
        if r is None:
            break
        idx, batch_indices = r
        try:
            samples = collate_fn([dataset[i] for i in batch_indices])
        except Exception:
            data_queue.put((idx, ExceptionWrapper(sys.exc_info())))
        else:
            data_queue.put((idx, samples))
            del samples

This is the stacktrace I get:

_worker_loop, dataloader.py:97
run, process.py:93
_bootstrap, process.py:258
_launch, popen_fork.py:73
__init__, popen_fork.py:19
_Popen, context.py:277
_Popen, context.py:223
start, process.py:105
__init__, dataloader.py:289
__iter__, dataloader.py:501
__iter__, __init__.py:90
<module>, train.py:43

followed by the frames of the debugger that calls the train script:

execfile, _pydev_execfile.py:18
run, pydevd.py:1135
main, pydevd.py:1735
<module>, pydevd.py:1741

I still have no clue as to what makes this happen. Any thoughts? Is it possible that this has something to do with the fact that I did not explicitly start the visdom server myself? I'll try to keep looking whilst debugging, but this "low-level" code is way out of my comfort zone, so the help of anyone who knows more about this kind of issue is very much appreciated. Thanks!

olivier-gillet commented 5 years ago

I have the same issue, always stopping at 15 epochs. Likewise, I was not starting visdom manually. So I disabled visdom and it is now working. I use TensorBoard instead.
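Roughly what that logging looks like, as a minimal sketch (assuming torch.utils.tensorboard is available, i.e. PyTorch >= 1.1 with the tensorboard package installed; the run name and loss values below are just placeholders):

    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir="runs/cyclegan_experiment")  # placeholder run name

    # Log each loss value under its own tag; `step` would be the iteration count.
    fake_losses = {"G_GAN": 0.7, "D_real": 0.3, "D_fake": 0.4}  # placeholder values
    for step in range(3):
        for name, value in fake_losses.items():
            writer.add_scalar(name, value, step)

    writer.close()

On older PyTorch versions, tensorboardX offers the same SummaryWriter interface.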

ssnl commented 5 years ago

@JosseVanDelm The PyTorch code you linked runs in the worker process, and it is supposed to be an infinite loop until the main process sends a signal or dies. The hang could very well be in the main process.
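If you want to confirm where the main process is stuck the next time this happens, a minimal sketch using only the standard library (the choice of SIGUSR1 is arbitrary; Unix only) is to register a stack dump near the top of train.py and send the signal from another shell once it hangs:

    import faulthandler
    import signal
    import sys

    # Dump the Python stack of every thread in this process when it receives
    # SIGUSR1, e.g. via `kill -USR1 <pid>` from another terminal (Unix only).
    faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

The traceback printed to stderr at that point should show whether the main loop is waiting on the data queue, on visdom, or somewhere else.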

That said, a lot of improvements were made to the data loader between 0.4.1 and 1.0.0. If the hang is indeed related to the dataloader, upgrading may resolve it.
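If upgrading is not an option right away, a quick way to rule the worker processes out is to keep data loading in the main process, i.e. num_workers=0 (in this repo the worker count comes from a command-line option, something like --num_threads depending on the version, so setting that to 0 should have the same effect). A minimal sketch with a stand-in dataset:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(8, 3, 256, 256))  # stand-in for the image dataset

    # num_workers=0 keeps all loading in the main process, so no worker
    # process can hang or deadlock; if the freeze disappears, the data
    # loader workers are the likely culprit.
    loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=0)

    for (batch,) in loader:
        pass  # training step would go here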

JosseVanDelm commented 5 years ago

Thank you for your comments @SsnL and @olivier-gillet. This weekend I ran 26 consecutive trainings without a hitch on the DGX server. This time I used a slightly newer version of the Docker container and passed the display_id=0 option to every training to disable visdom. I still have no clue what causes the issue. Maybe the reason for the hang really is that I didn't start visdom manually beforehand, as @olivier-gillet pointed out as well? I say that because the last time I trained on the DGX I used this container, which ships PyTorch commit 81e025d (which is past version 1.0.0, if I am correct), and I still had the issue there, so the dataloader upgrade alone does not seem to explain it.
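If the missing visdom server really is the trigger, a pre-flight check along these lines could make that explicit before the next long run (a purely hypothetical sketch; port 8097 is visdom's default port):

    import visdom

    # Check whether a visdom server is reachable before starting a long run;
    # if it is not, one could fall back to disabling the display (display_id 0).
    viz = visdom.Visdom(server="http://localhost", port=8097, raise_exceptions=False)
    if viz.check_connection():
        print("visdom server found at localhost:8097")
    else:
        print("no visdom server reachable; consider --display_id 0 or starting one with `python -m visdom.server`")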