Errors during validation and checkpoint saving

Alexandre-Delplanque commented 4 years ago

Hello!

Several problems appear when I use your code.

First of all, my data are 4000x6000 drone images of animals, previously cut into 400x400 tiles. I work on Windows 10 in the Anaconda prompt, with a 4GB GPU.

I run a training with this command: python -m object-locator.train --train-dir mytraindir --batch-size 1 --visdom-env training --visdom-server localhost --visdom-port 8097 --epochs 2 --lr 1e-3 --val-dir myvaldir --save otherdir\saved_model.ckpt --nThreads 1 --imgsize 400x400

Training is going well. However, an error occurs during the validation, at each loaded image: C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide array_scaled = ((array - minn)/(maxx - minn)*255)

It seems that the denominator is zero, but I do not know where that comes from.

Finally, at the end of this pseudo-validation, this error appears: E: Don't overwrite a checkpoint without resuming from it. Are you sure you want to do that? (if you do, remove it manually).

And I get a broken pipe. So I can't do more than one epoch.

I also tried without validation and I get this error:

  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\train.py", line 273, in <module>
    if args.save and (epoch + 1) % args.val_freq == 0:
ZeroDivisionError: integer division or modulo by zero

I'm stuck. What can I do to fix this?

Here is a sample of my dataset: https://we.tl/t-ybaoqdEqha

Thank you in advance for your answers!

Alexandre

javiribera commented 4 years ago

The overwrite error has nothing to do with anything else. It comes from parsing your command line (https://github.com/javiribera/locating-objects-without-bboxes/blob/4923d109e4e0a52cb987e8f367c20e45513b06c7/object-locator/argparser.py#L241) and it's exactly describing what your problem is and how to solve it.
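
For readers who do not want to follow the link, the check behind this message can be pictured roughly as below. This is a minimal sketch, not the code in argparser.py: the function name `check_save_path` and its `save`/`resume` parameters are hypothetical, and it assumes that the only allowed way to write to an existing checkpoint file is to resume from that same file.

```python
import os
import tempfile


def check_save_path(save, resume=None):
    """Return True if it is safe to write a checkpoint to `save`.

    Hypothetical helper: assumes an existing file may only be overwritten
    when training was resumed from that very file.
    """
    if not save:
        return True  # nothing will be saved
    if not os.path.isfile(save):
        return True  # new file, nothing to overwrite
    return resume is not None and os.path.abspath(resume) == os.path.abspath(save)


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "saved_model.ckpt")
    print(check_save_path(ckpt))               # True: file does not exist yet
    open(ckpt, "w").close()                    # simulate a checkpoint from an earlier run
    print(check_save_path(ckpt))               # False: exists, not resuming
    print(check_save_path(ckpt, resume=ckpt))  # True: exists, resuming from it
```

Under that assumption, deleting the old file, saving to a new path, or resuming from the same file would each avoid the message.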

Also, your ZeroDivisionError is self-explanatory. Did you set a breakpoint at train.py:L273 and find out why args.val_freq is 0? If not, please do so.

Alexandre-Delplanque commented 4 years ago

Thank you for your answers!

Yes, I understand the overwrite error. I naively thought at first that this was the error that led to the broken pipe, but in reality it wasn't.

It's the first error I mentioned that I'm having trouble understanding. It seems to come from line 85 of utils.py: minn, maxx = array.min(), array.max() The variables are equal to 0, and that is what would cause the error.

Is it due to a package version problem?

Because during the installation of the environment, I additionally had to install:

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
pip install peterpy
pip install ballpark
pip install visdom

In addition, I had to add to line 159 of train.py: if __name__ == '__main__': Because of this error: https://pytorch.org/docs/stable/notes/windows.html#multiprocessing-error-without-if-clause-protection

Sorry for my English and for my perhaps basic programming misunderstandings; I'm a beginner.

Alexandre-Delplanque commented 4 years ago

I checked: it's not due to the package versions, the same error appears.

javiribera commented 4 years ago

ZeroDivisionError is self-explanatory. Did you set a breakpoint at train.py:L273 and find out why args.val_freq is 0? If not, please do so.

Alexandre-Delplanque commented 4 years ago

args.val_freq is 0 because I set this value on the command line, to skip the validation part. This option was mentioned in the help.
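
The traceback points at the expression `(epoch + 1) % args.val_freq`, which Python cannot evaluate when the right-hand side is 0. Below is a minimal sketch of a guard, assuming that 0 is meant to disable validation; `is_validation_epoch` is a hypothetical helper, not a function from the repository.

```python
def is_validation_epoch(epoch, val_freq):
    """True if validation should run after the zero-based `epoch`."""
    if val_freq <= 0:
        return False  # assumed meaning: validation disabled
    return (epoch + 1) % val_freq == 0


print([e for e in range(10) if is_validation_epoch(e, 5)])  # [4, 9]
print([e for e in range(10) if is_validation_epoch(e, 0)])  # []
```

Checking the frequency before taking the modulo keeps the "disabled" case out of the arithmetic entirely.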

javiribera commented 4 years ago

This is inconsistent with the command line you said you used in the first post. Please post the command you actually ran and the entire standard output you get in your console, right after it.

Alexandre-Delplanque commented 4 years ago

Sorry for the misunderstanding.

The command line I used is the one I mentioned above, in the first post:

python -m object-locator.train --train-dir mytraindir --batch-size 1 --visdom-env training --visdom-server localhost --visdom-port 8097 --epochs 2 --lr 1e-3 --val-dir myvaldir --save otherdir\saved_model.ckpt --nThreads 1 --imgsize 400x400

I said in the post that I tried to train without validation, so with another command line (the same one to which I added "--val-freq 0"), just to check if the error appears only during validation.

So, I need help with:

C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide array_scaled = ((array - minn)/(maxx - minn)*255)

Here is the entire output:

Validating Epoch 0 (529 images):   0%|                                                         | 0/529 [00:00<?, ?it/s]Connected to Visdom server http://localhost:8097
# images for training: 1.02K
# images for validation: 529
Building network...  with 76.6M trainable parameters. DONE (took 0.439392 seconds)
Validating Epoch 0 (529 images):   0%|                     | 0/529 [00:03<?, ?it/s, avg_val_loss_this_epoch=292.9-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images):   1%|             | 3/529 [00:04<14:22,  1.64s/it, avg_val_loss_this_epoch=262.2-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images):   1%|▏            | 7/529 [00:06<07:30,  1.16it/s, avg_val_loss_this_epoch=249.6-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \

... same warning at each percent ...


Validating Epoch 0 (529 images):  96%|██████████▌| 506/529 [02:30<00:06,  3.36it/s, avg_val_loss_this_epoch=245.1-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images):  96%|██████████▌| 507/529 [02:30<00:06,  3.36it/s, avg_val_loss_this_epoch=245.2-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images):  96%|██████████▌| 508/529 [02:31<00:06,  3.36it/s, avg_val_loss_this_epoch=245.1-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images):  96%|██████████▌| 509/529 [02:31<00:05,  3.37it/s, avg_val_loss_this_epoch=245.2-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images):  97%|██████████▋| 512/529 [02:31<00:05,  3.37it/s, avg_val_loss_this_epoch=245.7-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images):  98%|██████████▊| 520/529 [02:34<00:02,  3.36it/s, avg_val_loss_this_epoch=245.3-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images):  98%|██████████▊| 521/529 [02:34<00:02,  3.37it/s, avg_val_loss_this_epoch=245.3-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images):  99%|██████████▉| 523/529 [02:35<00:01,  3.37it/s, avg_val_loss_this_epoch=245.4-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images):  99%|██████████▉| 525/529 [02:35<00:01,  3.37it/s, avg_val_loss_this_epoch=245.5-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 100%|██████████▉| 528/529 [02:36<00:00,  3.37it/s, avg_val_loss_this_epoch=245.1-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
  array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 100%|███████████| 529/529 [02:36<00:00,  3.37it/s, avg_val_loss_this_epoch=245.1-----]
C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\scipy\stats\stats.py:3003: RuntimeWarning: invalid value encountered in double_scalars
  r = r_num / r_den
C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\scipy\stats\stats.py:5240: RuntimeWarning: invalid value encountered in less
  x = np.where(x < 1.0, x, 1.0)  # if x > 1 then return 1.0
Saved best checkpoint so far in C:\Users\delplaal\Desktop\Point_Detection_M1\saved_model.ckpt
Epoch 1 (1017 images):   0%|                                                                  | 0/1017 [00:00<?, ?it/s]E: Don't overwrite a checkpoint without resuming from it. Are you sure you want to do that? (if you do, remove it manually).
Traceback (most recent call last):
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\train.py", line 171, in <module>
    for batch_idx, (imgs, dictionaries) in enumerate(iter_train):
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\tqdm\_tqdm.py", line 940, in __iter__
    for obj in iterable:
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
    w.start()
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe

javiribera commented 4 years ago

This is probably a divide-by-zero error, which happens if minn==maxx in this line https://github.com/javiribera/locating-objects-without-bboxes/blob/master/object-locator/utils.py#L85

Can you check if minn==maxx all the time? And what is that value?
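
To illustrate the suspected failure mode: min-max scaling divides by `maxx - minn`, so a constant array gives 0/0, which NumPy turns into NaN together with an "invalid value" RuntimeWarning. The sketch below reproduces that and shows one possible guard. `scale_to_uint8` is a hypothetical helper, and mapping a constant array to all zeros is an arbitrary choice for illustration, not what utils.py does.

```python
import warnings

import numpy as np


def scale_to_uint8(array):
    """Min-max scale `array` to [0, 255]; a constant array maps to zeros."""
    minn, maxx = array.min(), array.max()
    if maxx == minn:
        # Avoid 0/0: there is no contrast to stretch.
        return np.zeros_like(array, dtype=np.uint8)
    return ((array - minn) / (maxx - minn) * 255).astype(np.uint8)


constant = np.ones((4, 4), dtype=np.float32)

# Unguarded version: 0/0 everywhere, so every entry becomes NaN.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    unguarded = (constant - constant.min()) / (constant.max() - constant.min()) * 255
print(np.isnan(unguarded).all())  # True

print(scale_to_uint8(constant).max())             # 0
print(scale_to_uint8(np.array([0.0, 0.5, 1.0])))  # [  0 127 255]
```

The warning itself does not stop the program; it only signals that the scaled array is no longer meaningful.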

Alexandre-Delplanque commented 4 years ago

I set a breakpoint at https://github.com/javiribera/locating-objects-without-bboxes/blob/master/object-locator/utils.py#L85 and minn==maxx all the time.

I also checked the array values: it contains 1. exclusively. Therefore minn and maxx are both 1.

I also set a breakpoint at https://github.com/javiribera/locating-objects-without-bboxes/blob/master/object-locator/train.py#L334 and est_maps is filled with 1.

Alexandre-Delplanque commented 4 years ago

Let me reopen the discussion. Have you found a cause for these values?

javiribera commented 4 years ago

If est_maps is all 1's, it means that the network is saying that there are objects everywhere. For that corner case, inference will not be possible. But that case is really strange (I've never seen it), maybe your network is not learning anything even after 1 epoch. Try training for longer before validation (say --val-freq 10) or removing validation altogether. And show your training curves.
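
One way to confirm this corner case during training is to measure how much of the estimated map is saturated. This is a minimal sketch, assuming the map is available as a NumPy array of probabilities in [0, 1]; `saturated_fraction` and the 0.99 threshold are hypothetical choices for illustration, not part of the repository.

```python
import numpy as np


def saturated_fraction(est_map, threshold=0.99):
    """Fraction of pixels whose estimated probability is at least `threshold`."""
    est_map = np.asarray(est_map, dtype=np.float64)
    return float((est_map >= threshold).mean())


all_ones = np.ones((256, 256))
sparse = np.zeros((256, 256))
sparse[100:104, 50:54] = 1.0  # one small 4x4 blob

print(saturated_fraction(all_ones))  # 1.0
print(saturated_fraction(sparse))    # 0.000244140625
```

A value close to 1.0 over many consecutive images would match the "objects everywhere" situation described above.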

Alexandre-Delplanque commented 4 years ago

Thank you for your response.

I looked at the code and realized that this is indeed what is happening. In fact, I noticed that during training, if the network identified the whole image as an object (so est_maps filled with 1's) several times in a row, this remained the case for the rest of the training images in the epoch. As a result, by the time of validation the network had not learned much, and the same phenomenon was observed.

So I tried with 256x256 images containing objects, with a learning rate of 1e-4 and validation every 5 epochs, and it works. Here are the training curves:

[image: newplot]

javiribera commented 4 years ago

This is quite strange behaviour. Your network is not training very well. There is also something contradictory: if your estimated probability map is all 1, then I expect term2=0 (there is a proof in the other direction in the first part of the proof in the supplemental PDF of the paper).

If you want to investigate what's wrong, you can start by trying to get the first term of the loss small. That should be easy, and if you don't see the blobs get smaller, it would indicate a problem. You can focus on the first term by setting --lambda 0, and killing the second term by adding terms_2 *= 0 right after line https://github.com/javiribera/locating-objects-without-bboxes/blob/d8485608c2625d675e61dfd692da675e8dde4225/object-locator/losses.py#L236

javiribera commented 4 years ago

Oh, I misread. So by setting validation every 5 epochs, you're saying you get good results, even with the training plots you showed?

Alexandre-Delplanque commented 4 years ago

Thank you very much for your answers!

No, I don't get good results. As you see, losses stagnate and do not decrease significantly after the first epochs.

And the blobs decrease at first, locate the objects for a majority of the images, then do not decrease anymore ...

I tried to decrease the LR, but it doesn't seem to improve the results.

javiribera commented 4 years ago

Try this

If you want to investigate what's wrong, you can start by trying to get the first term of the loss small. That should be easy, and if you don't see the blobs get smaller, it would indicate a problem. You can focus on the first term by setting --lambda 0, and killing the second term by adding terms_2 *= 0 right after line

https://github.com/javiribera/locating-objects-without-bboxes/blob/d8485608c2625d675e61dfd692da675e8dde4225/object-locator/losses.py#L236

Alexandre-Delplanque commented 4 years ago

Thank you!

I trained for 10 epochs with LR=1e-4. I set lambda to 0.0001 because it's impossible to set it to 0 (https://github.com/javiribera/locating-objects-without-bboxes/blob/master/object-locator/argparser.py#L217), and I killed term 2 as you suggested.

Here is the training plot: [image: newplot (1)]

The blobs shrink but disappear from epoch 8 onwards.

javiribera commented 4 years ago

The original problem of this issue is solved (the crashing), so I'm closing it.

This new problem (that you can't train well) is unrelated to that. It seems that the convergence is very noisy, so you will have to fine-tune the training to the dataset. Maybe increase the batch size? This code is not going to help you experiment with hyperparameters.

javiribera / locating-objects-without-bboxes

Errors during validation and checkpoint saving #24