Closed Alexandre-Delplanque closed 4 years ago
The overwrite error has nothing to do with anything else. It comes from parsing your command line (https://github.com/javiribera/locating-objects-without-bboxes/blob/4923d109e4e0a52cb987e8f367c20e45513b06c7/object-locator/argparser.py#L241) and it describes exactly what your problem is and how to solve it.
Also, your ZeroDivisionError is self-explanatory. Did you set a breakpoint at train.py:L273 and find out why args.val_freq is 0? If not, please do so.
Thank you for your answers!
Yes, I understand the overwrite error. I naively thought at first that this was the error that led to the broken pipe, but in reality it wasn't.
It's the first error I mentioned that I'm having trouble understanding. It seems to come from line 85 of utils.py:
minn, maxx = array.min(), array.max()
The variables are equal to 0. And that's what would cause the error.
Is it due to a package version problem?
Because during the installation of the environment, I additionally had to install:
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
pip install peterpy
pip install ballpark
pip install visdom
In addition, I had to add to line 159 of train.py:
if __name__ == '__main__':
Because of this error:
https://pytorch.org/docs/stable/notes/windows.html#multiprocessing-error-without-if-clause-protection
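For readers hitting the same Windows error, here is a minimal sketch of the guard pattern that the linked PyTorch note describes. It uses only the standard library's multiprocessing module rather than the DataLoader in train.py, and the names square and main are hypothetical, chosen for illustration. On Windows, worker processes are started with "spawn", which re-imports the main module, so any code that starts workers has to sit behind the guard.

```python
import multiprocessing as mp


def square(x):
    # Work executed in a worker process; it must be defined at module
    # level so that spawned workers can import it.
    return x * x


def main():
    # Starting workers re-imports this module on Windows ("spawn"),
    # so this function is only called from behind the guard below.
    with mp.Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3])


if __name__ == "__main__":
    # Without this guard, each spawned worker would try to start
    # its own workers while importing the module.
    print(main())
```

The same idea applies to a script that iterates over a DataLoader with worker processes: the loop that creates the iterator belongs inside the guarded block.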
Sorry for my English and for my perhaps basic programming misunderstandings; I'm a beginner.
I checked: it's not due to the package versions; the same error appears.
ZeroDivisionError is self-explanatory. Did you set a breakpoint at train.py:L273 and find out why args.val_freq is 0? If not, please do so.
args.val_freq is 0 because I specified this value on the command line to skip the validation step, as mentioned in the help text.
This is inconsistent with the command line you said you used in the first post. Please post the command you actually ran and the entire standard output you get in your console right after it.
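To illustrate why a frequency of 0 triggers the ZeroDivisionError in an expression such as `(epoch + 1) % args.val_freq`, here is a minimal sketch of a guarded check. The helper name should_validate is hypothetical and is not part of the repository; it only shows one way to treat 0 as "never" before taking the modulo.

```python
def should_validate(epoch, val_freq):
    # Treat a non-positive frequency as "never validate" so that the
    # modulo below is only evaluated with a non-zero divisor.
    return val_freq > 0 and (epoch + 1) % val_freq == 0


# Epochs (0-based) at which validation would run for 10 epochs.
print([e for e in range(10) if should_validate(e, 5)])
print([e for e in range(10) if should_validate(e, 0)])
```

Because `and` short-circuits, the modulo is skipped whenever val_freq is 0.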
Sorry for the misunderstanding.
The command line I used is the one that I mentioned above, in the first post, which is:
python -m object-locator.train --train-dir mytraindir --batch-size 1 --visdom-env training --visdom-server localhost --visdom-port 8097 --epochs 2 --lr 1e-3 --val-dir myvaldir --save otherdir\saved_model.ckpt --nThreads 1 --imgsize 400x400
I said in the post that I tried to train without validation, so with another command line (the same one to which I added "--val-freq 0"), just to check if the error appears only during validation.
So, I need help with:
C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide array_scaled = ((array - minn)/(maxx - minn)*255)
Here is the entire output :
Validating Epoch 0 (529 images): 0%| | 0/529 [00:00<?, ?it/s]Connected to Visdom server http://localhost:8097
# images for training: 1.02K
# images for validation: 529
Building network... with 76.6M trainable parameters. DONE (took 0.439392 seconds)
Validating Epoch 0 (529 images): 0%| | 0/529 [00:03<?, ?it/s, avg_val_loss_this_epoch=292.9-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 1%| | 3/529 [00:04<14:22, 1.64s/it, avg_val_loss_this_epoch=262.2-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 1%|▏ | 7/529 [00:06<07:30, 1.16it/s, avg_val_loss_this_epoch=249.6-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
... idem at each percent ...
Validating Epoch 0 (529 images): 96%|██████████▌| 506/529 [02:30<00:06, 3.36it/s, avg_val_loss_this_epoch=245.1-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 96%|██████████▌| 507/529 [02:30<00:06, 3.36it/s, avg_val_loss_this_epoch=245.2-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 96%|██████████▌| 508/529 [02:31<00:06, 3.36it/s, avg_val_loss_this_epoch=245.1-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 96%|██████████▌| 509/529 [02:31<00:05, 3.37it/s, avg_val_loss_this_epoch=245.2-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 97%|██████████▋| 512/529 [02:31<00:05, 3.37it/s, avg_val_loss_this_epoch=245.7-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 98%|██████████▊| 520/529 [02:34<00:02, 3.36it/s, avg_val_loss_this_epoch=245.3-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 98%|██████████▊| 521/529 [02:34<00:02, 3.37it/s, avg_val_loss_this_epoch=245.3-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 99%|██████████▉| 523/529 [02:35<00:01, 3.37it/s, avg_val_loss_this_epoch=245.4-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 99%|██████████▉| 525/529 [02:35<00:01, 3.37it/s, avg_val_loss_this_epoch=245.5-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 100%|██████████▉| 528/529 [02:36<00:00, 3.37it/s, avg_val_loss_this_epoch=245.1-----]C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide
array_scaled = ((array - minn)/(maxx - minn)*255) \
Validating Epoch 0 (529 images): 100%|███████████| 529/529 [02:36<00:00, 3.37it/s, avg_val_loss_this_epoch=245.1-----]
C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\scipy\stats\stats.py:3003: RuntimeWarning: invalid value encountered in double_scalars
r = r_num / r_den
C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\scipy\stats\stats.py:5240: RuntimeWarning: invalid value encountered in less
x = np.where(x < 1.0, x, 1.0) # if x > 1 then return 1.0
Saved best checkpoint so far in C:\Users\delplaal\Desktop\Point_Detection_M1\saved_model.ckpt
Epoch 1 (1017 images): 0%| | 0/1017 [00:00<?, ?it/s]E: Don't overwrite a checkpoint without resuming from it. Are you sure you want to do that? (if you do, remove it manually).
Traceback (most recent call last):
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\train.py", line 171, in <module>
for batch_idx, (imgs, dictionaries) in enumerate(iter_train):
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\tqdm\_tqdm.py", line 940, in __iter__
for obj in iterable:
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
return _MultiProcessingDataLoaderIter(self)
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
w.start()
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe
This is probably a divide-by-zero error, which happens if minn==maxx in this line https://github.com/javiribera/locating-objects-without-bboxes/blob/master/object-locator/utils.py#L85
Can you check whether minn==maxx all the time? And what is that value?
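The suspected failure mode can be reproduced in isolation. The sketch below is not the repository's utils.py; scale_to_uint8 is a hypothetical helper, and returning an all-zero image for constant input is an assumption made for illustration. It shows that min-max scaling of a constant array computes 0/0, which NumPy reports as "invalid value encountered" and turns into NaN, and how a guard avoids it.

```python
import numpy as np


def scale_to_uint8(array):
    # Min-max scale an array to the range [0, 255].
    array = np.asarray(array, dtype=np.float64)
    minn, maxx = array.min(), array.max()
    if maxx == minn:
        # Constant input: the usual formula would compute 0/0.
        # Returning zeros here is an assumption for this sketch.
        return np.zeros(array.shape, dtype=np.uint8)
    scaled = (array - minn) / (maxx - minn) * 255
    return scaled.round().astype(np.uint8)


# A map filled with a single value, like the one seen at the breakpoint.
ones = np.ones((4, 4))

# Unguarded formula: every element is 0/0, which yields NaN.
with np.errstate(invalid="ignore"):
    naive = (ones - ones.min()) / (ones.max() - ones.min()) * 255

print(np.isnan(naive).all())
print(scale_to_uint8(ones).max())
```

The guard only silences the warning; a constant map still means there is nothing useful to display, so the underlying cause has to be found in training.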
I set a breakpoint at https://github.com/javiribera/locating-objects-without-bboxes/blob/master/object-locator/utils.py#L85 and minn==maxx all the time.
I also checked the array values: it contains only 1., so minn and maxx are both 1.
I also set a breakpoint at https://github.com/javiribera/locating-objects-without-bboxes/blob/master/object-locator/train.py#L334 and est_maps is filled with 1.
Let me reopen the discussion. Have you found a cause for these values?
If est_maps is all 1's, it means that the network is saying that there are objects everywhere. For that corner case, inference will not be possible. But that case is really strange (I've never seen it), maybe your network is not learning anything even after 1 epoch. Try training for longer before validation (say --val-freq 10) or removing validation altogether. And show your training curves.
Thank you for your response.
I looked at the code and realized that this is indeed the cause. I noticed that during training, once the network identified the whole image as an object (est_maps filled with 1's) several times in a row, it kept doing so for the rest of the training images in the epoch. As a result, the network had not learned much by validation time, and the same phenomenon showed up there.
So I tried with 256x256 images containing objects, with an lr of 1e-4 and validation every 5 epochs, and it works. Here are the training curves:
This is quite strange behaviour. Your network is not training very well. There is also something contradictory. If your estimated probability map is all 1, then I expect term2=0 (there is a proof in the other direction in the first part of the proof in the supplemental PDF of the paper).
If you want to investigate what's wrong, you can start by trying to get the first term of the loss small. That should be easy, and if you don't see the blobs get smaller, it would indicate a problem. You can focus on the first term by setting --lambda 0 and killing the second term by adding
terms_2 *= 0
right after line https://github.com/javiribera/locating-objects-without-bboxes/blob/d8485608c2625d675e61dfd692da675e8dde4225/object-locator/losses.py#L236
Oh, I misread. So by setting validation every 5 epochs, you're saying you get good results, even with the training plots you showed?
Thank you very much for your answers!
No, I don't get good results. As you see, losses stagnate and do not decrease significantly after the first epochs.
And the blobs decrease at first, locate the objects for a majority of the images, then do not decrease anymore ...
I tried to decrease the LR, but it doesn't seem to improve the results.
Try this
If you want to investigate what's wrong, you can start by trying to get the first term of the loss small. That should be easy, and if you don't see the blobs get smaller, it would indicate a problem. You can focus on the first term by setting --lambda 0 and killing the second term by adding
terms_2 *= 0
right after the line linked in my previous comment.
Thank you!
I tried for 10 epochs with LR=1e-4. I set lambda to 0.0001 because it's impossible to set the value to 0 (https://github.com/javiribera/locating-objects-without-bboxes/blob/master/object-locator/argparser.py#L217), and I killed term 2 as you suggested.
Here is the training plot:
The blobs decrease but disappear from epoch 8....
The original problem of this issue is solved (the crashing), so I'm closing it.
This new problem (that you can't train well) is unrelated to that. It seems that the convergence is very noisy, so you will have to fine-tune the training to the dataset. Maybe increase the batch size? This code is not going to help you experiment with hyperparameters.
Hello!
Several problems appear when I use your code.
First of all, my data are 4000x6000 drone images of animals, previously cut into 400x400 tiles. I work on Windows 10 in the Anaconda Prompt, with a 4GB GPU.
I run a training with this command:
python -m object-locator.train --train-dir mytraindir --batch-size 1 --visdom-env training --visdom-server localhost --visdom-port 8097 --epochs 2 --lr 1e-3 --val-dir myvaldir --save otherdir\saved_model.ckpt --nThreads 1 --imgsize 400x400
Training is going well. However, an error occurs during the validation, at each loaded image:
C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\utils.py:86: RuntimeWarning: invalid value encountered in true_divide array_scaled = ((array - minn)/(maxx - minn)*255)
It seems that the denominator is zero, but I do not know where that comes from.
Finally, at the end of this pseudo-validation, this error appears:
E: Don't overwrite a checkpoint without resuming from it. Are you sure you want to do that? (if you do, remove it manually).
And I get a broken pipe. So I can't do more than one epoch.
I also tried without validation and I get this error:
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\delplaal\AppData\Local\Continuum\anaconda3\envs\object-locator\lib\site-packages\object-locator\train.py", line 273, in <module>
if args.save and (epoch + 1) % args.val_freq == 0:
ZeroDivisionError: integer division or modulo by zero
I'm stuck. What can I do to fix this?
Here is a sample of my dataset: https://we.tl/t-ybaoqdEqha
Thank you in advance for your answers!
Alexandre