lmb-freiburg / Unet-Segmentation

The U-Net Segmentation plugin for Fiji (ImageJ)
https://lmb.informatik.uni-freiburg.de/resources/opensource/unet
GNU General Public License v3.0

Finetuning ERROR: unknown command line flag #15

Closed by auesro 5 years ago

auesro commented 5 years ago

Dear all,

I just installed U-Net on my local computer, running Ubuntu 16.04, nvidia-driver 410, CUDA 10, cuDNN v7, etc. I installed from the pre-compiled binaries. Segmentation using the provided models works as intended. However, starting the Finetuning function gives the following errors:

[screenshot 2019-02-09 19-01-43]

As far as I can understand, caffe_unet does not include the flags sigint_effect or solver, while caffe does.

caffe in Terminal: [screenshot 2019-02-09 19-04-05]

caffe_unet in Terminal: [screenshot 2019-02-09 19-05-25]

Any idea on how to solve this?

Thank you very much

ThorstenFalk commented 5 years ago

Finetuning is done using caffe, not caffe_unet. Edit your IJPrefs.txt, search for .unet.caffeBinary, and replace caffe_unet with caffe; then finetuning should work as expected.
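For reference, a sketch of what the relevant line should end up looking like; depending on your setup the value may also be a full path (e.g. /home/you/caffe/build/tools/caffe, the path here is just an example):

```
.unet.caffeBinary=caffe
```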

auesro commented 5 years ago

Thanks for the quick reply.

Apparently, after making the suggested change, another issue comes up when running the Segmentation example from your video tutorial:

[screenshot 2019-02-09 23-39-30] [screenshot 2019-02-09 23-40-26]

No clue what's wrong now...

ThorstenFalk commented 5 years ago

Me neither. "unknown error 30" is hard to track down, and I never encountered it. Is this reproducible after a GPU reset?

ThorstenFalk commented 5 years ago

I just tested it with your Ubuntu / CUDA / cuDNN combination. The corresponding tarball caffe_unet_package_16.04_gpu_cuda10_cudnn7.tar.gz works for me on an NVIDIA TitanX (Maxwell).

auesro commented 5 years ago

Dear Thorsten, thanks for your help. I ended up compiling from source since I needed CUDA 9.0 (due to another piece of software). Compilation went through without problems. The provided Segmentation, Detection, and Finetuning examples ran without issue; however, finetuning on the Optogenetic example gives the following error:

[screenshot 2019-02-10 15-32-59]

A quick search in Google yields nothing I can make sense of...

ThorstenFalk commented 5 years ago

The error occurs at iteration 103? That's very strange. The error looks like one of your input hdf5 files is corrupt, but then the job should have failed within the first 7 iterations (assuming 7 train files).

Concurrent file access or hardware failure are my best bets in this case. Is the problem reproducible?
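If it does happen again, one way to rule out file corruption is to force a full read of every training HDF5 file, e.g. with h5py (just a sketch, the glob pattern is a placeholder for wherever your training files live):

```python
import glob
import h5py

def read_all(name, obj):
    # Force a complete read of every dataset; corrupt or truncated
    # files typically raise an OSError at this point.
    if isinstance(obj, h5py.Dataset):
        _ = obj[()]

for path in sorted(glob.glob("/path/to/finetune_data/*.h5")):
    try:
        with h5py.File(path, "r") as f:
            f.visititems(read_all)
        print("OK     ", path)
    except OSError as err:
        print("BROKEN ", path, "--", err)
```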

auesro commented 5 years ago

I will try to reproduce it and report back.

auesro commented 5 years ago

So I installed U-Net on a different workstation and it runs smoothly; the only difference is Ubuntu 16.04 (first computer) vs Linux Mint Cinnamon 19.1 (second computer). So I guess something is wrong in the Ubuntu installation...

Now I started finetuning the original 2D cell net on my own data, but I am getting very poor detection results. How essential is it to provide raw data to the network? I started training on images which had been contrast/brightness adjusted...was that a mistake?

ThorstenFalk commented 5 years ago

Interesting, I never had problems with Ubuntu, but if it works on Mint, great!

The plugin normalizes the input data to the [0,1] range anyway. Thus an affine intensity transformation (i.e. contrast/brightness adjustment) does not matter, as long as you do not clip values.
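To make the clipping point concrete, a tiny made-up example (an illustration, not the plugin's actual code): an affine rescale preserves the relative intensities, whereas clipping maps different raw values to the same number, and that information cannot be recovered:

```python
import numpy as np

img = np.array([10., 50., 200., 250.])   # made-up raw intensities

# Affine normalization to [0, 1]: relative differences are preserved.
norm = (img - img.min()) / (img.max() - img.min())

# Aggressive black/white point adjustment clips values outside the window.
clipped = np.clip(img, 50., 200.)
clipped_norm = (clipped - clipped.min()) / (clipped.max() - clipped.min())

print(norm)          # [0.     0.1667 0.7917 1.    ]
print(clipped_norm)  # [0. 0. 1. 1.]  -> 10 and 50 are now indistinguishable
```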

Annotation ambiguities are the more likely cause of poor performance. Although at first it sounds easier to only put a dot on each instance you want to detect, it can be tricky to place these dots consistently. All instances must be marked at a unique 0-D structure (e.g. the center of circular/spherical structures, edge intersections, plane intersections).

Also make sure that all instances without annotations lie in ignore regions, otherwise the network assumes they are background and will be confused.

auesro commented 5 years ago

Mmmm "not clip values"...I might have done that. I used the Levels tool in Photoshop so I think it does actually clip the values when you set the whites and blacks...

Here are some screenshots from U-Net: [loss curves] The loss seems to increase crazily after 6000 iterations...why?

[segmentation IoU curves]

[detection curves]

Detection is really poor. I was training for 5 categories as you can see.

ThorstenFalk commented 5 years ago

The loss curves look weird. First, they drop to exactly zero quite frequently. Your positive samples are very sparse, right? Then the model has a good chance of "guessing right" when it simply predicts everything as background. Normalization probably makes it worse. This class imbalance will lead to overconfident gradient steps of Adam. It seems to learn, but I would definitely try a lower learning rate to avoid "explosions" like the ones you observed.
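To make the sparsity argument concrete, a small made-up example (tile size and numbers are invented, not taken from your data): a model that predicts "background everywhere" already reaches a near-zero average cross-entropy on a tile with only a handful of foreground pixels:

```python
import numpy as np

# Made-up 508x508 tile with only 25 foreground pixels.
labels = np.zeros((508, 508), dtype=int)
labels[250:255, 250:255] = 1

# A lazy model that predicts "background" with 99% confidence everywhere.
p_fg = np.full(labels.shape, 0.01)

# Pixel-wise binary cross-entropy, averaged over the tile.
ce = -(labels * np.log(p_fg) + (1 - labels) * np.log(1 - p_fg))
print(ce.mean())   # ~0.01: looks like an excellent loss despite zero detections
```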

auesro commented 5 years ago

I tried reducing the learning rate to 1E-2...it didn't seem to help: [screenshot: loss curves at 1E-2]

Here is an example of what I'm training on (spinal cord section stained for 2 markers): [screenshot 2019-02-25 09-19-54] And a higher magnification of the annotation: [screenshot 2019-02-25 09-21-01]

Does it look so difficult?

ThorstenFalk commented 5 years ago

Thanks for the example images. The problem indeed does not look too complicated, although I am not 100% sure I could tell the five classes apart from these images alone.

Two remarks:

- 1E-2 is higher than the original learning rate of 1E-4, so you actually increased the learning rate; try going lower instead.
- Are the ignore regions in your annotations set up as intended? All instances you did not annotate should lie inside an ignore region.

auesro commented 5 years ago

You are right, the example image (for simplicity) does not contain the 5 classes, only 3 (green, red and green+red).

My mistake, of course I increased the learning rate instead of decreasing it...I will give it another try. BUT...it is always the small things...THANKS for asking such a non-silly question: I had forgotten every time to invert the ignore region. My intention was to ignore everything OUTSIDE the magenta ROI, but obviously I was ignoring everything inside it, which made it hard for the algorithm to learn anything!!

Again, thanks a lot!

I will give it another try with correct setup and will report back.

auesro commented 5 years ago

Well, much better now: [screenshot 2019-02-26 10-47-51] However, there is still work to be done:

- The green category is the least represented in the images, so I understand the low level of detection.
- The black category is the most abundant in the training images, so I don't understand its low level of detection.
- What the hell happened around iteration 8500?

ThorstenFalk commented 5 years ago

Whatever happened around iteration 8500, it was not very nice to your model. Probably a bad sequence of sample tiles with very similar statistics led Adam to overly increase the effective learning rate again; then a sample with different statistics dropped in, and the high learning rate displaced the network weights from the quite nicely learned configuration. It should recover, but this may happen again. Dropping the learning rate further and/or increasing the tile size might help.
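For intuition, a rough, simplified sketch (made-up numbers, a single scalar weight, not the actual solver code) of why Adam can act like an inflated learning rate after many tiles with small but similar gradients:

```python
import numpy as np

def adam_step(grads, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Magnitude of the last Adam update for a stream of scalar gradients."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    return lr * abs(m_hat) / (np.sqrt(v_hat) + eps)

# Many tiles in a row with tiny but consistent gradients ("easy" tiles):
# plain SGD would barely move, Adam still takes steps of roughly lr,
# because it rescales the gradient by its own magnitude.
print("SGD step :", 1e-4 * 1e-6)              # 1e-10
print("Adam step:", adam_step([1e-6] * 500))  # ~1e-4

# After such a stretch the weights have drifted by full-sized steps, so a
# tile with very different statistics can find the model far from its
# previously good configuration and the loss jumps.
```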

auesro commented 5 years ago

Thanks, Thorsten, I will try again with a few more training images, reducing the learning rate and increasing the tile size. If I were to run the training exactly the same but for 8000 iterations instead of 10000...would that solve it? (Total lack of Adam understanding here.)

ThorstenFalk commented 5 years ago

Training is a random process, therefore a second run will behave differently: maybe the drop does not happen, or it happens earlier or later, or there are several of them. The only way to stabilize training is to give more data per iteration by increasing the tile size, or to reduce the gradient step (learning rate). You could in theory also apply regularization, but for this you would have to manually edit the model.
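If you ever want to go down that road, one common form of regularization in Caffe is a Dropout layer added to the model prototxt; this is only a sketch, and the layer and blob names below are placeholders that would have to match the actual layer names in the model definition:

```
layer {
  name: "drop_d0c"       # placeholder name
  type: "Dropout"
  bottom: "conv_d0c"     # placeholder: output blob of an existing conv layer
  top: "conv_d0c"        # in-place, so downstream layers need no changes
  dropout_param { dropout_ratio: 0.5 }   # fraction of activations zeroed during training
}
```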

auesro commented 5 years ago

Ok, I did both: increased the tile size (732 x 732) and decreased the learning rate to 1E-5. Training is much more stable, but the detection percentage is worse: [screenshot 2019-02-27 09-05-28] Is the right way to interpret this that, given the lower learning rate and the same amount of iterations, the network did not have time to learn to perform as well as the previous (albeit unstable) version?

ThorstenFalk commented 5 years ago

These curves look much better. IoU and segmentation F1 indicate that the model has not converged yet, and the validation loss is still steadily decreasing, so longer training with these settings should further improve the results. Since both the decrease in learning rate and the increase in tile size should have a positive effect on stability, you can also try to finetune further with a learning rate of 5E-5 or even the original 1E-4 to speed up training.

auesro commented 5 years ago

Thanks for all your input and help, Thorsten. Until now, every time I started finetuning it was from scratch, i.e. from the original 2D cell net v0 you provided. Are you suggesting to further finetune my current model (switching to a higher learning rate), or to start from zero with these settings? If you are suggesting the first path, can I use the same images or would I need to produce a new set of training images?

By the way, I think I figured out what destabilized my model yesterday at iteration 8500. But first I need to ask: does the network run through the images in the order I opened them in Fiji? Because the last 4 images had inadvertently switched channels for one of the classes (Pax2 is normally in the red channel, but those 4 had it in the green one)... I guess that would dramatically affect what the network had learned until then...

ThorstenFalk commented 5 years ago

Random channel-permutation is of course not so easy to learn :). I would continue from your finetuned model with the same images.

auesro commented 5 years ago

Back reporting. I kept finetuning the previous model for another 10000 iterations, at 1E-4 with the same tile size (732x732) and the same images: [screenshot 2019-02-28 09-23-29] Things don't seem to improve much, right? Maybe starting from scratch again, keeping the 1E-5 learning rate but running for 20000 iterations, would lead to significantly better detection performance? Or is the only way to incorporate more training data?

ThorstenFalk commented 5 years ago

Well, they do improve, but the rate is quite low. I doubt that training from 2d_cell_net_v0 or from scratch will lead to better results. The problem seems to be harder than expected.