Does this only happen with the validation set, or does training also slow down without it? Training and validation run in a separate process, so ImageJ memory usage cannot affect training and validation speed (unless you run into swapping).
So I can say quite confidently: no, the slowdown is not related to ImageJ memory usage.
An increase in virtual memory during finetuning is possible due to the plot updates and because all output of the caffe process is appended to a String (so a full copy of the String is created on every append), which can cause virtual-memory overhead and a slowdown of the progress display on longer runs. Replacing the String with an extensible structure to avoid excessive re-allocations and copies is on my list.
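To illustrate the kind of change I mean (the names below are made up, this is not the plugin's actual code): a plain String append copies the whole accumulated log on every call, while a StringBuilder does not.

```java
// Illustrative only -- not the plugin's actual code.
// Appending caffe output to a plain String copies the whole accumulated
// log on every append (quadratic total work); a StringBuilder grows an
// internal buffer and appends in amortized constant time.
public class LogAppendSketch {
  public static void main(String[] args) {
    String slowLog = "";
    StringBuilder fastLog = new StringBuilder();
    for (int i = 0; i < 10000; ++i) {
      String line = "Iteration " + i + ", train loss = 0.123\n";
      slowLog += line;         // allocates a new String and copies everything so far
      fastLog.append(line);    // no full copy, occasional buffer resize only
    }
    System.out.println(slowLog.length() + " == " + fastLog.length());
  }
}
```

For a log of n appended lines, the String variant performs on the order of n² character copies in total, the StringBuilder variant on the order of n.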
Hey, thanks for the fast reply.
To replicate the issue and answer your question, I timed three more finetuning runs: two without validation and one with.
Regarding the RAM issue, I can confirm that the heavy load on the client PC disappears for runs without validation.
Does this only happen with validation set or does training also slow down without?
It seems it does not happen without a validation set: I timed how long 100 iterations took at the beginning and towards the end of the 5000-iteration runs without validation, and they always took around 14 s with my current settings.
In the finetuning replication run with validation, I again ran 5000 iterations with a validation every 100 iterations. I observed the following validation durations over the course of the run:
This does not look like the clear monotonic increase I described before... I wondered whether the 1:1 mapping required for computing IoU / F-measure takes longer or shorter depending on how many segments the network finds. If that changes over the course of training, could it explain the observed changes in duration?
Validation plot for this run:
The F1 measures (F1 segmentation and F1 detection) are both computed from a 1:1 matching between predicted segments and ground-truth segments, so their cost depends on how many segments the thresholded network output contains. If that output looks like

    0 1 0 1 0 1
    0 0 0 0 0 0
    0 1 0 1 0 1

the threshold is quite ineffective and leaves many false positives, which can be expensive in the next step. In reality this pattern can occur due to the up-convolutions, but I have only observed it in very early iterations of from-scratch training, when labels were completely inconsistent (basically random, or from completely different images), or when the number of output channels was set up for fewer classes than there are labels (which cannot be the case here, since it is set programmatically by the plugin).
For non-pathological cases, both should scale linearly with image size and approximately cubically with the number of ground-truth segments. Since your curves look reasonable, with F1 scores > 0.1, the imbalance between the number of GT segments and predicted segments cannot be that high.
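To make the scaling concrete, here is a rough sketch of the pairwise-IoU step that such a matching needs. This is not the plugin's actual implementation, and a Hungarian-style assignment is only my guess for the matching step: building the IoU matrix is a single pass over the image, while the subsequent 1:1 assignment on that matrix is what grows roughly cubically with the number of segments.

```java
// Sketch only -- not the plugin's implementation. Computes pairwise IoU
// between ground-truth and predicted segments from two label images
// (0 = background, 1..n = segment IDs). The pixel pass is linear in image
// size; a subsequent 1:1 assignment on this matrix (e.g. Hungarian-style)
// is roughly cubic in the number of segments, which is why many tiny
// false-positive segments make validation expensive.
public class IouMatrixSketch {

  static double[][] iouMatrix(int[] gt, int[] pred, int nGt, int nPred) {
    long[] gtArea = new long[nGt + 1];
    long[] predArea = new long[nPred + 1];
    long[][] inter = new long[nGt + 1][nPred + 1];

    for (int i = 0; i < gt.length; ++i) {   // single linear pass over the image
      if (gt[i] > 0) gtArea[gt[i]]++;
      if (pred[i] > 0) predArea[pred[i]]++;
      if (gt[i] > 0 && pred[i] > 0) inter[gt[i]][pred[i]]++;
    }

    double[][] iou = new double[nGt + 1][nPred + 1];
    for (int g = 1; g <= nGt; ++g)
      for (int p = 1; p <= nPred; ++p) {
        long union = gtArea[g] + predArea[p] - inter[g][p];
        iou[g][p] = union > 0 ? (double) inter[g][p] / union : 0.0;
      }
    return iou;
  }

  public static void main(String[] args) {
    // Toy 3x6 "images" (row-major): one big GT segment vs. the
    // checkerboard-like prediction from the pattern above.
    int[] gt   = {1,1,1,1,1,1, 1,1,1,1,1,1, 1,1,1,1,1,1};
    int[] pred = {0,1,0,2,0,3, 0,0,0,0,0,0, 0,4,0,5,0,6};
    double[][] iou = iouMatrix(gt, pred, 1, 6);
    System.out.println("IoU(GT 1, pred 1) = " + iou[1][1]);   // ~0.056
  }
}
```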
Hey all,
I am using the ImageJ plugin to finetune the available 2D network to my own segmentation problem.
I ran a finetuning for 10k iterations with relatively small tiles (188x396 tiles of 1024x1024 images at 0.541 microns/pixel resolution) overnight, using a remote GPU. Here are the exact settings I used:
The next morning, I noticed that the training had gotten "stuck" near iteration 1200. It seemed that the validation slowed the training down a lot: a single validation took on the order of 30 minutes, while the 20 training iterations in between took only a few seconds. This was 16 hours after I had started the training, and in that time roughly 60 validations had been computed. Assuming that preprocessing and the training iterations alone don't take much time, that gives an average validation duration of ca. 15 min. The validations I actually observed took about twice as long, so I suspected that the validation time somehow increased over the course of training. At the same time, I noticed that the RAM of my client PC was almost exclusively used by ImageJ (no other programs were running), which surprised me, because I thought most of the processing happens on the remote host with the GPUs.
At that point I aborted the training; here is the training loss / validation time course:
To reproduce this issue, I ran a finetuning with almost the same images/settings, except that I used slightly larger tiles (396x220) and fewer validations to speed up the preprocessing etc. Exact settings:
I monitored the RAM usage of ImageJ during the whole preprocessing and finetuning (using the Monitor Memory function under "Plugins/Utilities"), and the global CPU/RAM use of the client PC (using my gnome system monitor). My ImageJ memory is limited to 4.5 GB (under Edit/Options/Memory&Threads).
I noted that during preprocessing, ImageJ RAM usage stayed below 200 MB at all times. When finetuning started, ImageJ RAM usage slowly grew over the first 1k iterations to a peak of 3.8 GB. When it hit the limit, the used memory somehow "reset" back to a few hundred MB and then fluctuated irregularly between 300 and 2000 MB. At the same time, the global RAM display (starting at ~2 GB) went up to 6 GB during the first 1k iterations and did not reset when the ImageJ-internal RAM display "reset".
I then timed the validations manually and noted that they took increasingly longer (each timed validation took about 1 s longer than the one before). The training iterations between two validations always seemed to take the same time.
I tried using ImageJ's garbage collection (clicking on the status bar is supposed to release unused memory to the OS) to speed up the validations again. Clicking repeatedly led to a visible release of RAM, but it was instantly filled up again:
A single click had a much smaller effect on RAM:
With respect to reducing the validation duration, multiple garbage collections worked to a certain degree: I observed a decrease of the validation duration from >30 s to ca. 20 s, but after staying at 20 s per validation for a while, the durations started to increase again. I tried this several times (triggering the garbage collection and then timing the validations), with similar results.
This seemed to fit the observation from my first case (increasingly slow validations), although the increase was not as steep as I had expected, and it also points to some connection with the (for me unexpectedly) heavy use of client-PC RAM.
My final questions are: Do you think this RAM issue is related to why my validations were so slow in the first place? If so, is there a way to integrate regular garbage collection into the UNET routine, or to improve the memory handling in some other way?
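By "regular garbage collection" I mean something along the lines of the sketch below: a background timer that periodically suggests a GC, roughly the programmatic equivalent of clicking the status bar. This is just an illustration with a plain java.util.Timer, not something I have wired into the plugin.

```java
// Hedged sketch of "regular garbage collection": a background timer that
// periodically suggests a GC, roughly what clicking the ImageJ status bar
// does. Whether this would actually shorten the validations is exactly the
// open question above.
import java.util.Timer;
import java.util.TimerTask;

public class PeriodicGcSketch {
  public static Timer start(long periodMillis) {
    Timer timer = new Timer("periodic-gc", true);   // daemon thread
    timer.scheduleAtFixedRate(new TimerTask() {
      @Override public void run() {
        System.gc();                                 // hint only; the JVM may ignore it
      }
    }, periodMillis, periodMillis);
    return timer;
  }

  public static void main(String[] args) throws InterruptedException {
    Timer t = start(30_000);   // e.g. every 30 s during finetuning
    Thread.sleep(120_000);     // keep the demo alive for two minutes
    t.cancel();
  }
}
```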
Thanks in advance for all kinds of suggestions and for reading this potentially over-detailed post ;) Cheers, Jan
EDIT: In case that is of relevance, I also include:
Training course plots for the finetuning session during which I timed the validations and monitored the RAM usage:
RAM usage after finetuning was over:
logfile: