lmb-freiburg / Unet-Segmentation

The U-Net Segmentation plugin for Fiji (ImageJ)
https://lmb.informatik.uni-freiburg.de/resources/opensource/unet
GNU General Public License v3.0

Question: Many channel learning #20

Closed BolekZapiec closed 5 years ago

BolekZapiec commented 5 years ago

I'm not sure if this is a question better posed here or on image.sc so please just let me know if I should post it there instead.

So I have 3-color ISH tissue images with cells labeled (fairly brightly) in either red, green, or blue. After finetuning the 2D cell net with an increasing amount of data, it sometimes does a very good job, but sometimes seems to miss some obvious cells. The detection plateaus at about 60%, and throwing another 10k epochs and another set of images at it doesn't seem to improve it much.

I made a script to add some texture channels for each channel in the hope of improving what the U-Net can learn from. Basically I added a Laplacian of Gaussian, a Gaussian gradient magnitude, and a Difference of Gaussians per channel, making the final image 4x3 = 12 channels (RGB intensity + three texture channels per intensity channel), but the U-Net doesn't seem to be terribly impressed with it.
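For reference, a minimal sketch of the kind of texture channels described above, using scipy.ndimage; the sigma and the stacking are my own illustrative guesses, not the values or script actually used:

```python
import numpy as np
from scipy import ndimage as ndi

def texture_channels(channel, sigma=2.0):
    """Return LoG, Gaussian gradient magnitude and DoG maps for one intensity channel."""
    channel = channel.astype(np.float32)
    log = ndi.gaussian_laplace(channel, sigma)                 # Laplacian of Gaussian
    ggm = ndi.gaussian_gradient_magnitude(channel, sigma)      # Gaussian gradient magnitude
    dog = ndi.gaussian_filter(channel, sigma) - ndi.gaussian_filter(channel, 2 * sigma)  # DoG
    return log, ggm, dog

# rgb: (H, W, 3) array -> intensity + 3 texture maps per channel = 12 channels
# stacked = np.dstack([np.dstack((c,) + texture_channels(c))
#                      for c in np.moveaxis(rgb, -1, 0)])
```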

When I attempted to finetune the original 2D U-Net, it seemed to have quite poor performance even after 20k epochs (tile ~700 px, learning rate 1E-4, interval of 20); basically it didn't seem to be learning anything. I then tried to finetune my already finetuned 2D cell net for detecting the 3 cell types, but it has appeared stuck at "splitting color channels" (7% progress) for almost a day now.

Does the U-Net plugin support this many channels? Is there a better way to add additional information like texture than what I'm trying?

ThorstenFalk commented 5 years ago

The detection rate is the harmonic mean of precision and recall, so 0.6 may indicate missed cells, extra segments, or cell merging. If your cells are touching, 0.6 is a fair result, probably stemming from cell merging (it could still be better); for isolated cells it is definitely poor.
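As a quick worked example of that harmonic mean (my own illustration, not code from the plugin):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (the detection F1 score)."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# e.g. a precision of 0.75 combined with a recall of only 0.5 already yields 0.6
print(f1(0.75, 0.5))  # 0.6
```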

U-Net was not impressed by the additional channels because, if they are useful, U-Net learns the corresponding filters anyway; you only spared it the necessity to do so. Gradient magnitude may be an exception because it is not as easy to learn as the linear filters. As long as the receptive field of the network is sufficient, it will learn good filters for the task. However, if your images contain long-range dependencies, i.e. pixels in the upper left corner are required to explain pixels in the lower right, you must introduce these dependencies encoded as new channels. There is no restriction on the number of channels besides memory.

When you say you trained for 20k epochs, do you mean you trained for 20k * "number of training images" iterations? Can you provide an example training image, ideally the one the plugin got stuck at? "Splitting color channels" should take essentially no time; it simply turns the RGB stack into a hyperstack with 3 channels, exactly what happens when clicking Image > Color > Make Composite. It is very strange that the plugin got stuck at this operation.
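If it helps to reproduce that step manually, this is roughly the equivalent operation as a Fiji/Jython snippet (my sketch of the manual workaround, not the plugin's actual code):

```python
from ij import IJ

imp = IJ.getImage()                 # currently active RGB image
IJ.run(imp, "Make Composite", "")   # same as Image > Color > Make Composite
```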

BolekZapiec commented 5 years ago

The detection challenge seems quite a bit easier than the one from the demo, which is why I'm surprised that I'm stuck at 60%. I believe a 10-year-old could be trained to do the task with >90% performance, which is why I feel like I'm the one failing as a good teacher. E.g. see the image attached: the cells are either red, green, or blue and located in a somewhat distinct tissue layer (assuming U-Net successfully picks up on the context aspect as well). While the false positives (like the 2 small red cells at the top of the attached image) are understandable, some of the red and blue cells missed below are inexplicable to me.

image

Regarding the multi-channel image, I looked more carefully, and at the moment it was stuck at "splitting color channels" there was actually an error message I had missed: "file not in a supported format, a reader plugin is not available, or it was not found". It then references one of the temp h5 files (unet-######). Not sure why, since I had previously run it with the same set of images to finetune a different model and it passed the initial step without this error. Unfortunately I can't retrieve this temp file now, but if it's important I can try to replicate the error and copy the file before it gets auto-deleted.

Based on your last reply I made a new set of images with less texture data (only intensity, "find edges", and DoG) and attempted to finetune a model that already had 0.6 performance on the RGB images. After 9000 iterations (sorry for referring to them as epochs earlier) of noisy learning it stopped with a Hungarian algorithm error saying that all matrix elements have to be non-negative (see attached). This is the error I typically get if I set the learning rate below 1E-4, but in this case it appeared even though the rate was 1E-4. I suppose this fits well with what you had described about lacking momentum to stabilize the gradient; I guess the additional channels of texture info I tried to provide did more harm than good.

learningwithtexture

Aside from more training images, is there anything else I can do to try to improve performance above the current plateau of 0.6?

ThorstenFalk commented 5 years ago

I would not dare to promise 0.9 even for a trained expert, but I agree, it should be better than 0.6.

Can you please color the markers in the shown result differently for the different classes, to distinguish detection from classification errors? According to IoU, every other iteration no blue marker is correctly identified. Since detection F1 does not drop to 0 for these iterations, it must be a localization problem. How precise are the annotations with respect to location? Can you provide an example image with training labels?

The current implementation of the detection pipeline assumes very precise marker locations. I will have to expose another parameter that lets the user also enter the localization uncertainty; currently it is hard-coded to 3 pixels, which is very strict and would explain most of the observed problems.
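To illustrate how such a localization tolerance enters the detection score, here is a hedged sketch of matching predicted and annotated points with the Hungarian algorithm (scipy's linear_sum_assignment); the 3-pixel default mirrors the hard-coded value mentioned above, but the plugin's actual implementation may differ:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def detection_f1(pred, gt, tol=3.0):
    """pred, gt: (N, 2) arrays of (x, y) marker positions; tol: localization tolerance in pixels."""
    if len(pred) == 0 or len(gt) == 0:
        return 0.0
    cost = cdist(pred, gt)                  # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)
    tp = np.sum(cost[rows, cols] <= tol)    # only matches within the tolerance count as hits
    precision = tp / len(pred)
    recall = tp / len(gt)
    return 2 * precision * recall / (precision + recall) if tp else 0.0
```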

One of the most interesting detection results for me is the dividing cell slightly above the middle of the image. The network predicted two nuclei on opposite sides of the metaphase plane; what did you give the network as annotations?

BolekZapiec commented 5 years ago

I'll admit 0.9 might be a bit of an exaggerated expectation but I'm glad you appreciate why I'm striving for improvement :-)

Sorry for not having the markers colored correctly last time around; below they've been marked with the correct color. My first attempt to apply U-Net to these was through segmentation. The segmentation results were ok, but a quick test seemed to have detection perform somewhat better. So the way I generated the training datasets was actually to repurpose the segmentation datasets I had first generated: I took a table of all the segmentation ROIs and calculated the center, basically (X, Y) + (Width/2, Height/2). These I then loaded as ROIs and made into point overlays for training. See below two image segments of what these looked like.

train1 train2

I've trained more since generating that previous image; result below.

detected1

Though blue and red seem improved, green still seems to be missing obvious cells it was trained on.

detected2

I suppose this is the result of where the detection step decides to cut off false positive/negative cells? Would moving the points in my training data be helpful? Should I center them in a given signal blob rather than at the cell center (soma), which actually might have weaker signal than the surrounding cytosol?

ThorstenFalk commented 5 years ago

First of all, wow, you got segmentation ground truth for these data? This must have taken a while. I would have used the center of gravity of the masks instead of the center of the bounding boxes, but this is not the problem, I guess.
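For illustration, a minimal sketch contrasting the two point definitions for a single binary cell mask (bounding-box center versus center of gravity), assuming NumPy/SciPy arrays rather than Fiji ROIs:

```python
import numpy as np
from scipy import ndimage as ndi

def bbox_center(mask):
    """Center of the bounding box of a boolean mask, as (x, y)."""
    ys, xs = np.nonzero(mask)
    return (xs.min() + xs.max()) / 2.0, (ys.min() + ys.max()) / 2.0

def mass_center(mask):
    """Center of gravity of the mask, as (x, y)."""
    cy, cx = ndi.center_of_mass(mask)
    return cx, cy
```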

It might be because my eye is not able to see blue as well as the other colors, but aren't there many false positives in the blue annotation? There is certainly one false negative :).

BolekZapiec commented 5 years ago

The segmentation for the ground truth was done by hand in Fiji, using the annotation tools; each cell was segmented as a separate object. This was the case for the first 5 images, which are each as big as, or 2-4x bigger than, the one shown below. After I switched to detection, I added another 2 images, each about 5x the size of this image, but with the cells marked as points by hand from the beginning. The training/validation images are thus 1870x1338 up to 9500x8800 pixels (7 in total, 5 converted from segmentation).

trainsegmentation1

The blue signal is indeed somewhat weaker than the others, but when looking at it with the contrast turned up, I'd say there shouldn't be too many errors. I do see the false negative(s) in the point-detection training that you point out; this might be due to the automated way in which I converted the segmentation annotation into the detection annotation. I guess the next step would be for me to re-analyze the whole training dataset to correct any FPs/FNs? What would be the best strategy for tagging a given cell: dead in the middle of what I'd call a cell, or is there some strategically better point to mark? Should I follow up with a finetune of my finetuned model, or start from the original 2D cell net to ditch any bad things the network has learned? I assume that when you mentioned the network will learn whatever linear filters it needs, this includes relationships between channels (like a cell being only Green+ but not R+ or B+), meaning the only real data I'd need to feed it is the raw RGB image plus training data without any mistakes?

ThorstenFalk commented 5 years ago

Ah, from the segmentation ROIs it becomes clearer. The one "false negative" is actually not a false negative: the very bright blue spot is surrounded by a large area of weaker blue signal, so the center of the bounding box is offset from the signal peak.

Ideal annotation points don't allow for any localization inaccuracy, i.e. they are point structures in the image, like centers of spherical objects, corners, line intersections (2D) or plane intersections (3D), or the like. For cells with round nuclei I usually choose the center of the nucleus; if I know the nucleus contains exactly one nucleolus, I pick the center of the nucleolus (but there are always exceptions to this rule, so I usually have to apply a backup rule for nuclei showing no nucleolus due to cell cycle phase or differentiation state). For neurons the somata are definitely the most prominent unique feature, but sometimes the soma is not visible in the recorded 2D slice, and then the question is what to do. If you annotate the visible dendrites, you risk multi-detections in other dendritic structures; if not, you might lose valuable information. That's the curse of learning algorithms: you have to teach what you want, and for this you first have to set up clear annotation rules for yourself.

Regarding the next steps, yes, I would first review the existing annotations. Simply inputting raw RGB and ROIs should be fine for training and correlations between channels and labels should be easily learned.

BolekZapiec commented 5 years ago

Thanks for your advice. I went through all the training data and systematically tried to perfect the points used for training the detection. The systematic selection of marked cells seemed to give a ~10 percentage point improvement in detection. I'm still hopeful for more improvement, but 10 percentage points is already something quite good. Not sure if it's clear from the training image below, as the bulk of the improvement happens "right away" in the first few iterations.

snippingposttrainpositionimprovement-result-18febmodel

I had mentioned "context" earlier in the sense of all the actual cells being located in one region of the tissue (the epithelium). Would it make sense to first train the algorithm to recognize this tissue segment, so that the U-Net learns that the cells are always in an area with a high probability of being "epithelium"? Or would it be better to pre-segment the epithelium as a separate segmentation task and include the weights as a 4th channel? Basically I'm curious whether the contextual info of a segmentation class that isn't used in the detection would contribute to the detection choice, or whether it would be better to force this information into the task by making it a whole channel.

ThorstenFalk commented 5 years ago

Additional related tasks can indeed help networks to focus on relevant information. It would be an interesting experiment to compare the mentioned modes of supervision. So, I encourage you to try it out, I have no clear favorite but would be interested in the outcome.

BolekZapiec commented 5 years ago

Thanks for your encouragement; I've tried it out. My testing seems to indicate that it's better to add the contextual information by first pre-segmenting the tissue of interest, masking it, and then running the second segmentation/detection step independently, particularly as this let me do some morphological operations on the resulting mask before masking the RGB image for the next step.
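For concreteness, a minimal sketch of that two-step idea under my own assumptions (the `epithelium_mask` below stands in for the output of the first U-Net pass, and the morphological operations and iteration counts are illustrative):

```python
import numpy as np
from scipy import ndimage as ndi

def mask_rgb(rgb, epithelium_mask, iterations=5):
    """rgb: (H, W, 3) image; epithelium_mask: boolean (H, W) from the first segmentation pass."""
    cleaned = ndi.binary_closing(epithelium_mask, iterations=iterations)  # fill small holes
    cleaned = ndi.binary_opening(cleaned, iterations=iterations)          # remove small speckles
    return rgb * cleaned[..., None]   # zero out everything outside the epithelium
```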

In the process I did encounter a few more questions about the data the plugin expects, which I was hoping you might be willing to enlighten me on:

1- For multi-channel 2D images, does it matter if it's a composite or RGB? Does it matter if some images (training or testing) are 8-bit and others 16-bit? In other words, if I train the model with a 3x 16-bit RGB composite, can I later apply it to a different format and have it work?

2- For a 2D multi-channel image, does it matter if the ROIs are on a single channel (ROI positions as they become relevant in 3D), or does the plugin ignore this aspect? In other words, if I put an ROI only on the green channel, will it learn only from the info in that channel, or will it still consider the other channels when building the model?

3- For a 3D segmentation, if the data is not isotropic (e.g. 1µm x 1µm x 5µm), is this a problem? I see in the supplemental data that dense 3D ground-truth segmentation isn't so critical for performance, and that sparse training with selected sections here and there gives nearly as good results as dense segmentation training. Does this mean that 3D voxel relationships aren't critical, i.e. that voxels are treated as isotropic extensions of the xy pixels?

4- For generating 3D segmentation data, it's stated that the labels should be instantiated with a #, e.g. cell#1, cell#2. Is this supposed to result in a 3D segmentation of individual object instances (not requiring morphological separation afterwards)? Is there any sample data with this working available anywhere?

Sorry for piling all these into a reply on this thread - let me know if you'd rather I split these into 4 new posts instead.

ThorstenFalk commented 5 years ago

Interesting questions:

  1. RGB images are converted to 3 channels automatically. The color information itself is not used, but the channel order is important and must be the same as during training. Warning: if your training images had three channels, the first blue, the second green, and the third red, an RGB image looking the same will be converted to first red, second green, third blue, which is a channel permutation and will not be recognized. Each channel is independently min-max normalized (see the sketch after this list). Changing the bit depth is not ideal for the models, but they should still produce reasonable outputs because the channels are min-max normalized and treated as floating-point values. There are subtle differences in discretization that a human does not notice but the network could, so if you have images with different bit depths, ideally train on examples of both.
  2. It does not matter in which channel you place the ROIs; the intensities of all channels at a specific spatio-temporal location are treated as one piece of information. If you want to train a model that can cope with missing channels, you will have to set the missing channels to zero explicitly (beware of division by zero during normalization; I have to check whether I considered this case).
  3. Anisotropy is not a direct problem, and you have different ways of approaching it. Usually I use 2D-only operations at the highest resolution until voxels are approximately isotropic and then move on with 3D operations. You could also change the sampling during finetuning (by changing the process element size), so that the volume is rescaled to isotropic voxels before it is fed to the network. The latter has the advantage that delineation of different objects in the z-direction becomes easier, because that resolution is usually increased, but the architecture is heavier, because you have to use 3D operations throughout. Avoid out-of-plane rotations for augmentation though, because they produce unrealistic images.
  4. The final segmentation is still only a semantic segmentation, I am sorry; you will still need morphological operations. The instance labels are important to generate background ridges between the instances, so that the network at least has a chance to generate separated binary masks that allow connected component labeling.
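Regarding the normalization mentioned in points 1 and 2, here is a minimal sketch of per-channel min-max normalization with a guard for the zero-range case (e.g. an all-zero channel); it only illustrates the idea and is not the plugin's actual code:

```python
import numpy as np

def normalize_channels(stack):
    """stack: (C, H, W) array; returns each channel scaled to [0, 1] as float32."""
    out = np.zeros(stack.shape, dtype=np.float32)
    for c, channel in enumerate(stack.astype(np.float32)):
        lo, hi = channel.min(), channel.max()
        if hi > lo:                      # guard against division by zero for constant channels
            out[c] = (channel - lo) / (hi - lo)
    return out
```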