ivadomed / ms-lesion-agnostic

Deep learning contrast-agnostic tool for MS lesion segmentation in the spinal cord
MIT License

Training Results for YOLO Lesion Detection #8

Open cspino opened 4 months ago

cspino commented 4 months ago

I was able to train the YOLOv8 model on the Canproco database (version: bcd627ed4):

To track training progress, I used ClearML since it is easily integrated with the ultralytics yolo package.

Scripts

The yolo_training.py script is used to train a new model and the yolo_testing.py script is used to evaluate the model's performance on the test set.

Results

Test 1 - Default

All default parameters were used and mosaic data augmentation was turned off.

epochs: 150
batch: 16
lr0: 0.01 # initial learning rate
lrf: 0.01 # final learning rate
optimizer: auto (resolved to AdamW)
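
For reference, a minimal sketch of what this configuration looks like with the ultralytics Python API; the dataset yaml name and the nano weights are placeholders, not the exact call used in yolo_training.py:

```python
from ultralytics import YOLO

# Placeholder names: "canproco_yolo.yaml" and the nano weights are assumptions for illustration
model = YOLO("yolov8n.pt")
model.train(
    data="canproco_yolo.yaml",  # dataset config listing train/val images and class names
    epochs=150,
    batch=16,
    lr0=0.01,          # initial learning rate
    lrf=0.01,          # final learning rate factor (final LR = lr0 * lrf)
    optimizer="auto",  # resolved to AdamW for this run
    mosaic=0.0,        # mosaic augmentation turned off
)
```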

Here are the metrics used to track the training process:

Here were the results:

| Recall | mAP50 | mAP50-95 | True positives | False negatives | False positives |
|--------|-------|----------|----------------|-----------------|-----------------|
| 0.163  | 0.174 | 0.0487   | 80             | 281             | 21              |

Labels / Predictions

Seeing these results, my first thought was that the contrast of the images needed to be enhanced to make the lesions more visible. So I tried adding a histogram equalization step for my next test.

Test 2 - With histogram equalization

For this test, when pre-processing the data, I used skimage's adaptive histogram equalization before saving each slice as a png.
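
Roughly, the equalization step looks like the sketch below (the function name and I/O details are illustrative; the actual pre-processing script may window and normalize differently):

```python
import numpy as np
from skimage import exposure, io

def save_equalized_slice(slice_2d: np.ndarray, out_path: str) -> None:
    """Apply adaptive histogram equalization (CLAHE) to a 2D slice and save it as a PNG."""
    # equalize_adapthist expects float values in [0, 1]
    lo, hi = float(slice_2d.min()), float(slice_2d.max())
    normalized = (slice_2d - lo) / (hi - lo + 1e-8)
    equalized = exposure.equalize_adapthist(normalized, clip_limit=0.01)
    # Convert to 8-bit grayscale for the PNG output
    io.imsave(out_path, (equalized * 255).astype(np.uint8))
```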

Training parameters were kept the same.

Here is the training progress:

And here were the results:

| Recall | mAP50 | mAP50-95 | True positives | False negatives | False positives |
|--------|-------|----------|----------------|-----------------|-----------------|
| 0.28   | 0.247 | 0.0735   | 86             | 275             | 21              |

Labels / Predictions

Thoughts and next steps

For testing, the IoU (intersection over union) parameter was set to its default value (0.7), which I believe means that only predictions with an IoU above 0.7 were considered true positives. This might explain the discrepancy between the low metrics and visual results (although visually, other test batches did seem to have fewer correct detections than the one shown above).

I can think of 2 main reasons why the model isn't performing as well as it could:

  1. The lesions are hard to see
  2. Over half of my training set consists of slices with no lesions (1155/2112), i.e. a class imbalance

So for my next tests, I want to start by seeing how the IoU parameter influences metrics during testing. Then, I want to try reducing the number of empty slices in my training set. As for the contrast, the histogram equalization did seem to slightly improve results, but I'm not sure whether there's a better method to improve contrast.
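
For the empty-slice reduction, the idea would be something like the sketch below (paths, and the assumption that unlabeled slices simply have no YOLO .txt label file, are hypothetical):

```python
import random
from pathlib import Path

def subsample_empty_slices(image_dir: str, label_dir: str,
                           target_empty_ratio: float = 0.25, seed: int = 0) -> list[Path]:
    """Return a training image list where lesion-free slices are subsampled to a target ratio."""
    images = sorted(Path(image_dir).glob("*.png"))
    labeled = [p for p in images if (Path(label_dir) / f"{p.stem}.txt").exists()]
    labeled_set = set(labeled)
    empty = [p for p in images if p not in labeled_set]

    # Keep just enough empty slices so they make up `target_empty_ratio` of the final set
    n_empty_keep = int(len(labeled) * target_empty_ratio / (1 - target_empty_ratio))
    random.Random(seed).shuffle(empty)
    return labeled + empty[:n_empty_keep]
```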

jcohenadad commented 3 months ago

Great preliminary results! As you said, lesions are hard to see so I am not surprised that some are missed. But keep in mind that this is a 2D model, and if a lesion is not seen on one sagittal slice, it might be seen on the adjacent slice, which will still be useful for creating the 3D box that will be used to crop around the lesion and then run the segmentation model on a smaller 3D patch.

cspino commented 3 months ago

Here is a recap of the tests that I've tried and their results since my last update:

The new validation method was used to compare results (see issue #11)
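
The gist of that validation method (recall/precision from one-to-one matching of predicted and ground-truth boxes at a given IoU threshold) is roughly the sketch below; the actual implementation is the one discussed in issue #11:

```python
import numpy as np

def box_iou(a, b) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def recall_precision(preds, labels, iou_thr=0.4):
    """Greedily match each prediction to an unmatched label; count TP/FP/FN."""
    matched, tp = set(), 0
    for p in preds:
        candidates = [(box_iou(p, l), i) for i, l in enumerate(labels) if i not in matched]
        if candidates:
            best_iou, best_idx = max(candidates)
            if best_iou >= iou_thr:
                matched.add(best_idx)
                tp += 1
    fp, fn = len(preds) - tp, len(labels) - tp
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision
```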

Tests

1- Ratio of unlabeled slices in the train set

I trained two nano models with the same parameters, but one model was trained on the full training dataset (∼55% unlabeled) and the other model was trained on a dataset containing a 25% unlabeled ratio.

| Unlabeled ratio | Recall (IoU 40%) | Precision (IoU 40%) | Recall (IoU 20%) | Precision (IoU 20%) |
|-----------------|------------------|---------------------|------------------|---------------------|
| 55%             | 27.3%            | 47.3%               | 37.2%            | 64.5%               |
| 25%             | 32.8%            | 40%                 | 43.3%            | 52.9%               |

Recall increased, but precision decreased. Since the precision is much higher than the recall, I chose to keep working with the 25% dataset.

2- Confidence threshold at inference

I compared different confidence thresholds using the nano model trained on the 25% unlabeled dataset.

| Confidence | Recall (IoU 40%) | Precision (IoU 40%) | Recall (IoU 20%) | Precision (IoU 20%) |
|------------|------------------|---------------------|------------------|---------------------|
| 5%         | 32.1%            | 24.7%               | 47.7%            | 36.9%               |
| 10%        | 32.8%            | 40%                 | 43.3%            | 52.9%               |
| 20%        | 31.7%            | 64.6%               | 36.5%            | 74.3%               |

These results seem to show that many lower-confidence boxes are correct, but since those lesions are harder to detect, the boxes may be less precise, which would explain why the recall varies more at the 20% IoU threshold. As expected, a lower confidence threshold also reduces the precision.
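
For context, the confidence threshold is just an inference-time argument, so comparing thresholds only requires re-running prediction on the saved weights (paths below are placeholders):

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # placeholder path to the trained weights

# Same test images, three confidence thresholds; everything else left at its default
for conf in (0.05, 0.10, 0.20):
    results = model.predict(source="datasets/canproco/test/images", conf=conf, verbose=False)
    n_boxes = sum(len(r.boxes) for r in results)
    print(f"conf={conf:.2f}: {n_boxes} predicted boxes")
```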

3- Model depth

The ultralytics library has 5 different model depths available: nano (n), small (s), medium (m), large (l), extra-large (x). I compared n, m and x on the 25% unlabeled dataset:

| Model   | Recall (IoU 40%) | Precision (IoU 40%) | Recall (IoU 20%) | Precision (IoU 20%) |
|---------|------------------|---------------------|------------------|---------------------|
| YOLOv8n | 32.8%            | 40%                 | 43.3%            | 52.9%               |
| YOLOv8m | 36.9%            | 35.6%               | 48.5%            | 47.2%               |
| YOLOv8x | 37.2%            | 39.8%               | 47.4%            | 51.3%               |

The x model seems to perform the best, although the m model has similar results and is much faster for training and inference.

4- Hyperparameter sweeps

I used the integrated tune mode for these sweeps. This mode seems to be supported only for nano models, which isn't ideal considering the results were better with the larger models. But rather than re-implementing another sweep method, I chose to use the nano model and then try applying the final parameters to one of the larger models.
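
For reference, the integrated tune mode is invoked roughly as in the sketch below (the dataset yaml and iteration counts are placeholders):

```python
from ultralytics import YOLO

# tune() runs many short trainings, mutating hyperparameters between iterations
model = YOLO("yolov8n.pt")
model.tune(
    data="canproco_yolo.yaml",  # placeholder dataset config
    epochs=30,                  # shorter runs per iteration to keep the sweep tractable
    iterations=50,              # number of mutated hyperparameter sets to try
    plots=False,
    save=False,
    val=True,
)
```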

Learning rate

I found that an initial learning rate of 0.09 and a final learning rate of 0.08 performed the best.

Class and box loss

The box and cls parameters dictate the relative importance of the box and class losses. I used the best learning rates from the last sweep for these runs.

It seems that weighting the box loss higher relative to the class loss yielded the best results. The run with the highest recall had box=15.65 and cls=4.06.
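
For context, these loss weights are ordinary training arguments (the ultralytics defaults are box=7.5 and cls=0.5); a short sketch with a placeholder dataset yaml:

```python
from ultralytics import YOLO

# lr0/lrf carried over from the learning-rate sweep above; box/cls from the best run of this sweep
YOLO("yolov8n.pt").train(
    data="canproco_yolo.yaml",
    epochs=150,
    lr0=0.09, lrf=0.08,
    box=15.65, cls=4.06,
)
```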

Augmentation: scale and rotation (degrees)

This sweep wasn't very conclusive; two runs with very similar parameters could lead to very different metrics:

Other

I also ran a sweep with a bunch of different parameters to see if I could land on a particularly successful run.

lr0, lrf, fliplr, translate, hsv_v, degrees, scale

I'm not too sure how to interpret the results, but it seems like the best runs had less augmentation.

Best models

From the different parameter search results, I tried applying this set of parameters to a medium-sized model:

lr0 = 0.09
lrf = 0.08
fliplr = 0.25
translate = 0.25
hsv_v = 0.45
degrees = 10
scale = 0.5

I also validated the model from one of the runs with the highest recall; those two models gave very similar metrics. The results are also quite similar to the ones for the extra-large model (with default params):

| Model               | Recall (IoU 40%) | Precision (IoU 40%) | Recall (IoU 20%) | Precision (IoU 20%) |
|---------------------|------------------|---------------------|------------------|---------------------|
| YOLOv8m best params | 36.9%            | 42.7%               | 46.7%            | 54.4%               |
| YOLOv8x             | 37.2%            | 39.8%               | 47.4%            | 51.3%               |

It would be interesting to train an x model on the new set of params; not sure why I haven't done that yet.
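
For reference, applying that set of parameters to a larger model is just a matter of swapping the weights file; a sketch (the dataset yaml is a placeholder, and the argument names are standard ultralytics train() hyperparameters):

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")  # or "yolov8x.pt" for the extra-large variant
model.train(
    data="canproco_yolo.yaml",
    epochs=150,
    lr0=0.09, lrf=0.08,
    fliplr=0.25, translate=0.25, hsv_v=0.45, degrees=10, scale=0.5,
)
```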

Here are a few images of the results obtained with the YOLOv8x model (red boxes are labels and blue boxes are predictions):

sub-mon211_ses-M0_PSIR: 7 TP, 2 FP, 1 FN

This is one of the images with the most false negatives. There seem to be many small lesions that weren't detected: sub-tor035_ses-M0_PSIR: 5 TP, 1 FP, 11 FN

This is one of the images that had the biggest difference in TP, FP and FN between a 20% and 40% IoU threshold: sub-van212_ses-M0_PSIR: 3 TP, 5 FP, 6 FN with 40% IoU and 5 TP, 2 FP, 4 FN with 20% IoU

plbenveniste commented 3 months ago

Great summary of your results! I am pretty surprised that you got better results with less augmentation. What type of augmentation did you try?

Also, for the final model, if you have time, it could be nice to have a PR-curve (precision-recall curve) and also the PR-AUC (precision-recall area-under-the-curve) score.

cspino commented 3 months ago

In the sweep with many parameters, I had left-right flipping, translation, value (image intensity), rotation and scaling. I'm not so sure I came to the correct conclusion when analyzing that sweep's results, though... It might be worth doing sweeps with fewer augmentation parameters at a time to be able to actually see the effect of each augmentation type. But then again, I did a sweep for just scale and rotation, and those results weren't conclusive either. Perhaps they don't have a significant effect.

cspino commented 2 months ago

I calculated the PR-AUC for the models that seemed to perform the best:

| Model               | PR-AUC (IoU 40%) | PR-AUC (IoU 20%) |
|---------------------|------------------|------------------|
| YOLOv8n loss_sweep8 | 0.131            | 0.234            |
| YOLOv8m best params | 0.111            | 0.206            |
| YOLOv8x best params | 0.110            | 0.223            |
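
For reference, once precision/recall pairs have been computed at a sweep of confidence thresholds (using the IoU-matched counts from the validation method in issue #11), the PR-AUC is just the area under those points; a sketch with a hypothetical helper:

```python
import numpy as np

def pr_auc(precisions: np.ndarray, recalls: np.ndarray) -> float:
    """Trapezoidal area under a precision-recall curve sampled at several confidence thresholds."""
    order = np.argsort(recalls)  # integrate along increasing recall
    return float(np.trapz(precisions[order], recalls[order]))
```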

The nano model obtained during my loss sweep (loss_sweep8) has the highest score. Here were its training parameters:

box = 15.65
cls = 4.06
lr0 = 0.09
lrf = 0.08
fliplr = 0.25
translate = 0.25
hsv_v = 0.45
degrees = 10
scale = 0.5

I took the parameters from that run and did a new parameter sweep with the degrees and translate parameters. The best run from that sweep (transform_sweep5) had the following parameters:

translate = 0.453
degrees = 12.55

And the PR-AUC increased slightly with a 20% IoU:

| Model                    | PR-AUC (IoU 40%) | PR-AUC (IoU 20%) |
|--------------------------|------------------|------------------|
| YOLOv8n loss_sweep8      | 0.131            | 0.234            |
| YOLOv8n transform_sweep5 | 0.129            | 0.252            |

Here are the PR curves for the transform_sweep5 model with 0.4 and 0.2 IoU thresholds (40% IoU / 20% IoU):

Here are a few results (red boxes are labels and blue boxes are predictions):

sub-van151_ses-M0_PSIR

sub-tor135_ses-M0_PSIR

sub-mon209_ses-M12_PSIR