I have no idea about the divergence.
@ZwwWayne a further update to this issue.
When I use the same training dataset, and seed for image selection, but use a ResNeXt101 based backbone, as opposed to the previously mentioned ResNet architectures, I appear to get more stable training.
The above image shows various models using ResNet50 and ResNet101 architectures. All but one of the ResNet models used the standard learning rate of 0.02, scaled by a 1/8 factor (0.0025) because I am using only a single GPU (following the linear scaling rule of Goyal et al. (2018)). The orange model (ResNet50), which eventually fails with NaN loss at ~42k mini-batches, and the ResNeXt101 model (grey), which does not diverge, show the same loss convergence; both had a learning rate of 0.01 (0.00125).
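For clarity, a minimal sketch of the linear scaling rule arithmetic quoted above (the numbers are the ones from this thread):

```python
# Linear scaling rule (Goyal et al.): scale the reference learning rate
# by the ratio of the actual mini-batch size to the reference one.
ref_lr, ref_batch = 0.02, 16      # default: 8 GPUs x 2 images per GPU
actual_batch = 2                  # single GPU x 2 images per GPU
lr = ref_lr * actual_batch / ref_batch
print(lr)                         # 0.0025
```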
This sometimes indicates that your data might not be very clean, so the model simply blows up on some cases. There are several things you could try:

1. Add gradient clipping to restrict the gradient norm to a fixed value, e.g. 35 (a config sketch follows below). You can then observe the gradient norm at each iteration, which may give you some hints about where the loss goes NaN.
2. Use a smaller learning rate; sometimes these hyperparameters need to be tuned.
3. Check the data by visualization. You need to make sure there are no issues (e.g., out-of-range boxes, zero-area boxes) in the annotations and data.
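For point 1, a minimal sketch of how gradient clipping can be enabled through MMDetection's optimizer config, using the max_norm value suggested above:

```python
# Clip gradients to a maximum L2 norm of 35; with grad_clip enabled,
# the gradient norm is logged each iteration, which helps locate the
# point where the loss goes NaN.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
```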
@ZwwWayne thanks very much for the suggestions, these are very helpful.
I am glad to say that my investigations and thoughts are along the lines you have suggested (bar the gradient clipping idea). I have suspected that this is either a data issue or a gradient stability issue (which could of course be a manifestation of the data issue; the two are not mutually exclusive).
With regards to the data, I have made an effort to ensure that no label data has points outside the image frame; this is something I experienced as an earlier issue with the simpler geometries of the input image objects (spheroids as opposed to cuboids). However, I had not explicitly considered the possibility of zero-area bounding boxes/polygons.
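As a concrete illustration of those two checks, a minimal validation sketch over a COCO-format annotation file (the file path is hypothetical):

```python
# Flag zero-area and out-of-range bounding boxes in a COCO JSON file.
import json

with open('annotations/train.json') as f:
    coco = json.load(f)

# Look up each image's dimensions by image id.
img_sizes = {img['id']: (img['width'], img['height']) for img in coco['images']}

for ann in coco['annotations']:
    x, y, w, h = ann['bbox']
    img_w, img_h = img_sizes[ann['image_id']]
    if w <= 0 or h <= 0:
        print(f"zero-area box: annotation {ann['id']}")
    if x < 0 or y < 0 or x + w > img_w or y + h > img_h:
        print(f"out-of-range box: annotation {ann['id']}")
```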
I will report back with my findings, so I will leave this issue open for the moment.
Thanks again.
@ZwwWayne an update for you.
Firstly, with regards to the input data quality, I made an additional check on the training data regarding labelling, paying particular close attention to the smallest bounding box and masks. Whilst I had previously checked to ensure that no bounding boxes were 0.0, I had failed to ensure that the resulting objects were not smaller than a single pixel, which resulted in approximately 1600 object annotations (out of 1.3 million) that had non zero bounding box areas but 0 pixel area masks. These were quite well spread amongst the training set, with a maximum of 10 of these in any one image (with there being between 350 and 500 objects per image). So I have removed these labels from the training set.
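For anyone hitting the same problem, a minimal sketch of this kind of filtering, assuming pycocotools and COCO-format annotations (the file paths are illustrative, not my exact script):

```python
# Drop annotations whose rasterised mask covers less than one pixel,
# even if their bounding box has non-zero area.
import json
from pycocotools.coco import COCO
from pycocotools import mask as mask_utils

coco = COCO('annotations/train.json')
keep_ids = set()
for ann in coco.loadAnns(coco.getAnnIds()):
    rle = coco.annToRLE(ann)           # rasterise polygon/RLE segmentation
    if mask_utils.area(rle) >= 1:      # keep masks of at least one pixel
        keep_ids.add(ann['id'])

with open('annotations/train.json') as f:
    data = json.load(f)
data['annotations'] = [a for a in data['annotations'] if a['id'] in keep_ids]
with open('annotations/train_cleaned.json', 'w') as f:
    json.dump(data, f)
```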
Running a ResNeXt101 Mask R-CNN model, this time on 2 GPUs (deep blue), has resulted in stable training over 72 epochs with better performance, also improving on the 1-GPU (deep green) run of 36 epochs. (Note that the base learning rate of 0.01 at 8 GPUs was linearly scaled with the relative mini-batch size of 16, i.e. a 2/16 or 4/16 factor for the 1-GPU and 2-GPU runs respectively.) This is a significant improvement over the behaviour encountered when training on the dataset with the zero-pixel-area masks present (red, orange and grey).
Combined Loss
BBOX mAP
Hi @ecm200 , Thanks for your feedback and congrats! We are planning to reorganize our documentation with more detailed tutorials for users to debug and begin their own projects. Your experience and report here are very valuable to this effort.
Since the issue seems to be resolved, this issue will be closed. The valuable experience will be included in the documentation in the future. Feel free to reopen this issue if you have any further questions.
@ZwwWayne you are most welcome, and it is an absolute pleasure to provide feedback on this fantastic project. I should be thanking you and the rest of development team!
Please let me know if there's anything more I can do to assist.
Issue Description
I have been successfully training Mask R-CNN ResNet50-FPN (and 101) models on a custom dataset of particle images containing simple spheroid shapes (green loss curves in the graph below). On moving to more complex shapes, such as cuboids, I have found that the model trains successfully for a while, then diverges and eventually produces NaN in the loss functions. The only difference between the training runs is the shape of the particles in the input data. The images themselves, the bounding box and polygon information, and the COCO dataset production have been kept the same as for the successful training on the spheroid particles.
I would welcome any insight people might have into what could be causing divergent behaviour in the training process. I have found that the loss functions diverge for both the ResNet50 and 101 variants so far. I have been using the standard learning rate of 0.02, scaled down from a mini-batch of 16 images on 8 GPUs to a mini-batch of 2 on 1 GPU, i.e. LR = 0.02/8. Q. Could this be an issue related to the greater complexity of the images, requiring a smaller learning rate?
My images have a range of particle content, with some having very few particles in the training image. Q. Could an excess of these kinds of images, where there is relatively little object information, potentially cause a gradient issue and divergence later in the process?
The cuboid model was trained with ~3800 images and validated with 1500 images, and the divergence does not occur straight away; in this example it happened around the 15th epoch, so the network had seen every image in the training set 15 times before the divergent behaviour set in. Looking at the individual loss functions shows that the divergence is present in all of them.
As well as attaching the script, customised functions and dataset information, I have appended the output logs of the training process at the bottom.
Training loss
Validation loss
Other individual loss functions
Model Configuration
The model configuration is heavily based on the standard COCO instance segmentation setup using variants of the Mask R-CNN architecture with a ResNet FPN backbone; the modifications to the standard configuration are described below.
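A minimal sketch of such a configuration, assuming MMDetection v2-style config inheritance (the base file name, learning rate and single-class change are illustrative, not the exact config):

```python
# Sketch of a single-class Mask R-CNN config built on a standard base.
_base_ = './mask_rcnn_r50_fpn_1x_coco.py'

model = dict(
    roi_head=dict(
        bbox_head=dict(num_classes=1),    # one class: 'particle'
        mask_head=dict(num_classes=1)))

# Learning rate scaled for the smaller mini-batch (see discussion above).
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
```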
Reproduction
I am running a custom training script that is heavily based on the training script example shipped with MMDetection.
Custom Code
I have made a custom dataset type, which is heavily based on the COCO dataset type, as I have converted my custom dataset into the COCO format, with the same directory structure and annotations saved into JSON files for train and test.
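A minimal sketch of registering a COCO-based dataset like this, assuming MMDetection v2.x (the class name is illustrative):

```python
# Register a COCO-format dataset with a single 'particle' class.
from mmdet.datasets import CocoDataset
from mmdet.datasets.builder import DATASETS

@DATASETS.register_module()
class ParticleDataset(CocoDataset):
    CLASSES = ('particle',)
```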
I have written a custom image loading hook due to the nature of the images, which are single-channel.
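For reference, one way single-channel images can be read without a custom hook is through the loading pipeline, assuming MMDetection v2.x, where LoadImageFromFile exposes a color_type argument (a sketch, not the hook used here):

```python
# Pipeline sketch: read images as single-channel rather than BGR.
train_pipeline = [
    dict(type='LoadImageFromFile', color_type='grayscale'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    # ... remaining transforms (Resize, Normalize, etc.) as in the base config
]
```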
Dataset
My dataset is a custom dataset comprising images of particles of different shapes and sizes. There is only one class of object: particle.
Environment
Output logs of training