justinkay / aldi

Official implementation of "Align and Distill: Unifying and Improving Domain Adaptive Object Detection"
https://aldi-daod.github.io/

Questions about Model Training #23

Closed helia-mohamadi closed 1 month ago

helia-mohamadi commented 1 month ago

Hi Justin,

Thank you for your valuable work. I have a few questions regarding model training.

Question 1: Could you please explain the formula for calculating the number of epochs needed based on the number of iterations, given the number of images and the batch size (which is 16 by default in the code)?

Question 2: Is there a mathematical relationship between the number of iterations and steps? I mean this part:

```yaml
SOLVER:
  STEPS: (14999,)
  MAX_ITER: 15000
```

Question 3: For calculating iterations during model training in the domain adaptation phase, should we consider the number of source data and target data (using the sum of these two datasets as the data amount for this phase)?

justinkay commented 1 month ago

Hi @helia-mohamadi, thank you for your interest.

Question 1:

The default "total batch size" is actually 48 in the configs supplied: https://github.com/justinkay/aldi/blob/5e9f993ea486fdf64b2ea82cf0bd01532ec0940d/configs/Base-RCNN-FPN.yaml#L15

We chose this value so that we could fairly compare different ratios of source:target data as in Figure 6c in the paper. Depending on your dataset this may not be the optimal batch size for best results, just FYI.

The total batch size is then divided between source and target data based on the batch_contents and batch_ratios specified in the config: https://github.com/justinkay/aldi/blob/5e9f993ea486fdf64b2ea82cf0bd01532ec0940d/aldi/trainer.py#L212-L222
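Roughly, the split works like this (an illustrative sketch only, not the actual trainer code; the helper name is made up):

```python
# Illustrative sketch (not the actual trainer code) of how BATCH_RATIOS
# carve SOLVER.IMS_PER_BATCH up among the datasets in BATCH_CONTENTS.
def split_batch(total_batch_size, batch_ratios):
    batch_sizes = [total_batch_size * r // sum(batch_ratios) for r in batch_ratios]
    # The trainer asserts the per-dataset sizes add back up to the total,
    # so the ratios need to divide the total batch size evenly.
    assert sum(batch_sizes) == total_batch_size
    return batch_sizes

split_batch(48, (1, 1))  # [24, 24] -> the default ALDI split
```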

In the default source-only and oracle configs, the full batch of 48 images comes from source or target, respectively. So one epoch is num_images/48 iterations, i.e. iterations = epochs * num_images / 48.

In the ALDI configs, by default each iteration will have 24 images from source and 24 from target. So iterations = epochs * num_images / 24, counting epochs over either the source set or the target set.
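Putting that together as a quick back-of-the-envelope helper (the dataset size below is just an example):

```python
# Iterations needed to cover a dataset `epochs` times, given how many
# images are drawn from that dataset at each training iteration.
def iters_for_epochs(num_images, epochs, images_per_iter):
    return round(epochs * num_images / images_per_iter)

# Source-only / oracle configs: all 48 images per iteration come from one dataset.
iters_for_epochs(num_images=30_000, epochs=5, images_per_iter=48)  # 3125

# ALDI configs with the default 1:1 batch ratio: 24 source + 24 target per iteration.
iters_for_epochs(num_images=30_000, epochs=5, images_per_iter=24)  # 6250
```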

Question 2:

STEPS and MAX_ITER are Detectron2 defaults, and they are a bit confusing. MAX_ITER is what you want -- the total number of training iterations. STEPS is a tuple used by the default learning rate scheduler and determines when the learning rate is decreased: https://github.com/facebookresearch/detectron2/blob/2a420edb307c9bdf640f036d3b196bed474b8593/detectron2/solver/build.py#L289-L290

We keep a constant LR the whole time, which is why STEPS is always a tuple with one entry equal to MAX_ITER - 1. It's a bit of a hack to avoid changing the LR scheduler.
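If you want the same effect for a different MAX_ITER, the equivalent with the plain Detectron2 config API would look something like this (just a sketch):

```python
# Sketch: pin the (single) LR drop to the final iteration so the learning
# rate stays effectively constant, without swapping out the LR scheduler.
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.SOLVER.MAX_ITER = 15000
cfg.SOLVER.STEPS = (cfg.SOLVER.MAX_ITER - 1,)  # the drop lands on the last iteration only
```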

Question 3:

I think this should now be answered in question 1.

Thanks for your interest and please let me know if you have any further questions.

helia-mohamadi commented 1 month ago

Hello again, thank you for your attention and for answering my questions.

To ensure that my uncertainties do not lead me to mistakes in the future, I would like to perform these calculations for my own data and get your confirmation.

Let's assume I have 30,000 source images and 43,000 target images. I am also using a single 4090 GPU to train the model.

With these assumptions, for the source-only phase and a desired 5 epochs, I should have 3,125 iterations. Also, for the domain adaptation phase and a desired 5 epochs, I should use 7,605 iterations. Is that correct?

justinkay commented 1 month ago

for the source-only phase and a desired 5 epochs, I should have 3,125 iterations

Yes!

for the domain adaptation phase and a desired 5 epochs, I should use 7,605 iterations

During domain adaptation the batch of 48 will be split between source and target (by default, 24 images from each). So for 5 epochs through your source data, this would mean 5*30,000/24 = 6250 iterations; or for 5 epochs through your target data, this would be 5*43000/24 = 8958 iterations. Since your source and target datasets are different sizes the epoch length (in iterations) won't match up unless you change the batch ratio, but we have found that keeping a 1:1 source:target batch ratio leads to best results so I would recommend keeping that default.

helia-mohamadi commented 1 month ago

Thank you. So, I can choose between 6250 and 8958 if I want to keep a 1:1 batch ratio. (I think I have to use 8958 to ensure that the larger target dataset is fully utilized during training?) But if I want the two epoch lengths to match by changing the batch ratio, what ratio should I use, and how should the iteration count change?

justinkay commented 1 month ago

Yes, if you want to ensure you train for a specific number of epochs on source or target with a 1:1 batch ratio, you would need to make your total iterations a multiple of either 6250 or 8958.

To perfectly match them, you would need a batch ratio of 6250:8958. This obviously doesn't divide nicely so you'd need to approximate it. Something like 2:3 would be closer. But you'd need to make sure everything divides nicely with SOLVER.IMS_PER_BATCH, SOLVER.IMS_PER_GPU, and the number of GPUs you train with.

For example if you set SOLVER.IMS_PER_BATCH = 50, SOLVER.IMS_PER_GPU = 2, and DATASETS.BATCH_RATIOS = (2,3), this would wind up with 20 source images and 30 target images per training step (using gradient accumulation). This would be okay if training with 1 GPU because both 20 and 30 are divisible by SOLVER.IMS_PER_GPU*num_gpus = 2, but not with 2 GPUs, because 30 is not divisible by SOLVER.IMS_PER_GPU*num_gpus = 4. Make sense?
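If it helps, those constraints can be written down as a quick check (an illustrative helper, not something that exists in the repo):

```python
# Illustrative check (not part of the repo) for the constraints above:
# each per-dataset share implied by the batch ratios must be an integer,
# and each share must be divisible by IMS_PER_GPU * num_gpus.
def batch_config_ok(ims_per_batch, batch_ratios, ims_per_gpu, num_gpus):
    for r in batch_ratios:
        share = ims_per_batch * r / sum(batch_ratios)
        if share != int(share):
            return False  # ratios don't split the total batch size into integers
        if int(share) % (ims_per_gpu * num_gpus) != 0:
            return False  # share can't be spread evenly over the GPU steps
    return True

batch_config_ok(50, (2, 3), ims_per_gpu=2, num_gpus=1)  # True  (20 + 30)
batch_config_ok(50, (2, 3), ims_per_gpu=2, num_gpus=2)  # False (30 % 4 != 0)
```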

justinkay commented 1 month ago

Hi @helia-mohamadi, closing this for now, please let us know if you have any further questions!

helia-mohamadi commented 1 month ago

Thank you very much for your response. I will try different modes, and if I have any more questions, I will definitely ask you.

helia-mohamadi commented 1 week ago

Hello again. First of all, I tried 6,250 iterations, but the accuracies of phase 1 and phase 2 were almost the same (about 58.28).

Then I tried changing the batch ratio to 2:3, but I got this error:

```
assert sum(batch_sizes) == total_batch_size, f"sum(batch_sizes)={sum(batch_sizes)} must equal total_batch_size={total_batch_size}"
AssertionError: sum(batch_sizes)=47 must equal total_batch_size=48
```

This is my config:

BASE: "./Base-RCNN-FPN-ship_strongaug_ema.yaml" MODEL: WEIGHTS: "/home/Codes/ALDI/output/1-1/src_val_model_best.pth" EMA: ENABLED: True DATASETS: UNLABELED: ("tgt_train",) BATCH_CONTENTS: ("labeled_strong", "unlabeled_strong") BATCH_RATIOS: (2,3) DOMAIN_ADAPT: TEACHER: ENABLED: True DISTILL: HARD_ROIH_CLS_ENABLED: False HARD_ROIH_REG_ENABLED: False HARD_OBJ_ENABLED: False HARD_RPN_REG_ENABLED: False ROIH_CLS_ENABLED: True OBJ_ENABLED: True ROIH_REG_ENABLED: True RPN_REG_ENABLED: True AUG: LABELED_INCLUDE_RANDOM_ERASING: True UNLABELED_INCLUDE_RANDOM_ERASING: False LABELED_MIC_AUG: False UNLABELED_MIC_AUG: True SOLVER: STEPS: (15344,) MAX_ITER: 15345 BACKWARD_AT_END: False OUTPUT_DIR: "/home/Codes/ALDI/output/1-2/2,3/"

When I set the batch ratio to 1:2, I no longer got this error (but I need a 2:3 ratio).

justinkay commented 1 week ago

Hi @helia-mohamadi , can you please post your entire config.yaml and log files? Thanks.

helia-mohamadi commented 1 week ago

Of course, here you go: log.txt config.txt

helia-mohamadi commented 4 days ago

Hello again, I also have another question about model training: I trained the first phase three times with almost the same number of iterations. But when I test them, the first time I got 30% for mAP@0.5, the second time 64%, and the third time 58%. Why did this happen, and how can I prevent it in the future? I need the outcome of this stage to be reproducible rather than random. Thank you.

justinkay commented 4 days ago

Of course, here you go: log.txt config.txt

Hi @helia-mohamadi, this issue is because your IMS_PER_BATCH (48) does not divide evenly according to your batch ratios (2,3). There is no way to divide 48 into two integers with the ratio 2:3. You could try an IMS_PER_BATCH of 40 and that should work for you.
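Quick sanity check of that arithmetic:

```python
# 48 cannot be split into two integers with a 2:3 ratio, which is what
# triggers the sum(batch_sizes)=47 assertion; 40 splits cleanly.
total, ratios = 48, (2, 3)
[total * r / sum(ratios) for r in ratios]  # [19.2, 28.8] -> not integers

total = 40
[total * r / sum(ratios) for r in ratios]  # [16.0, 24.0] -> 16 source + 24 target
```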

Hello again, I also have another question about model training: I trained the first phase three times with almost the same number of iterations. But when I test them, the first time I got 30% for mAP@0.5, the second time 64%, and the third time 58%. Why did this happen, and how can I prevent it in the future? I need the outcome of this stage to be reproducible rather than random. Thank you.

Because you are using a custom dataset there is probably not much advice I can offer you, but if you post the training curves here I can take a look. The first phase is mostly just standard object detector training, so you may want to try tweaking the typical training hyperparameters (learning rate, training for more/less steps, etc.) until you get satisfactory results.