google-research / ssl_detection

Semi-supervised learning for object detection
Apache License 2.0

Confirmation for some training details #7

Closed · strongwolf closed this issue 4 years ago

strongwolf commented 4 years ago

Hi. I want to confirm some details of the second, self-training stage. Are all the hyper-parameters (including the batch size, the positive/negative thresholds, the number of proposals in the RCNN head, etc.) the same for both the supervised and unsupervised losses? Also, is the unsupervised loss imposed on both the RPN and the RCNN head? Thanks.

zizhaozhang commented 4 years ago

Yes, all hyper-parameters are treated the same for data with human labels and data with pseudo labels (predicted offline).

It is done simply by calling the forward function multiple times, once on each member of the pair: https://github.com/google-research/ssl_detection/blob/master/detection/modeling/generalized_stac_rcnn.py#L181
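
For intuition, here is a minimal sketch of that pattern; `model_forward`, `labeled_batch`, and `pseudo_batch` are illustrative stand-ins, not names from the repository:

```python
# Sketch of the two-forward-pass pattern described above (illustrative
# names, not the repository's actual API).

def stac_loss(model_forward, labeled_batch, pseudo_batch, weight_u=1.0):
    """Run the same detector forward pass on both batches and sum losses.

    Both passes share all hyper-parameters (batch size, RPN/RCNN thresholds,
    proposal counts); only the label source differs.
    """
    loss_l = model_forward(labeled_batch)   # human labels
    loss_u = model_forward(pseudo_batch)    # offline pseudo labels
    return loss_l + weight_u * loss_u
```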

strongwolf commented 4 years ago

Thank you very much. I have two more questions.
First, in the first stage, is the learning schedule for '1x' still 12k, 16k, 18k iterations? If so, I think that number is too large for the 1, 2, 5, and 10% datasets: it comes to over 100 epochs at batch size 8. Second, is the second training stage fine-tuned from the first stage, or trained from ImageNet weights?

zizhaozhang commented 4 years ago

1. Yes, your understanding is right: we kept the iteration counts the same for both the 1st and 2nd stages, so with a smaller amount of labeled data this is equivalent to more epochs.

2. We train from ImageNet weights, using both the unlabeled data (with pseudo labels) and the labeled data.
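
As a back-of-the-envelope check of the "more epochs" effect (assuming the ~118,287-image COCO train2017 split; the helper below is illustrative):

```python
def effective_epochs(steps, batch_size, num_labeled):
    """Passes over the labeled set implied by a fixed step budget."""
    return steps * batch_size / num_labeled

# 1x schedule (18k steps) at batch size 8, COCO train2017 (~118,287 images):
for frac in (0.01, 0.02, 0.05, 0.10):
    n = round(118_287 * frac)
    print(f"{frac:.0%} labeled: ~{effective_epochs(18_000, 8, n):.0f} epochs")
# 1% labeled: ~122 epochs; 10% labeled: ~12 epochs.
```
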
strongwolf commented 4 years ago

I have one more question. The number of unlabeled images is much larger than the number of labeled ones. If each per-GPU batch contains one labeled image and one unlabeled image, a problem may arise: by the time the model stops under-fitting the unlabeled data, it has already over-fitted the labeled data. I don't know whether this is an issue. When I tried to reproduce the method in another framework, I found that in the 10% labeled + 90% unlabeled setting, performance does not increase much when the learning rate is decayed at the 9th epoch. In contrast, in the 10% labeled + 20% unlabeled setting, performance does increase when the learning rate is decayed, and the final mAP is higher than in the 10% + 90% setting.

zizhaozhang commented 4 years ago

Hi, thanks for the follow-up.

I am not quite sure what the question is here: under-fitting vs. over-fitting, generalization to your new framework, or the learning rate decay not improving performance much? Would you mind elaborating and separating the questions?

strongwolf commented 4 years ago

> Hi, thanks for the follow-up.
>
> I am not quite sure what the question is here: under-fitting vs. over-fitting, generalization to your new framework, or the learning rate decay not improving performance much? Would you mind elaborating and separating the questions?

I have trained with your code and everything is fine, but when I reproduce the method in another framework some things confuse me. My learning schedule is defined in epochs over the unlabeled data, and I decay the learning rate at the 8th epoch. In the 10% labeled + 90% unlabeled case, the unlabeled data is nine times the labeled data, so 8 epochs over the unlabeled data means 72 epochs over the labeled data. I think 72 epochs is too long for 10% labeled data and the model has over-fitted it by then, which I guess is why performance does not increase when the learning rate is decayed. In the 10% labeled + 20% unlabeled case, 8 epochs over the unlabeled data means only 16 epochs over the labeled data, and performance does increase when the learning rate is decayed, because 16 epochs is acceptable.
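
Spelling out the arithmetic in that paragraph (assuming the schedule is counted in epochs over the unlabeled set; the helper name is illustrative):

```python
def labeled_epochs(unlabeled_epochs, unlabeled_pct, labeled_pct):
    """Epochs over the labeled set implied by a schedule counted in
    epochs over the unlabeled set."""
    return unlabeled_epochs * unlabeled_pct / labeled_pct

print(labeled_epochs(8, 90, 10))  # 72.0: 10% label / 90% unlabel
print(labeled_epochs(8, 20, 10))  # 16.0: 10% label / 20% unlabel
```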

I am also not sure how important the ratio between labeled and unlabeled data in a batch is. In the classification literature, many papers claim that the unlabeled batch should be larger than the labeled one.

kihyuks commented 4 years ago

We decided to define the training schedule in terms of training "steps" (e.g., 12k, 16k, 18k iterations) rather than "epochs" in this work. In our experiments we used the exact same number of training steps for the 1, 2, 5, and 10% labeled-data settings. This might be suboptimal for certain settings, but we observed consistent performance improvements while avoiding extra hyperparameter tuning.
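
For concreteness, a piecewise-constant schedule over those fixed milestones might look like the sketch below; the base learning rate and helper name are illustrative, not taken from the repository:

```python
import math

def lr_at_step(step, base_lr=0.01, milestones=(12_000, 16_000), gamma=0.1):
    """Piecewise-constant LR: multiply by `gamma` at each milestone,
    with an 18k-step budget regardless of labeled-set size."""
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr *= gamma
    return lr

assert math.isclose(lr_at_step(11_999), 0.01)
assert math.isclose(lr_at_step(12_000), 0.001)
assert math.isclose(lr_at_step(16_000), 0.0001)
```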

We haven't tried increasing the size of the unlabeled batch in this work due to a tight GPU memory budget, but it could be a good addition for a possible performance boost.
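
If memory allowed, one hypothetical way to realize a larger unlabeled batch (the "mu" ratio common in classification SSL, e.g., FixMatch) would be to average over several pseudo-labeled sub-batches per labeled batch. This is a sketch of an untried variant, not the authors' method, and the names are illustrative:

```python
def stac_loss_mu(model_forward, labeled_batch, pseudo_batches, weight_u=1.0):
    """Variant of the two-pass loss with mu = len(pseudo_batches) unlabeled
    sub-batches per labeled batch (illustrative, untested)."""
    loss_l = model_forward(labeled_batch)
    loss_u = sum(model_forward(b) for b in pseudo_batches) / len(pseudo_batches)
    return loss_l + weight_u * loss_u
```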

Chrisfsj2051 commented 4 years ago

Hi @kihyuks and @zizhaozhang, it's great to see such interesting work with remarkable results.

However, I have some difficulty understanding the training configs. According to the code, only one image is processed in a single "step". However, it seems that in stage 2 a "step" contains two images (one labeled and one unlabeled). In that case, the number of training samples is steps in stage 1 and steps*2 in stage 2.

I would really appreciate it if you could point out whether I have understood this correctly.

zizhaozhang commented 4 years ago

@Chrisfsj2051 Your understanding is correct: although the step counts are the same, stage 2 (the SSL setting) views more images than stage 1.
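
In other words, a trivial sanity check of the counting (with an illustrative step budget):

```python
steps = 18_000                 # illustrative 1x budget
images_stage1 = steps * 1      # stage 1: one labeled image per step
images_stage2 = steps * 2      # stage 2: a labeled/pseudo-labeled pair per step
assert images_stage2 == 2 * images_stage1
```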