lorenmt / reco

The implementation of "Bootstrapping Semantic Segmentation with Regional Contrast" [ICLR 2022].
https://shikun.io/projects/regional-contrast

RandomSampler when creating the dataloaders #29

Closed nysp78 closed 1 year ago

nysp78 commented 1 year ago

Hello, I would like to ask why you are using the RandomSampler when creating the dataloaders for the labeled and unlabeled data. For example, when there are more unlabeled images than labeled ones, a fixed number of samples is drawn from the unlabeled dataset to match the amount of labeled data. Say we have 2000 labeled images and 9000 unlabeled images: at each training epoch we randomly select a subset of 2000 images from the 9000, so we construct two dataloaders with the same length. Does this sampling make appropriate use of the whole unlabeled dataset?

Many thanks

lorenmt commented 1 year ago

As you mentioned, we have different numbers of labelled and unlabelled images, so we use RandomSampler simply to build data loaders with the same number of samples. num_samples sets how many samples are drawn per epoch; as long as num_samples is at least as large as the true dataset size, we should be good to go.
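For readers landing here, this is roughly what that pattern looks like in PyTorch. A minimal sketch, not the repo's actual code: the toy datasets, batch size, and loader names below are placeholders.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Toy stand-ins for the labelled/unlabelled datasets (shapes are arbitrary).
labeled_set = TensorDataset(torch.randn(2000, 3, 32, 32))
unlabeled_set = TensorDataset(torch.randn(9000, 3, 32, 32))

# Draw the same number of samples from both sets each epoch. With
# replacement=True, num_samples can be set freely, so both loaders
# yield the same number of batches regardless of dataset size.
num_samples = len(labeled_set)  # e.g. 2000

labeled_loader = DataLoader(
    labeled_set,
    batch_size=8,
    sampler=RandomSampler(labeled_set, replacement=True, num_samples=num_samples),
)
unlabeled_loader = DataLoader(
    unlabeled_set,
    batch_size=8,
    sampler=RandomSampler(unlabeled_set, replacement=True, num_samples=num_samples),
)

# Both loaders now have the same length, so they can be zipped in training.
for (x_l,), (x_u,) in zip(labeled_loader, unlabeled_loader):
    pass  # supervised + unsupervised losses would be computed here
```

Because each epoch re-draws a fresh random subset of the unlabelled pool, different unlabelled images are seen across epochs even though only num_samples are used per epoch.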

nysp78 commented 1 year ago

Would it be more effective to use all of the unlabeled images and have two dataloaders of different sizes, rather than only sampling from the big pool of unlabeled data? In other words, when would the model perform better: if we use the whole unlabeled set of, say, 9000 images, or if we just randomly sample 2000 images from the 9000 for training?

lorenmt commented 1 year ago

If so, the two dataloaders will have different lengths. You would then need to check for and re-initialise the exhausted data loader, which makes the implementation a bit more complicated.
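A common way to handle loaders of different lengths is to restart the shorter one whenever it runs out. A sketch of that pattern (reusing the hypothetical labeled_loader / unlabeled_loader names from above, not the repo's code):

```python
# Iterate the longer (unlabeled) loader and restart the shorter
# (labeled) loader whenever it is exhausted.
labeled_iter = iter(labeled_loader)
for (x_u,) in unlabeled_loader:
    try:
        (x_l,) = next(labeled_iter)
    except StopIteration:
        # The labeled loader ran out first: re-create its iterator
        # (which reshuffles, given a random sampler) and keep going.
        labeled_iter = iter(labeled_loader)
        (x_l,) = next(labeled_iter)
    # ... forward pass on (x_l, x_u), compute the combined loss
```

This is the extra bookkeeping the comment above refers to; the RandomSampler approach avoids it by making both loaders the same length.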

nysp78 commented 1 year ago

Yes, I see, but would these two approaches achieve approximately the same performance in terms of model IoU? That is what I am asking.

lorenmt commented 1 year ago

I see. I don't know for sure, but I assume the performance should be similar.

nysp78 commented 1 year ago

Thanks for your replies, but let me ask one last thing. Suppose we have 5 labeled images and we sample another 5 unlabeled images to match the size. Could the 5 unlabeled images used in each training step improve the model's performance? I have read that semi-supervised learning is effective when a large number of unlabeled images is involved during training.

lorenmt commented 1 year ago

Yes, semi-supervised learning is effective when you have a large number of unlabelled images. So to get the maximal performance, those 5 images should be different at each step, sampled from a large unlabelled dataset.