razvancaramalau / Sequential-GCN-for-Active-Learning


Trouble reproducing numbers in paper #5

Open corey-snyder opened 3 years ago

corey-snyder commented 3 years ago

Hello,

I am also having trouble reproducing the numbers reported in the paper for CIFAR-10. I corrected a bug in `main.py` where the `--e` option for the number of epochs was set to 20; however, even after increasing it to 200 epochs for training the ResNet18 model, I still cannot reproduce the numbers for multiple active learning algorithms in the repository.
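The change was along these lines (a minimal sketch assuming `main.py` defines the flag with `argparse`; the exact option name and help text may differ in the repo):

```python
import argparse

parser = argparse.ArgumentParser()
# Was default=20, which cuts ResNet18 training far short of the
# 200 epochs used in the paper's CIFAR-10 experiments.
parser.add_argument("--e", type=int, default=200,
                    help="number of epochs for training the task model")
args = parser.parse_args()
```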

I would appreciate it if the authors could verify that this repository reproduces the CIFAR-10 numbers from the paper.

For reference, here are the test accuracies (%) I am getting on CIFAR-10 across 5 trials, reported as mean ± std. dev., for random sampling, UncertainGCN, CoreGCN, CoreSet, and lloss:

| N_labeled | Random | UncertainGCN | CoreGCN | CoreSet | lloss |
|---|---|---|---|---|---|
| 1000 | 47.3240 ± 0.8149 | 48.2180 ± 4.0055 | 48.0840 ± 1.1299 | 46.2820 ± 2.5461 | 46.6760 ± 1.5340 |
| 2000 | 58.2420 ± 2.9056 | 56.7860 ± 1.9462 | 56.9900 ± 3.5209 | 59.5800 ± 2.0735 | 62.6260 ± 1.5213 |
| 3000 | 66.7560 ± 1.8256 | 66.4640 ± 2.8965 | 70.1080 ± 2.7946 | 69.4400 ± 1.9787 | 70.5440 ± 2.7696 |
| 4000 | 73.1400 ± 2.2483 | 75.3840 ± 1.1915 | 74.8480 ± 1.9183 | 75.4380 ± 1.0835 | 76.1500 ± 1.7544 |
| 5000 | 76.5840 ± 1.3432 | 79.9500 ± 1.1942 | 80.8940 ± 1.4228 | 80.2420 ± 1.2786 | 80.8800 ± 1.3885 |

Thank you for your time!

razvancaramalau commented 3 years ago

Hi, these numbers can vary depending on the randomization and on the specific processor/OS you're running the experiments on. I've observed such differences when swapping between machines.
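If you want to tighten the run-to-run variation on a given machine, pinning all the seeds and forcing deterministic cuDNN kernels helps (a rough sketch, not code from this repo):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    """Pin every RNG so repeated runs on one machine match more closely."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN kernels trade some speed for reproducibility.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```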

corey-snyder commented 3 years ago

Thank you for the quick response.

I understand that different random seeds across trials and different CPU/GPU configurations can yield different results. However, in my results, UncertainGCN does not outperform random sampling with statistical significance until we reach 5000 labeled points, after four acquisition stages.
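For concreteness, here is the kind of check I'm doing, using SciPy's t-test from summary statistics with the 5-trial means and standard deviations above (Welch's test; the numbers plugged in are my UncertainGCN vs. Random results at 5000 labels):

```python
from scipy.stats import ttest_ind_from_stats

# UncertainGCN vs. Random at 5000 labeled points, n = 5 trials each.
t, p = ttest_ind_from_stats(mean1=79.9500, std1=1.1942, nobs1=5,
                            mean2=76.5840, std2=1.3432, nobs2=5,
                            equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.4f}")
```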

I would be less concerned if all the numbers were simply lower than those reported in the paper but preserved the same relationships between methods, e.g. CoreGCN and CoreSet clearly outperforming random sampling throughout. Instead, the results I'm seeing have all methods fairly tight together.

I will re-generate the results for a few methods on two separate GPUs to see whether a different device makes a noticeable difference.

In the meantime, I am also curious what level of training accuracy the GCN should achieve during UncertainGCN or CoreGCN acquisition. I am seeing ~80% classification accuracy on the binary labeled/unlabeled task. Is this reasonable, or should the GCN be able to get closer to 100%?
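For reference, the ~80% figure comes from a check along these lines (a sketch; the tensor names are hypothetical, and I'm assuming the GCN emits a sigmoid score per node):

```python
import torch

@torch.no_grad()
def gcn_binary_accuracy(scores: torch.Tensor, labels: torch.Tensor,
                        threshold: float = 0.5) -> float:
    """Fraction of nodes whose labeled/unlabeled prediction is correct.

    scores: sigmoid outputs of the GCN, one per node (hypothetical name).
    labels: 1 for labeled nodes, 0 for unlabeled nodes (hypothetical name).
    """
    preds = (scores >= threshold).float()
    return (preds == labels.float()).float().mean().item()
```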

Thank you again for your time!

corey-snyder commented 3 years ago

Hello again,

After generating results for Random, UncertainGCN, CoreGCN, and CoreSet acquisition on two separate GPUs, I am not seeing a statistically significant difference across devices. I am still observing that the first couple of stages of active learning do not outperform the random baseline.

I am training the ResNet18 model for 200 epochs with the provided LR reduction milestone at epoch 160. Is it possible that I should train the base model for longer? Any guidance would be much appreciated.
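For concreteness, my schedule is equivalent to the following (a sketch of the standard PyTorch setup; torchvision's ResNet-18 stands in for the repo's CIFAR-sized variant, and the SGD hyperparameters shown are common CIFAR-10 defaults rather than values I've confirmed in this repo):

```python
import torch
import torchvision

# Stand-in task model; the repo uses its own CIFAR-sized ResNet18.
model = torchvision.models.resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# 10x LR drop at epoch 160 of 200, matching the provided milestone.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[160], gamma=0.1)

for epoch in range(200):
    # train_one_epoch(model, optimizer)  # training loop elided
    scheduler.step()
```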