siit-vtt / semi-supervised-learning-pytorch

Several SSL methods (Pi model, Mean Teacher) are implemented in PyTorch
MIT License

How do the results compare to the paper "Realistic Evaluation of Deep Semi-Supervised Learning Algorithms"? #1

Open sunfanyunn opened 5 years ago

heitorrapela commented 5 years ago

Good question. I updated and ran the baseline, but I had to stop the training. Do I need to run it for 1200 epochs? With 35 epochs I was already getting good results on CIFAR-10.

Jongchan commented 5 years ago
python train.py -a=wideresnet -m=baseline -o=adam -b=225 --dataset=cifar10_zca --gpu=6,7 --lr=0.003 --boundary=0
Epoch: [1199][0/18] Time 0.677 (0.677)  Data 0.580 (0.580)  Loss 0.0085 (0.0085)    Prec@1 99.556 (99.556)  Prec@5 100.000 (100.000)
Epoch: [1199][0/18] Time 0.585 (0.585)  Data 0.491 (0.491)  Loss 0.0145 (0.0145)    Prec@1 99.556 (99.556)  Prec@5 100.000 (100.000)
Epoch: [1199][0/18] Time 0.659 (0.659)  Data 0.580 (0.580)  Loss 0.0081 (0.0081)    Prec@1 100.000 (100.000)    Prec@5 100.000 (100.000)
Epoch: [1199][0/18] Time 0.692 (0.692)  Data 0.611 (0.611)  Loss 0.0058 (0.0058)    Prec@1 100.000 (100.000)    Prec@5 100.000 (100.000)
Epoch: [1199][0/18] Time 0.637 (0.637)  Data 0.548 (0.548)  Loss 0.0158 (0.0158)    Prec@1 100.000 (100.000)    Prec@5 100.000 (100.000)
Epoch: [1199][0/18] Time 0.629 (0.629)  Data 0.546 (0.546)  Loss 0.0251 (0.0251)    Prec@1 99.556 (99.556)  Prec@5 100.000 (100.000)
Epoch: [1199][0/18] Time 0.609 (0.609)  Data 0.518 (0.518)  Loss 0.0037 (0.0037)    Prec@1 100.000 (100.000)    Prec@5 100.000 (100.000)
Epoch: [1199][0/18] Time 0.599 (0.599)  Data 0.516 (0.516)  Loss 0.0107 (0.0107)    Prec@1 100.000 (100.000)    Prec@5 100.000 (100.000)
Epoch: [1199][0/18] Time 0.629 (0.629)  Data 0.544 (0.544)  Loss 0.0212 (0.0212)    Prec@1 98.667 (98.667)  Prec@5 100.000 (100.000)
Epoch: [1199][0/18] Time 0.647 (0.647)  Data 0.566 (0.566)  Loss 0.0350 (0.0350)    Prec@1 98.667 (98.667)  Prec@5 100.000 (100.000)
Valid: [0/23]   Time 0.383 (0.383)  Loss 1.0591 (1.0591)    Prec@1 78.667 (78.667)  Prec@5 97.778 (97.778)
 ****** Prec@1 78.360 Prec@5 97.440 Loss 1.198 
Test: [0/45]    Time 0.457 (0.457)  Loss 1.1571 (1.1571)    Prec@1 77.333 (77.333)  Prec@5 98.222 (98.222)
 ****** Prec@1 75.990 Prec@5 97.670 Loss 1.342 
Best test precision: 77.950

I have just finished running the baseline code, and the best precision is 77.95% (I would expect around 80% according to the paper's results). I suspect the L1 regularization has little effect, since it is averaged over millions of parameters.
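To illustrate that point, here is a rough sketch (my own reconstruction, not this repo's actual loss code) of averaged vs. summed L1 over a model's parameters:

```python
import torch
import torch.nn as nn

def l1_penalty(model: nn.Module, reduce: str = "mean") -> torch.Tensor:
    # Concatenate the absolute values of every parameter into one flat vector.
    flat = torch.cat([p.abs().flatten() for p in model.parameters()])
    return flat.mean() if reduce == "mean" else flat.sum()

# For a WRN-28-2 (~1.5M parameters), the averaged penalty is roughly 1e-6 of the
# summed penalty, so a typical L1 weight barely changes the loss:
# loss = ce_loss + l1_weight * l1_penalty(model, reduce="mean")  # almost no effect
# loss = ce_loss + l1_weight * l1_penalty(model, reduce="sum")   # the usual L1 term
```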

Dear @heitorrapela, what result did you get?

P.S. I just found out that the widen factor in the code is 3... why?

heitorrapela commented 5 years ago

@Jongchan, I didn't finish the training, but the result you got seems good.

Jongchan commented 5 years ago

The runs below use WRN-28-2 (not 3!).

@heitorrapela I just got the results over the weekend.

Mean Teacher

python train.py -a=wideresnet -m=mt -o=adam -b=225 --dataset=cifar10_zca --gpu=6,7 --lr=0.0004 --boundary=0

Last log

Learning rate schedule for Adam
Learning rate: 0.000080
Mean Teacher model 
Epoch: [1199][0/367]    Time 1.057 (1.057)  Data 0.948 (0.948)  Loss 0.0087 (0.0087)    LossCL 0.0040 (0.0040)  Prec@1 100.000 (100.000)    Prec@5 100.000 (100.000)    PrecT@1 100.000 (100.000)   PrecT@5 100.000 (100.000)
Epoch: [1199][100/367]  Time 0.091 (0.119)  Data 0.000 (0.023)  Loss 0.0083 (0.0083)    LossCL 0.0052 (0.0063)  Prec@1 100.000 (99.982) Prec@5 100.000 (100.000)    PrecT@1 100.000 (99.973)    PrecT@5 100.000 (100.000)
Epoch: [1199][200/367]  Time 0.096 (0.115)  Data 0.000 (0.020)  Loss 0.0063 (0.0086)    LossCL 0.0054 (0.0068)  Prec@1 100.000 (99.978) Prec@5 100.000 (99.996) PrecT@1 100.000 (99.973)    PrecT@5 100.000 (100.000)
Epoch: [1199][300/367]  Time 0.098 (0.115)  Data 0.000 (0.020)  Loss 0.0075 (0.0084)    LossCL 0.0043 (0.0065)  Prec@1 100.000 (99.982) Prec@5 100.000 (99.997) PrecT@1 100.000 (99.982)    PrecT@5 100.000 (100.000)
Valid: [0/23]   Time 0.471 (0.471)  Loss 0.7086 (0.7086)    Prec@1 84.889 (84.889)  Prec@5 97.333 (97.333)
 ****** Prec@1 78.800 Prec@5 96.380 Loss 1.038 
Test: [0/45]    Time 0.483 (0.483)  Loss 0.8086 (0.8086)    Prec@1 82.667 (82.667)  Prec@5 96.000 (96.000)
 ****** Prec@1 77.630 Prec@5 96.810 Loss 1.077 
Valid: [0/23]   Time 0.467 (0.467)  Loss 0.7076 (0.7076)    Prec@1 84.444 (84.444)  Prec@5 96.444 (96.444)
 ****** Prec@1 78.820 Prec@5 96.440 Loss 1.033 
Test: [0/45]    Time 0.437 (0.437)  Loss 0.8259 (0.8259)    Prec@1 81.333 (81.333)  Prec@5 96.444 (96.444)
 ****** Prec@1 77.920 Prec@5 96.690 Loss 1.073 
Best test precision: 79.490
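(For anyone skimming the logs: LossCL is presumably the consistency term. The generic Mean Teacher recipe looks roughly like the sketch below; this is the textbook formulation with an assumed EMA decay of 0.999, not this repository's exact code.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student: nn.Module, teacher: nn.Module, ema_decay: float = 0.999):
    # Teacher weights are an exponential moving average of the student weights.
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(ema_decay).add_(p_s, alpha=1.0 - ema_decay)

def consistency_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor):
    # MSE between the two softmax outputs; the teacher side carries no gradient.
    return F.mse_loss(F.softmax(student_logits, dim=1),
                      F.softmax(teacher_logits.detach(), dim=1))
```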

Pi model

python train.py -a=wideresnet -m=pi -o=adam -b=225 --dataset=cifar10_zca --gpu=6,7 --lr=0.0003 --boundary=0

Last log

Learning rate schedule for Adam
Learning rate: 0.000060
Pi model 
Epoch: [1199][0/367]    Time 1.046 (1.046)  Data 0.924 (0.924)  Loss 0.0090 (0.0090)    LossPi 0.0022 (0.0022)  Prec@1 100.000 (100.000)    Prec@5 100.000 (100.000)
Epoch: [1199][100/367]  Time 0.146 (0.138)  Data 0.000 (0.024)  Loss 0.0109 (0.0089)    LossPi 0.0018 (0.0028)  Prec@1 100.000 (99.982) Prec@5 100.000 (100.000)
Epoch: [1199][200/367]  Time 0.119 (0.141)  Data 0.000 (0.022)  Loss 0.0072 (0.0085)    LossPi 0.0009 (0.0030)  Prec@1 100.000 (99.987) Prec@5 100.000 (100.000)
Epoch: [1199][300/367]  Time 0.123 (0.141)  Data 0.000 (0.021)  Loss 0.0079 (0.0084)    LossPi 0.0011 (0.0029)  Prec@1 100.000 (99.991) Prec@5 100.000 (100.000)
Valid: [0/23]   Time 0.509 (0.509)  Loss 0.7784 (0.7784)    Prec@1 84.000 (84.000)  Prec@5 96.889 (96.889)
 ****** Prec@1 79.180 Prec@5 97.220 Loss 0.947 
Test: [0/45]    Time 0.463 (0.463)  Loss 0.9564 (0.9564)    Prec@1 80.444 (80.444)  Prec@5 97.333 (97.333)
 ****** Prec@1 78.330 Prec@5 97.420 Loss 1.002 
Best test precision: 78.650
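(Likewise, LossPi is presumably the Pi-model consistency term; in its generic form it is just an MSE between two stochastic forward passes of the same batch. Again a textbook sketch, not this repository's exact code.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pi_consistency_loss(model: nn.Module, x_aug1: torch.Tensor, x_aug2: torch.Tensor):
    # Two forward passes of the same images under different augmentation / dropout noise.
    logits1 = model(x_aug1)
    with torch.no_grad():               # the second pass acts as the target branch
        logits2 = model(x_aug2)
    return F.mse_loss(F.softmax(logits1, dim=1), F.softmax(logits2, dim=1))
```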

I haven't run the WRN-28-2 baseline, but these numbers still seem far from what the paper reports. On the positive side, the SSL results must be better than a WRN-28-2 baseline would be, because they already beat the WRN-28-3 baseline performance above.

I wonder whether the official TensorFlow code can reproduce the paper's results... :(

Jongchan commented 5 years ago

With the TensorFlow version I was able to get the numbers reported in the paper, so I can safely assume the official code also works well for the SSL methods; I will validate that later, though. One thing to note about the official code is that it doesn't use any regularization, which is a bit different from the NeurIPS paper (and its supplement).

It may take a while to go over the code thoroughly, but for now I suspect the inferior results in this repository come from a different normalization or a different data split. I will report back after running with the same data and split.

CheukNgai commented 5 years ago

@Jongchan Hi, I am working on reproducing the paper in PyTorch too. I've checked the WideResNet implementation and found that "activation before res" is set opposite to the TF code. I think you could give that a try.
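To be concrete, this is how I read the flag (a sketch of the idea only; the module and argument names here are my own, not those of the TF code or this repo):

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation WRN basic block with an explicit activate_before_residual switch."""
    def __init__(self, in_planes, out_planes, stride, activate_before_residual=False):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.act = nn.LeakyReLU(0.1, inplace=True)
        self.conv1 = nn.Conv2d(in_planes, out_planes, 3, stride, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_planes)
        self.conv2 = nn.Conv2d(out_planes, out_planes, 3, 1, 1, bias=False)
        self.shortcut = (nn.Conv2d(in_planes, out_planes, 1, stride, bias=False)
                         if stride != 1 or in_planes != out_planes else nn.Identity())
        self.activate_before_residual = activate_before_residual

    def forward(self, x):
        pre = self.act(self.bn1(x))
        # The flag decides whether the shortcut sees the activated tensor or the raw input.
        residual = pre if self.activate_before_residual else x
        out = self.conv2(self.act(self.bn2(self.conv1(pre))))
        return out + self.shortcut(residual)
```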

Jongchan commented 5 years ago

@CheukNgai Hi, I'm currently not using the code from this repository, but as you pointed out, it does seem to use the opposite setting from the TF implementation; I had missed that point. I am currently using the code from https://github.com/xternalz/WideResNet-pytorch/blob/master/wideresnet.py with ReLU changed to LeakyReLU.

Now I can reliably reach the baseline performance from the NeurIPS paper, around a 20.2~20.4% error rate with a plain WRN-28-2. The main changes were to the data pre-processing (the normalization part) and removing the L1/L2 regularization.

Counter-intuitively, removing the L1/L2 regularization improves top-1 accuracy, even though CIFAR-10 is a dataset that is easy to overfit.
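For reference, the model setup amounts to roughly the following (a sketch; the xternalz constructor arguments and the 0.1 negative slope are written from memory and may not match exactly what I ran):

```python
import torch.nn as nn
from wideresnet import WideResNet  # xternalz/WideResNet-pytorch

def relu_to_leaky(module: nn.Module, negative_slope: float = 0.1) -> nn.Module:
    # Recursively swap every nn.ReLU for nn.LeakyReLU.
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.LeakyReLU(negative_slope, inplace=True))
        else:
            relu_to_leaky(child, negative_slope)
    return module

# WRN-28-2 for CIFAR-10 (constructor arguments follow the xternalz repo as I remember them).
model = relu_to_leaky(WideResNet(depth=28, num_classes=10, widen_factor=2, dropRate=0.0))
```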

CheukNgai commented 5 years ago

@Jongchan could you give more details on your preprocessing implementation? It would help me a lot! Thank you!

Jongchan commented 5 years ago

@CheukNgai Sorry, my comment must have misled you. I just used the same preprocessing as mentioned in this repository's README (GCN and ZCA) plus Gaussian noise with std 0.15. Previously I had been using mean-std normalization.
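Concretely, the preprocessing I mean looks roughly like this (my own reconstruction of GCN + ZCA whitening + input noise; the GCN scale of 55 and the ZCA epsilon are assumptions, not values read off this repo):

```python
import numpy as np

def global_contrast_normalize(x, scale=55.0, eps=1e-8):
    # x: (N, D) flattened images; per-image mean subtraction and norm scaling.
    x = x - x.mean(axis=1, keepdims=True)
    norm = np.sqrt((x ** 2).sum(axis=1, keepdims=True))
    return scale * x / np.maximum(norm, eps)

def zca_fit(x, eps=1e-6):
    # Fit the ZCA whitening transform on the (GCN-normalized) training set.
    mean = x.mean(axis=0)
    cov = np.cov(x - mean, rowvar=False)
    u, s, _ = np.linalg.svd(cov)
    whitening = u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T
    return mean, whitening

def zca_apply(x, mean, whitening):
    return (x - mean) @ whitening

def add_input_noise(x, std=0.15, rng=None):
    # Gaussian input noise (std 0.15 as mentioned above), applied at training time only.
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, std, size=x.shape)
```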

Anyway, try removing L1/L2, or setting the L1 weight to a very, very small number.