huawei-noah / vega

AutoML tools chain
http://www.noahlab.com.hk/opensource/vega/

nago cifar-10 result replication #95

runninghack closed this issue 3 years ago

runninghack commented 3 years ago

Could you share the .yml file that can replicate the CIFAR-10 results from the NAGO paper?

Only the NAS pipeline is included in the default NAGO example, and several hyperparameters differ from those mentioned in the paper. I tried tuning the hyperparameters but still cannot reproduce the results reported in the NAGO paper.

zhangjiajin commented 3 years ago

What are your performance results? Which hyperparameters have you adjusted in the configuration file?

@runninghack

runninghack commented 3 years ago

Thank you for your response. @zhangjiajin

What are your performance results?

Test accuracy 93.41% on CIFAR-10 (it was supposed to be 96.6% according to the paper)

Which hyperparameters have you adjusted in the configuration file?

Generator hyperparameters in Table 3 in the paper: WS(8, 5, 0.6), ER(1, 0.7), WS(5, 4, 0.2)

Other settings are as follows:

  1. n_parameter_limit: 4.0e6
  2. Data: CIFAR-10 (cutout length: 16, batch size: 96)
  3. Optimizer: SGD (initial learning rate 0.025)
  4. lr_scheduler: StepLR (step_size: 20, gamma: 0.5)
  5. epochs: 600
  6. GPU: Tesla V100
  7. Others: No DropPath, No Auxiliary Towers

As the fullytrain pipeline is not included in the default .yml file, I had to configure the above settings myself. Everything else stays at the default. I'm not sure what else needs to be adjusted to reach the 96.6% top-1 accuracy.
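
For reference, one quick way I sanity-check the 4.0e6 parameter budget is sketched below; this is just a generic PyTorch snippet, not part of the vega pipeline, and the model builder shown is a hypothetical placeholder.

# Hedged sketch: count trainable parameters of a built PyTorch model to check
# that it stays within the ~4.0e6 budget mentioned above.
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Return the number of trainable parameters in the model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Usage (build_nago_model is a hypothetical helper; substitute your own builder):
# model = build_nago_model(desc)
# assert count_parameters(model) <= 4.0e6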

zhangjiajin commented 3 years ago

Thank you for pointing this out. I'll consult with my colleagues. @runninghack

rubinxin commented 3 years ago

@runninghack Thanks for your interest in NAGO.

I'm not sure what else needs to be adjusted to achieve the 96.6% top-1 accuracy.

The discrepancy in the test accuracy is likely due to the lr_scheduler setting, which should be:

  • lr_scheduler: CosineAnnealing (target min lr=0 at the end of 600 epochs)

Two other hyperparameters are set following the DARTS set-up (very likely you have already used these values, but I'll list them below just in case):

I think these should address the problem.
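
For concreteness, here is a minimal PyTorch sketch of that schedule, assuming the vega lr_scheduler config maps onto torch.optim.lr_scheduler.CosineAnnealingLR with T_max equal to the total number of epochs (the model below is only a placeholder, not the NAGO network):

# Hedged sketch: SGD with cosine annealing decayed to lr = 0 over 600 epochs,
# mirroring the suggested trainer settings (lr=0.025, momentum=0.9, weight_decay=3e-4).
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.025, momentum=0.9, weight_decay=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=600, eta_min=0.0)

for epoch in range(600):
    # ... run one training epoch over CIFAR-10, calling optimizer.step() per batch ...
    scheduler.step()  # anneal once per epoch so the lr reaches ~0 at epoch 600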

runninghack commented 3 years ago

Thanks for the info. @rubinxin

Using CosineAnnealing does improve the performance. However, I still cannot reach the CIFAR-10 performance described in the paper. I've trained several models and got test accuracies between 94.591% and 95.252%. That's not even as good as HNAG-RS, so I assume there's still something wrong with my .yml file.

My fullytrain pipeline configuration is as follows:

fullytrain:
    pipe_step:
        type: FullyTrainPipeStep
        models_folder: #Path with the desc file#
    trainer:
        type: Trainer
        epochs: 600
        optimizer:
            type: SGD
            params:
                lr: 0.025
                momentum: 0.9
                weight_decay: !!float 3e-4
        lr_scheduler:
            type: CosineAnnealingLR
            params:
                T_max: 600.0
                eta_min: 0.000
        grad_clip: 5.0
    dataset:
        type: Cifar10
        common:
            batch_size: 96
            train_portion: 0.9
        train:
            cutout_length: 16

and the desc file is as follows:

{"modules": ["custom"],
 "custom": {"type": "NAGO", "stage1_ratio": 0.33, "stage2_ratio": 0.33, "stage3_ratio": 0.33,
            "ch1_ratio": 1, "ch2_ratio": 2, "ch3_ratio": 4,
            "n_param_limit": "4.0e6", "image_size": 32, "num_classes": 10,
            "G3_P": 0.2, "G3_K": 4, "G3_nodes": 5,
            "G2_P": 0.7, "G2_nodes": 1,
            "G1_P": 0.6, "G1_K": 5, "G1_nodes": 8}}

Can you spot anything different from yours?
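
For context, my understanding of how the G1/G2/G3 values above map onto standard random-graph generators (Watts-Strogatz and Erdos-Renyi, following Table 3 of the paper) is sketched below with networkx; this is only an illustration and may not match vega's internal generator calls exactly.

# Hedged sketch: the generator hyperparameters from the desc file expressed as
# standard networkx random-graph generators (illustration only).
import networkx as nx

# G1 = WS(nodes=8, K=5, P=0.6)
g1 = nx.connected_watts_strogatz_graph(n=8, k=5, p=0.6)
# G2 = ER(nodes=1, P=0.7)
g2 = nx.erdos_renyi_graph(n=1, p=0.7)
# G3 = WS(nodes=5, K=4, P=0.2)
g3 = nx.connected_watts_strogatz_graph(n=5, k=4, p=0.2)

print(g1.number_of_edges(), g2.number_of_edges(), g3.number_of_edges())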

runninghack commented 3 years ago

Just saw the vega 1.3 update. The fullytrain pipeline is still missing for NAGO, though.

rubinxin commented 3 years ago

@runninghack Sorry for the late reply. Your configuration looks almost the same as mine, except for very minor differences that I don't think would lead to such a big performance gap:

  1. train_portion: 0.9 --> train_portion: 1.0 (for complete training, we use the entire training set and validate on the test set to get test accuracy; see the sketch after this list)
  2. "stage1_ratio": 0.33, "stage2_ratio": 0.33, "stage3_ratio": 0.33 --> "stage1_ratio": 1/3, "stage2_ratio": 1/3 , "stage3_ratio": 1/3

After detailed checking, I suspect the main cause of the discrepancy is that in VEGA we fix the seed for graph generation so that users can resume training (i.e. all graphs in the same hierarchical level are exactly the same), whereas in the original NAGO we allow the seed to vary (i.e. graphs in the same hierarchical level can differ slightly while following the same generative distribution). To fix this, you can replace the BasicNode class (lines 155-189) in vega/zeus/networks/pytorch/customs/utils/logical_graph.py with the BasicNode class (lines 157-193) here.
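
To illustrate the difference, here is a rough sketch using networkx's Watts-Strogatz generator purely as a stand-in for vega's internal graph builder:

# Hedged sketch: fixed seed vs. varying seed when sampling several graphs for
# the same hierarchical level (networkx is used only for illustration).
import networkx as nx

# Fixed seed (current VEGA behaviour): every graph in the level is identical,
# which keeps training resumable but removes graph diversity within the level.
fixed = [nx.connected_watts_strogatz_graph(8, 5, 0.6, seed=0) for _ in range(3)]
print([sorted(g.edges()) for g in fixed])    # three identical edge lists

# Varying seed (original NAGO behaviour): each graph is a fresh sample from the
# same generative distribution, so graphs in the level differ slightly.
varying = [nx.connected_watts_strogatz_graph(8, 5, 0.6, seed=s) for s in (1, 2, 3)]
print([sorted(g.edges()) for g in varying])  # edge lists generally differ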

I've tried rerunning NAGO with that fix and obtained test accuracies of 96.45% to 96.74%. Hope this solves your issue as well.