What are your performance results? Which hyperparameters have you adjusted in the configuration file?
@runninghack
Thank you for your response. @zhangjiajin
"What are your performance results?"
Test accuracy of 93.41% on CIFAR-10 (it was supposed to be 96.6% according to the paper).
"Which hyperparameters have you adjusted in the configuration file?"
The generator hyperparameters from Table 3 of the paper: WS(8, 5, 0.6), ER(1, 0.7), WS(5, 4, 0.2).
Other settings: as the fullytrain pipeline is not included in the default yml file, I had to set the above hyperparameters myself; everything else stays at the defaults. I'm not sure what else needs to be adjusted to achieve the 96.6% top-1 accuracy.
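For reference, this is how I read the Table 3 generator notation when setting those values (the interpretation of the WS/ER parameters and the top/mid/bottom labels are my own assumptions; the exact key names are in my desc file):

# My reading of the Table 3 notation (assumption on my part):
#   WS(N, K, P) = Watts-Strogatz generator with N nodes, K nearest neighbours, rewiring probability P
#   ER(N, P)    = Erdos-Renyi generator with N nodes and edge probability P
table3_generators = {
    "top":    {"type": "WS", "nodes": 8, "K": 5, "P": 0.6},
    "mid":    {"type": "ER", "nodes": 1, "P": 0.7},
    "bottom": {"type": "WS", "nodes": 5, "K": 4, "P": 0.2},
}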
Thank you for pointing this out. I'll consult with my colleagues. @runninghack
@runninghack Thanks for your interest in NAGO.
"I'm not sure what else needs to be adjusted to achieve the 96.6% top-1 accuracy."
The discrepancy in the test accuracy is likely due to the lr_scheduler setting, which should be:
- lr_scheduler: CosineAnnealing (target min lr = 0 at the end of 600 epochs); see the sketch below
Two other hyperparameters are set following the DARTS set-up (very likely you have already used these values but I'll list them below just in case):
I think these should address the problem.
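For concreteness, here is a minimal sketch of the intended schedule using the standard PyTorch scheduler (this only illustrates the target behaviour, not the VEGA trainer itself; the SGD values shown are the usual DARTS ones: lr 0.025, momentum 0.9, weight decay 3e-4):

import torch

# Placeholder model, only used to construct an optimizer; the real training is done by the VEGA trainer.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                            momentum=0.9, weight_decay=3e-4)

# Cosine annealing over all 600 epochs, decaying the learning rate to 0 at the end.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=600, eta_min=0.0)

for epoch in range(600):
    # ... run one training epoch here ...
    scheduler.step()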
Thanks for the info. @rubinxin
Using CosineAnnealing does improve the performance. However, I still cannot match the CIFAR-10 results reported in the paper. I've trained several models and got test accuracies between 94.591% and 95.252%. That's not even as good as HNAG-RS, so I assume there's still something wrong with my .yml file.
My fullytrain pipeline configuration is as follows:
fullytrain:
    pipe_step:
        type: FullyTrainPipeStep
        models_folder: #Path with the desc file#
    trainer:
        type: Trainer
        epochs: 600
        optimizer:
            type: SGD
            params:
                lr: 0.025
                momentum: 0.9
                weight_decay: !!float 3e-4
        lr_scheduler:
            type: CosineAnnealingLR
            params:
                T_max: 600.0
                eta_min: 0.000
        grad_clip: 5.0
    dataset:
        type: Cifar10
        common:
            batch_size: 96
            train_portion: 0.9
        train:
            cutout_length: 16
and the desc file is as follows:
{"modules": ["custom"],
"custom": {"type": "NAGO", "stage1_ratio": 0.33, "stage2_ratio": 0.33, "stage3_ratio": 0.33,
"ch1_ratio": 1, "ch2_ratio": 2, "ch3_ratio": 4,
"n_param_limit": "4.0e6", "image_size": 32, "num_classes": 10,
"G3_P": 0.2, "G3_K": 4, "G3_nodes": 5,
"G2_P": 0.7, "G2_nodes": 1,
"G1_P": 0.6, "G1_K": 5, "G1_nodes": 8}}
Can you spot anything different from yours?
Just saw the vega 1.3 update. The fullytrain pipeline is still missing for NAGO, though.
@runninghack Sorry for the late reply. Your configuration looks almost the same as mine except for some very minor differences, which I don't think would lead to such a big performance gap.
After detailed checking, I suspect the main cause of the discrepancy is that in VEGA we fix the seed for the graph generation so that users can resume training (i.e. all the graphs in the same hierarchical level are exactly the same), whereas in the original NAGO we allow the seed to vary (i.e. graphs in the same hierarchical level can differ slightly while still following the same generative distribution). To fix this, you can replace the BasicNode class (lines 155-189) in vega/zeus/networks/pytorch/customs/utils/logical_graph.py with the BasicNode class (lines 157-193) here.
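To illustrate the difference (a minimal sketch, not the actual VEGA/NAGO code; networkx's Watts-Strogatz generator is used here purely as an example):

import networkx as nx

def sample_level_graphs_fixed_seed(n_graphs, nodes=5, k=4, p=0.2, seed=0):
    # VEGA behaviour: the seed is fixed, so every graph at this hierarchical
    # level comes out identical (which is what makes training resumable).
    return [nx.connected_watts_strogatz_graph(nodes, k, p, seed=seed)
            for _ in range(n_graphs)]

def sample_level_graphs_varying_seed(n_graphs, nodes=5, k=4, p=0.2):
    # Original NAGO behaviour: each graph is an independent sample from the same
    # WS(nodes, k, p) distribution, so graphs at the same level can differ slightly.
    return [nx.connected_watts_strogatz_graph(nodes, k, p)
            for _ in range(n_graphs)]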
I've tried rerunning NAGO with that fix and obtained test accuracies of 96.45% to 96.74%. Hope this solves your issue as well.
Could you share the .yml that can replicate the results on CIFAR-10 from the NAGO paper?
Only the nas pipeline is included in the default NAGO example, and several hyperparameters differ from what is mentioned in the paper. I tried to tune the hyperparameters but still cannot get the same results as in the paper.