mit-han-lab / once-for-all

[ICLR 2020] Once for All: Train One Network and Specialize it for Efficient Deployment
https://ofa.mit.edu/
MIT License

What should I do after train_ofa_net #57

Open detectRecog opened 3 years ago

detectRecog commented 3 years ago

I ran train_ofa_net.py and there are three folders under 'exp/': 'kernel2kernel_depth', 'kernel_depth2kernel_depth_width', 'normal2kernel'. What should I do next? After training, each exp subfolder contains 'checkpoint logs net.config net_info.txt run.config'. Does anybody know how I should deal with these?

I cannot find any relation between the training exp results and 'eval_ofa_net.py'. Please help this poor kid. \doge

Bixiii commented 3 years ago

As far as I can tell, the folders are the different stages of the progressive shrinking algorithm; for example, kernel2kernel_depth is the training step from elastic kernel to elastic kernel plus elastic depth. In the checkpoint folder you can find the trained models; model_best.pth.tar should be the final model for that step. When you want to evaluate a model you trained yourself, you have to load it in the eval_ofa_net.py script. For that you can just replace

ofa_network = ofa_net(args.net, pretrained=True)

with something that loads your own network. Maybe something like this would work:

import torch
# import path may differ across repo versions
from ofa.imagenet_classification.elastic_nn.networks import OFAMobileNetV3

ofa_network = OFAMobileNetV3(
    ks_list=[3, 5, 7],
    expand_ratio_list=[3, 4, 6],
    depth_list=[2, 3, 4],
)
init = torch.load(
    'exp/kernel_depth2kernel_depth_width/phase2/checkpoint/model_best.pth.tar',
    map_location='cpu',
)['state_dict']
ofa_network.load_state_dict(init)
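For reference, the folder layout that training produces can be summed up with a small path helper. This is a sketch: `best_checkpoint_path` is a hypothetical function, not part of the repo, and it only assumes the stage/phase naming visible in this thread.

```python
import os

# Hypothetical helper: map a progressive-shrinking stage (and phase, where
# the stage is split into phases) to the best-model checkpoint path that the
# training script writes under exp/.
def best_checkpoint_path(stage, phase=None, root="exp"):
    parts = [root, stage]
    if phase is not None:
        parts.append("phase%d" % phase)
    parts += ["checkpoint", "model_best.pth.tar"]
    return os.path.join(*parts)

print(best_checkpoint_path("normal2kernel"))
print(best_checkpoint_path("kernel_depth2kernel_depth_width", phase=2))
```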
detectRecog commented 3 years ago

You're so kind. Thank you very much for your reply, as I've been waiting every day for someone to save me. Does this mean I should train the different stages sequentially, resuming the best checkpoint of the previous stage? Currently, I train the different stages in parallel, which is why I struggled to find the relation between checkpoints at different stages.

@Bixiii

Jon-drugstore commented 3 years ago

Do you have any ideas about the details of the latency predictor model? How is that network built? Thanks for your reply!

pyjhzwh commented 3 years ago

In my understanding, once-for-all/ofa/nas/efficiency_predictor/latency_lookup_table.py describes how they estimate latency. For ResNet50, they just count FLOPs as a stand-in for latency.
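A toy illustration of the lookup-table idea: each op configuration maps to a pre-measured latency, and a subnet's predicted latency is the sum over its ops. The op keys and millisecond values below are invented for illustration; the real table in latency_lookup_table.py is keyed on the actual op configurations and measured on the target device.

```python
# Invented (op, resolution, channels) -> latency-in-ms entries, for illustration only.
LATENCY_TABLE_MS = {
    ("conv3x3", 224, 16): 0.80,
    ("mbconv_k5_e6", 112, 24): 1.35,
    ("mbconv_k3_e4", 56, 40): 0.95,
    ("fc", 1, 1000): 0.10,
}

def predict_latency(ops):
    """Predict a subnet's latency by summing per-op table entries."""
    return sum(LATENCY_TABLE_MS[op] for op in ops)

net = [("conv3x3", 224, 16), ("mbconv_k5_e6", 112, 24), ("fc", 1, 1000)]
print(round(predict_latency(net), 2))  # 2.25
```

The appeal of the table is that it is built once per device; after that, evaluating a candidate subnet's latency during search is just a dictionary-sum, with no on-device measurement in the loop.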

pyjhzwh commented 3 years ago

> Does this mean I should train the different stages sequentially, resuming the best checkpoint of the previous stage? Currently, I train the different stages in parallel.

I guess so. From task 'kernel' to 'depth', the depth list has more choices; from 'depth' to 'expand', the expand ratio list has more choices. I guess we should run task 'kernel' first, then 'depth', and finally 'expand'.
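If that ordering is right, the stages form a nested schedule: each one resumes from the previous checkpoint and only enlarges the search space. The sketch below checks that nesting. The stage names come from the exp/ folders above and the final lists from the snippet earlier in the thread; the single-element lists are my assumption that earlier stages run only at the maximum setting.

```python
# Assumed per-stage search spaces (final phase of each stage):
# (stage name, kernel-size choices, depth choices, expand-ratio choices)
STAGES = [
    ("normal2kernel",                   [3, 5, 7], [4],       [6]),
    ("kernel2kernel_depth",             [3, 5, 7], [2, 3, 4], [6]),
    ("kernel_depth2kernel_depth_width", [3, 5, 7], [2, 3, 4], [3, 4, 6]),
]

for i in range(1, len(STAGES)):
    _, ks_prev, d_prev, e_prev = STAGES[i - 1]
    _, ks, d, e = STAGES[i]
    # Every choice valid in an earlier stage stays valid later: the space only grows,
    # which is why the stages must run sequentially, not in parallel.
    assert set(ks_prev) <= set(ks) and set(d_prev) <= set(d) and set(e_prev) <= set(e)

print(" -> ".join(name for name, *_ in STAGES))
```

Training the stages in parallel breaks this: each stage expects to start from weights already trained on the previous, smaller space.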