microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Proxyless NAS loss does not change with a Quartznet-based network #4643

Open singagan opened 2 years ago

singagan commented 2 years ago

Describe the issue:

The training loss does not change while using ProxylessNAS with the QuartzNet ASR network. I am using the same hyperparameters as the default ProxylessNAS with the cortexA76cpu_tflite21 device. I can see that NAS is trying different architectures, but the loss does not change much.

[50000/50000]: 100%|########################################| [52:29, latency=416.7913, loss=1.5568]
[50000/50000]: 100%|########################################| [52:21, latency=318.4291, loss=1.5101]
[50000/50000]: 100%|########################################| [52:34, latency=182.5729, loss=1.5076]
[50000/50000]: 100%|########################################| [52:02, latency=182.5729, loss=1.5069]
[50000/50000]: 100%|########################################| [52:21, latency=153.6174, loss=1.5059]
[50000/50000]: 100%|########################################| [52:26, latency=129.1333, loss=1.5054]
[50000/50000]: 100%|########################################| [52:43, latency=141.4677, loss=1.5044]
[50000/50000]: 100%|########################################| [52:08, latency=216.2379, loss=1.5034]
[50000/50000]: 100%|########################################| [52:31, latency=251.3946, loss=1.5039]
[50000/50000]: 100%|########################################| [52:46, latency=360.1063, loss=1.5040]
[50000/50000]: 100%|########################################| [52:02, latency=357.7755, loss=1.5039]
[50000/50000]: 100%|########################################| [51:58, latency=357.8237, loss=1.5035]
[50000/50000]: 100%|########################################| [52:09, latency=361.6095, loss=1.5037]
[50000/50000]: 100%|########################################| [52:11, latency=252.4627, loss=1.5031]
[50000/50000]: 100%|########################################| [52:04, latency=250.8442, loss=1.5028]

I have a couple of questions:

1) Should I not worry much about this loss, and instead retrain the network after the NAS search to check the actual loss, because that is the loss that matters?
2) Could the issue be that the device I am using (cortexA76cpu_tflite21) uses 2D modules (Conv2d, BatchNorm2d, etc.), while QuartzNet uses 1D modules? (See the sketch below.)
3) Should I let NAS search for a couple more hours/days and hope that the NAS loss will decrease?
4) How strict is the effect of the target latency? In my case the NAS latency goes either far below or far above the provided target latency; for example, in the run above I provided a target latency of 1000 ms.
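For question 2, here is a hypothetical illustration of the kind of 1D blocks QuartzNet is built from (the channel count and kernel size are made up), in contrast to the Conv2d/BatchNorm2d operators the cortexA76cpu_tflite21 predictor covers:

```python
import torch.nn as nn

# Hypothetical QuartzNet-style block: depthwise-separable 1D convolutions,
# unlike the 2D operators the latency predictor is trained on.
quartznet_block = nn.Sequential(
    nn.Conv1d(256, 256, kernel_size=33, padding=16, groups=256),  # depthwise 1D conv
    nn.Conv1d(256, 256, kernel_size=1),                           # pointwise 1D conv
    nn.BatchNorm1d(256),
    nn.ReLU(),
)
```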

Environment: Remote Server

NNI version: 2.6
Training service (local|remote|pai|aml|etc): remote
Client OS: Ubuntu 20.04.3 LTS
Server OS (for remote mode only): Ubuntu 20.04.3 LTS
Python version: 3.8.12
PyTorch/TensorFlow version: PyTorch 1.10.0+cu111
Is conda/virtualenv/venv used?: conda
Is running in Docker?: No

Experiment config (remember to remove secrets!): one-shot ProxylessNAS
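Roughly, the setup looks like the following minimal sketch, assuming the hardware-aware ProxylessTrainer API from nni.retiarii.oneshot.pytorch in NNI 2.6; the model, dataset, loss, and metrics callables below are placeholders:

```python
import torch
from nni.retiarii.oneshot.pytorch import ProxylessTrainer

# `model` is the QuartzNet-based search space; `criterion`, `metrics_fn`,
# and `train_dataset` are placeholders for the ASR loss, metrics callable,
# and training data.
trainer = ProxylessTrainer(
    model,
    loss=criterion,
    metrics=metrics_fn,
    optimizer=torch.optim.SGD(model.parameters(), lr=0.05),
    num_epochs=100,
    dataset=train_dataset,
    applied_hardware='cortexA76cpu_tflite21',  # nn-Meter latency predictor
    ref_latency=1000.0,                        # target latency in ms
)
trainer.fit()
```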

singagan commented 2 years ago

I realized that even in the provided ProxylessNAS example, the loss does not change much. So I am confused: how do we know whether we are moving towards an optimal solution? Maybe it needs many more training iterations, but even then, how can we be sure the solution is optimal?

I would appreciate any feedback on this. Thanks! :)

[2022-03-19 20:57:12] INFO (nni.retiarii.oneshot.pytorch.proxyless/MainThread) Epoch [1/100] Step [1/8]  acc1 0.000000 (0.000000)  acc5 0.000000 (0.000000)  loss 6.948665 (6.948665)  latency 64.684225 (64.684225)
[2022-03-19 20:57:17] INFO (nni.retiarii.oneshot.pytorch.proxyless/MainThread) Epoch [2/100] Step [1/8]  acc1 0.000000 (0.000000)  acc5 0.000000 (0.000000)  loss 6.701596 (6.701596)  latency 66.791252 (66.791252)
[2022-03-19 20:57:22] INFO (nni.retiarii.oneshot.pytorch.proxyless/MainThread) Epoch [3/100] Step [1/8]  acc1 0.015625 (0.015625)  acc5 0.015625 (0.015625)  loss 6.517473 (6.517473)  latency 59.435130 (59.435130)
[2022-03-19 20:57:26] INFO (nni.retiarii.oneshot.pytorch.proxyless/MainThread) Epoch [4/100] Step [1/8]  acc1 0.000000 (0.000000)  acc5 0.031250 (0.031250)  loss 6.468095 (6.468095)  latency 58.902744 (58.902744)
[2022-03-19 20:57:31] INFO (nni.retiarii.oneshot.pytorch.proxyless/MainThread) Epoch [5/100] Step [1/8]  acc1 0.000000 (0.000000)  acc5 0.000000 (0.000000)  loss 6.393481 (6.393481)  latency 66.434750 (66.434750)
[2022-03-19 20:57:36] INFO (nni.retiarii.oneshot.pytorch.proxyless/MainThread) Epoch [6/100] Step [1/8]  acc1 0.015625 (0.015625)  acc5 0.046875 (0.046875)  loss 6.403239 (6.403239)  latency 63.404785 (63.404785)
[2022-03-19 20:57:41] INFO (nni.retiarii.oneshot.pytorch.proxyless/MainThread) Epoch [7/100] Step [1/8]  acc1 0.000000 (0.000000)  acc5 0.015625 (0.015625)  loss 6.431452 (6.431452)  latency 65.255388 (65.255388)
[2022-03-19 20:57:46] INFO (nni.retiarii.oneshot.pytorch.proxyless/MainThread) Epoch [8/100] Step [1/8]  acc1 0.000000 (0.000000)  acc5 0.000000 (0.000000)  loss 6.448963 (6.448963)  latency 65.399601 (65.399601)
[2022-03-19 20:57:51] INFO (nni.retiarii.oneshot.pytorch.proxyless/MainThread) Epoch [9/100] Step [1/8]  acc1 0.000000 (0.000000)  acc5 0.015625 (0.015625)  loss 6.417759 (6.417759)  latency 62.633813 (62.633813)
[2022-03-19 20:57:56] INFO (nni.retiarii.oneshot.pytorch.proxyless/MainThread) Epoch [10/100] Step [1/8]  acc1 0.000000 (0.000000)  acc5 0.015625 (0.015625)  loss 6.372709 (6.372709)  latency 59.040452 (59.040452)

singagan commented 2 years ago

Any updates on this? Thanks!

JiahangXu commented 2 years ago

Hi, sorry for the late reply. I think the loss is decreasing in both the QuartzNet run and the ProxylessNAS example. Maybe you could train for more epochs to get a clearer descent. The target latency depends on your actual requirements for deploying the model on the target device. Usually the reference latency should be set smaller than the predicted latency of the searched models, to push the search towards a faster model. If you are not sure which reference latency is suitable in your case, you can provide the FLOPs of QuartzNet to us, and maybe we can give some suggestions. I hope this information is helpful.
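For intuition, here is a rough sketch of how a reference latency typically enters the search objective, in the spirit of the additive latency regularization from the ProxylessNAS paper; the exact form and weighting used by NNI's trainer may differ:

```python
def hardware_aware_loss(ce_loss, expected_latency, ref_latency, reg_lambda=0.1):
    # Additive latency regularization (sketch): if the predicted latency of
    # the sampled architecture exceeds ref_latency, the penalty is positive,
    # so gradients push the architecture weights towards faster candidates.
    return ce_loss + reg_lambda * (expected_latency - ref_latency)
```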

singagan commented 2 years ago

Hi @JiahangXu, thanks for the reply! I tried running QuartzNet with ProxylessNAS for more epochs. Below you can see the behavior after 63 epochs. The loss hovers around 1.5, so I am not sure whether the NAS-selected architecture is optimal. Also, I provided a reference latency of 1000 ms, but after 63 epochs I observe that the architecture latency is way over 1000 ms (please see below).

The FLOPs for the QuartzNet model are 29129304 (calculated using count_flops_params()). I would appreciate any suggestions.
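For reference, this is roughly how I computed it, assuming NNI's count_flops_params counter; the dummy input shape for QuartzNet below is a hypothetical (batch, features, frames) 1D ASR input:

```python
import torch
from nni.compression.pytorch.utils.counter import count_flops_params

# `model` is the QuartzNet network (placeholder); the dummy input shape
# (batch, mel bins, frames) is illustrative only.
dummy_input = torch.randn(1, 64, 256)
flops, params, _ = count_flops_params(model, dummy_input)
print(f'FLOPs: {flops}, Params: {params}')
```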

Epoch 46 [12500/12500]: 100%|#######################| [18:53, latency=5958.9584, loss=1.5003]
Epoch 47 [12500/12500]: 100%|#######################| [18:50, latency=6082.1017, loss=1.4956]
Epoch 48 [12500/12500]: 100%|#######################| [18:33, latency=6082.1017, loss=1.4950]
Epoch 49 [12500/12500]: 100%|#######################| [18:17, latency=6991.5863, loss=1.4979]
Epoch 50 [12500/12500]: 100%|#######################| [18:18, latency=7228.9303, loss=1.4969]
Epoch 51 [12500/12500]: 100%|#######################| [18:07, latency=7073.7954, loss=1.4957]
Epoch 52 [12500/12500]: 100%|#######################| [17:56, latency=7073.7954, loss=1.4993]
Epoch 53 [12500/12500]: 100%|#######################| [17:35, latency=7073.7954, loss=1.4945]
Epoch 54 [12500/12500]: 100%|#######################| [17:17, latency=7073.7954, loss=1.4937]
Epoch 55 [12500/12500]: 100%|#######################| [17:10, latency=7073.7954, loss=1.4983]
Epoch 56 [12500/12500]: 100%|#######################| [17:16, latency=7073.7954, loss=1.4975]
Epoch 57 [12500/12500]: 100%|#######################| [16:55, latency=7073.7954, loss=1.4975]
Epoch 58 [12500/12500]: 100%|#######################| [16:30, latency=7073.7954, loss=1.4970]
Epoch 59 [12500/12500]: 100%|#######################| [16:36, latency=7073.7954, loss=1.4988]
Epoch 60 [12500/12500]: 100%|#######################| [16:00, latency=7073.7954, loss=1.4959]
Epoch 61 [12500/12500]: 100%|#######################| [15:58, latency=7073.7954, loss=1.4954]
Epoch 62 [12500/12500]: 100%|#######################| [16:04, latency=7073.7954, loss=1.4952]
Epoch 63 [12500/12500]: 100%|#######################| [16:14, latency=7073.7954, loss=1.5011]