microsoft / FLAML


Question About tuning PyTorch with FLAML #1255

Status: Closed (closed by lizhuoq 7 months ago)

lizhuoq commented 7 months ago

In some experiments the program issues the warnings below, while in others tuning runs normally. What could be causing this?

2023-11-11 14:24:51,229 WARNING optuna.py:516 -- Trial 8 failed, because the objective function returned nan.
/data/home/scv7343/.conda/envs/timeSeries/lib/python3.8/site-packages/numpy/lib/nanfunctions.py:1559: RuntimeWarning: All-NaN slice encountered
  r, k = function_base._ureduce(a,

+----------------+------------+---------------------+--------------+-----------------+--------+-----------+---------------+-------------+--------------+--------------+--------+------------------+---------------+
| Trial name | status | loc | batch_size | bidirectional | d_ff | dropout | hidden_size | lr | num_epochs | num_layers | iter | total time (s) | loss |
|----------------+------------+---------------------+--------------+-----------------+--------+-----------+---------------+-------------+--------------+--------------+--------+------------------+---------------|
| train_e11c7fae | TERMINATED | 192.168.11.11:80535 | 4 | False | 6 | 0.2 | 6 | 0.00122582 | 1 | 2 | 1 | 48.4646 | 0.00304596 |
| train_e74fb986 | TERMINATED | 192.168.11.11:80699 | 5 | True | 3 | 0.1 | 6 | 0.0185623 | 1.50238 | 7 | 2 | 56.8906 | 0.0137665 |
| train_ec60e724 | TERMINATED | 192.168.11.11:1125 | 6 | False | 2 | 0.1 | 7 | 0.00107878 | 1.09439 | 7 | 1 | 14.7759 | 0.00311929 |
| train_04f53d58 | TERMINATED | 192.168.11.11:1974 | 3 | True | 4 | 0.2 | 6 | 0.00646633 | 1 | 2 | 1 | 84.5842 | nan |
| train_0f1de834 | TERMINATED | 192.168.11.11:2834 | 6 | False | 8 | 0.2 | 6 | 0.000232379 | 1.35825 | 2 | 1 | 11.2428 | 0.00337107 |
| train_140cf60a | TERMINATED | 192.168.11.11:3717 | 3 | False | 8 | 0.1 | 8 | 0.0343065 | 2.59604 | 2 | 3 | 305.777 | nan |
| train_1f314ed2 | TERMINATED | 192.168.11.11:5099 | 5 | False | 4 | 0.3 | 4 | 0.0001 | 1 | 2 | 1 | 21.4655 | 0.00549053 |
| train_4689d9b8 | TERMINATED | 192.168.11.11:6011 | 4 | False | 6 | 0.4 | 8 | 0.000611853 | 1.12409 | 4 | 1 | 77.1362 | 0.0032136 |
| train_57cf9384 | TERMINATED | 192.168.11.11:8158 | 3 | False | 6 | 0.1 | 6 | 0.000150284 | 1 | 2 | 1 | 78.9998 | nan |
| train_8a84b94e | TERMINATED | 192.168.11.11:10349 | 3 | False | 5 | 0.2 | 6 | 0.0764354 | 1.40647 | 5 | 1 | 111.947 | nan |
| train_be5c1e56 | TERMINATED | 192.168.11.11:11479 | 3 | True | 2 | 0.3 | 8 | 0.000100846 | 1 | 5 | 1 | 245.312 | nan |
| train_d9c6ba66 | TERMINATED | 192.168.11.11:13017 | 5 | False | 3 | 0.4 | 2 | 0.0185104 | 1 | 4 | 1 | 32.0445 | 0.00525634 |
| train_05e6f426 | TERMINATED | 192.168.11.11:14052 | 3 | False | 7 | 0.1 | 7 | 0.1 | 1 | 5 | 1 | 134.724 | nan |
| train_1dbd4794 | TERMINATED | 192.168.11.11:16667 | 4 | True | 3 | 0.4 | 5 | 0.0132043 | 2.59677 | 5 | 1 | 49.0496 | 0.00669971 |
| train_7094712c | TERMINATED | 192.168.11.11:16734 | 3 | False | 6 | 0.1 | 4 | 0.1 | 1 | 8 | 1 | 93.9293 | nan |
| train_77053eb0 | TERMINATED | 192.168.11.11:18229 | 3 | False | 4 | 0.3 | 8 | 0.0399277 | 5.7132 | 2 | 6 | 621.57 | nan |
| train_966f55e2 | TERMINATED | 192.168.11.11:19347 | 6 | False | 2 | 0.1 | 4 | 0.027539 | 6.12122 | 6 | 1 | 13.5032 | 0.0139319 |
| train_b4ec4f5c | TERMINATED | 192.168.11.11:20198 | 6 | True | 8 | 0.5 | 5 | 0.000893315 | 4.17414 | 6 | 1 | 15.8931 | 0.00331253 |
| train_bd81b76a | TERMINATED | 192.168.11.11:21106 | 6 | True | 4 | 0.4 | 8 | 0.000118385 | 3.53663 | 5 | 1 | 57.862 | 0.00385862 |
| train_cbd250d6 | TERMINATED | 192.168.11.11:23011 | 4 | True | 5 | 0.4 | 5 | 0.0473465 | 6.20703 | 2 | 1 | 45.3076 | 0.00752863 |
| train_f32476dc | TERMINATED | 192.168.11.11:24188 | 4 | False | 3 | 0.3 | 7 | 0.0510762 | 2.87959 | 7 | 1 | 55.7606 | 0.0142039 |
| train_16a8fcae | TERMINATED | 192.168.11.11:26048 | 3 | True | 7 | 0.1 | 5 | 0.1 | 1 | 3 | 1 | 92.0403 | nan |
| train_384e1fec | TERMINATED | 192.168.11.11:28574 | 3 | False | 7 | 0.3 | 6 | 0.1 | 11.4131 | 3 | 11 | 895.746 | nan |
| train_7861c976 | TERMINATED | 192.168.11.11:34686 | 4 | False | 2 | 0.2 | 8 | 0.022485 | 14.7269 | 2 | 1 | 50.4969 | 0.00738961 |
| train_0db0e0c0 | TERMINATED | 192.168.11.11:36301 | 3 | True | 5 | 0.3 | 4 | 0.0187007 | 1.46299 | 6 | 1 | 99.0067 | nan |
| train_30af1646 | TERMINATED | 192.168.11.11:38993 | 4 | True | 5 | 0.1 | 7 | 0.000156574 | 1.09865 | 2 | 1 | 41.5938 | 0.00306336 |
| train_714334c6 | TERMINATED | 192.168.11.11:40432 | 3 | False | 5 | 0.3 | 3 | 0.00637267 | 3.77981 | 5 | 4 | 317.872 | nan |
| train_8ae279c8 | TERMINATED | 192.168.11.11:48678 | 4 | True | 5 | 0.4 | 6 | 0.0535782 | 11.7744 | 8 | 1 | 56.2163 | 0.0140399 |
| train_5151f08e | TERMINATED | 192.168.11.11:50518 | 4 | False | 3 | 0.1 | 6 | 0.0519958 | 1 | 7 | 1 | 43.9592 | 0.0156737 |
| train_735d53da | TERMINATED | 192.168.11.11:51639 | 3 | True | 7 | 0.1 | 7 | 0.000137333 | 1.25504 | 3 | 1 | 104.057 | nan |
| train_8ecaaadc | TERMINATED | 192.168.11.11:52318 | 3 | False | 4 | 0.2 | 7 | 0.0013103 | 2.70949 | 8 | 3 | 298.224 | nan |
| train_939e9faa | TERMINATED | 192.168.11.11:54377 | 3 | False | 6 | 0.2 | 5 | 0.1 | 1 | 2 | 1 | 91.7292 | nan |
| train_d18d8c04 | TERMINATED | 192.168.11.11:56984 | 3 | False | 6 | 0.2 | 6 | 0.0588597 | 17.8717 | 8 | 18 | 1720.84 | nan |
| train_0d62bb50 | TERMINATED | 192.168.11.11:58797 | 3 | False | 4 | 0.2 | 6 | 0.0992592 | 1 | 2 | 1 | 79.1478 | nan |
| train_49f33d88 | TERMINATED | 192.168.11.11:61011 | 3 | False | 5 | 0.1 | 5 | 0.1 | 1 | 4 | 1 | 84.6235 | nan |
| train_7e66cc74 | TERMINATED | 192.168.11.11:63493 | 3 | False | 5 | 0.5 | 7 | 0.0168932 | 2.69856 | 6 | 3 | 282.819 | nan |
| train_b5decf76 | TERMINATED | 192.168.11.11:69952 | 3 | False | 8 | 0.1 | 7 | 0.1 | 5.11927 | 3 | 5 | 409.286 | nan |
| train_63553442 | TERMINATED | 192.168.11.11:598 | 4 | False | 2 | 0.3 | 5 | 0.0383164 | 1 | 7 | 1 | 46.3223 | 0.0140411 |
| train_5c804426 | TERMINATED | 192.168.11.11:2158 | 4 | False | 6 | 0.4 | 2 | 0.000119125 | 5.02148 | 8 | 1 | 47.0986 | 0.00863629 |
| train_7caf1916 | TERMINATED | 192.168.11.11:3930 | 5 | True | 5 | 0.5 | 2 | 0.0239483 | 15.0096 | 8 | 1 | 29.3272 | 0.014375 |
| train_9d7797ea | TERMINATED | 192.168.11.11:5225 | 4 | False | 8 | 0.5 | 2 | 0.000177693 | 1 | 2 | 1 | 40.6858 | 0.00613946 |
| train_b8002d7a | TERMINATED | 192.168.11.11:6658 | 5 | True | 3 | 0.3 | 2 | 0.00661055 | 1.15587 | 8 | 1 | 29.4151 | 0.00511108 |
| train_d0bec0ce | TERMINATED | 192.168.11.11:7639 | 5 | True | 4 | 0.3 | 2 | 0.00784692 | 1.10952 | 7 | 1 | 28.0803 | 0.00395865 |
| train_eab1bf90 | TERMINATED | 192.168.11.11:9225 | 5 | True | 4 | 0.3 | 4 | 0.00065422 | 27.6525 | 7 | 28 | 770.439 | 0.000958763 |
| train_fc478c80 | TERMINATED | 192.168.11.11:26377 | 4 | True | 2 | 0.3 | 6 | 0.000344019 | 4.2654 | 5 | 1 | 53.3536 | 0.0139351 |
| train_13d61082 | TERMINATED | 192.168.11.11:29144 | 3 | True | 4 | 0.5 | 4 | 0.000750929 | 19.9027 | 7 | 20 | 2126.84 | nan |
| train_38c8bc82 | TERMINATED | 192.168.11.11:38436 | 4 | False | 6 | 0.1 | 6 | 0.000348246 | 23.5836 | 6 | 1 | 46.7607 | 0.00355066 |
| train_cc06b468 | TERMINATED | 192.168.11.11:39644 | 4 | False | 6 | 0.1 | 3 | 0.000285203 | 23.9888 | 4 | 1 | 42.4308 | 0.0041376 |
| train_ecd0eda8 | TERMINATED | 192.168.11.11:41342 | 4 | True | 6 | 0.1 | 3 | 0.000296803 | 2.07096 | 4 | 1 | 49.6519 | 0.00405091 |
| train_0a9cc3e8 | TERMINATED | 192.168.11.11:42640 | 5 | True | 3 | 0.3 | 4 | 0.00241029 | 2.5741 | 4 | 1 | 25.3753 | 0.00357636 |
| train_2e6243f2 | TERMINATED | 192.168.11.11:44154 | 3 | False | 8 | 0.3 | 5 | 0.0165341 | 2.12687 | 7 | 2 | 183.766 | nan |
| train_420eecde | TERMINATED | 192.168.11.11:48320 | 5 | True | 4 | 0.3 | 5 | 0.00223664 | 22.3936 | 7 | 22 | 608.146 | 0.000652129 |
| train_b4807b84 | TERMINATED | 192.168.11.11:70454 | 3 | False | 2 | 0.1 | 7 | 0.1 | 1 | 3 | 1 | 81.8684 | nan |
| train_23be34ae | TERMINATED | 192.168.11.11:72636 | 3 | True | 4 | 0.2 | 2 | 0.00129991 | 3.22774 | 3 | 3 | 257.53 | nan |
| train_5d9b06f2 | TERMINATED | 192.168.11.11:80878 | 3 | False | 3 | 0.1 | 5 | 0.0513343 | 1.14958 | 8 | 1 | 96.0947 | nan |
| train_f8039826 | TERMINATED | 192.168.11.11:1782 | 3 | False | 7 | 0.3 | 7 | 0.1 | 1.72077 | 2 | 2 | 157.375 | nan |
| train_3a34a686 | TERMINATED | 192.168.11.11:7818 | 3 | False | 8 | 0.2 | 6 | 0.0213305 | 14.0408 | 6 | 14 | 1234.27 | nan |
| train_98e66bba | TERMINATED | 192.168.11.11:16774 | 4 | False | 2 | 0.2 | 6 | 0.1 | 1 | 4 | 1 | 43.9384 | 0.0169516 |
| train_317f89f6 | TERMINATED | 192.168.11.11:19154 | 4 | False | 3 | 0.5 | 6 | 0.00284206 | 2.54113 | 7 | 1 | 46.5408 | 0.0118879 |
| train_51bfcffa | TERMINATED | 192.168.11.11:21248 | 5 | False | 5 | 0.2 | 7 | 0.0937509 | 1 | 3 | 1 | 22.4211 | 0.0154518 |
| train_72786a0e | TERMINATED | 192.168.11.11:22816 | 6 | True | 4 | 0.3 | 4 | 0.00143587 | 22.9981 | 6 | 4 | 57.359 | 0.00214795 |
| train_84f48d34 | TERMINATED | 192.168.11.11:27107 | 5 | True | 4 | 0.3 | 4 | 0.00133719 | 22.0468 | 6 | 4 | 107.115 | 0.00215503 |
| train_aff5bcd8 | TERMINATED | 192.168.11.11:32559 | 6 | True | 3 | 0.3 | 4 | 0.00157787 | 22.6956 | 6 | 4 | 57.675 | 0.00233549 |
| train_f040f3de | TERMINATED | 192.168.11.11:36853 | 3 | False | 5 | 0.2 | 5 | 0.062318 | 6.97867 | 7 | 7 | 614.272 | nan |
| train_179557c2 | TERMINATED | 192.168.11.11:52844 | 6 | False | 8 | 0.2 | 7 | 0.00201595 | 1 | 5 | 1 | 13.3189 | 0.00344409 |
| train_7d40001c | TERMINATED | 192.168.11.11:53924 | 4 | False | 7 | 0.1 | 4 | 0.0114596 | 1 | 6 | | | |
| train_89c32f94 | TERMINATED | 192.168.11.11:54332 | 3 | False | 3 | 0.4 | 8 | 0.1 | 3.07152 | 4 | | | |
| train_928e2908 | TERMINATED | | 3 | False | 3 | 0.1 | 6 | 0.00179522 | 2.01488 | 5 | | | |
+----------------+------------+---------------------+--------------+-----------------+--------+-----------+---------------+-------------+--------------+--------------+--------+------------------+---------------+

sonichi commented 7 months ago

Check your training function and work out why it sometimes returns nan. This is not an issue with the tuning but with the training and evaluation function.
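As a starting point for that check, here is a minimal sketch of a guard that fails loudly at the first non-finite loss instead of letting nan flow back to the tuner. The name `mean_finite_loss` and the idea of collecting per-batch losses in a list are assumptions for illustration; they are not part of FLAML or of the training code in this issue.

```python
import math


def mean_finite_loss(batch_losses):
    """Average per-batch losses, raising on the first non-finite value.

    `batch_losses` is a hypothetical list of Python floats, e.g. collected
    with `loss.item()` inside the evaluation loop.
    """
    if not batch_losses:
        # An empty list would otherwise produce a division by zero or,
        # with some reductions, a silent nan.
        raise ValueError("no batches were evaluated; check the data loader")
    for step, loss in enumerate(batch_losses):
        if not math.isfinite(loss):
            # Report where the problem first appears so it can be traced.
            raise ValueError(f"non-finite loss {loss!r} at batch {step}")
    return sum(batch_losses) / len(batch_losses)
```

Calling this on the collected losses just before reporting the metric turns a silent nan into an error that names the batch where it first appeared. In PyTorch itself, `torch.isfinite` can perform the same check on tensors, and `torch.autograd.set_detect_anomaly(True)` may help locate the operation that produced the nan during the backward pass.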

lizhuoq commented 7 months ago

Thank you for your response. I will look for the issues in the logic of my code.
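One way to narrow the search is to tally nan results by hyperparameter value. In the table posted above, every trial that reports a nan loss has batch_size 3, and no finished trial with batch_size 4, 5, or 6 does, so the code path taken for that value looks like a reasonable place to start. The sketch below uses six rows copied from the table; the helper name `nan_counts_by` is hypothetical, and what batch_size 3 means inside the training function is not known from this thread.

```python
import math
from collections import defaultdict

# (batch_size, lr, loss) for the first six finished trials in the table above
trials = [
    (4, 0.00122582, 0.00304596),
    (5, 0.0185623, 0.0137665),
    (6, 0.00107878, 0.00311929),
    (3, 0.00646633, float("nan")),
    (3, 0.0343065, float("nan")),
    (3, 0.000150284, float("nan")),
]


def nan_counts_by(rows, key_index):
    """Map each value in column `key_index` to (nan_count, total_count)."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        entry = counts[row[key_index]]
        entry[0] += int(math.isnan(row[-1]))  # loss is the last field
        entry[1] += 1
    return {value: tuple(pair) for value, pair in sorted(counts.items())}


print(nan_counts_by(trials, 0))
# {3: (3, 3), 4: (0, 1), 5: (0, 1), 6: (0, 1)}
```

Running the same tally over the other columns (for example `lr`, with `key_index=1`) shows whether they separate the failing trials as cleanly as batch_size does.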