rwightman / efficientdet-pytorch

A PyTorch impl of EfficientDet faithful to the original Google impl w/ ported weights
Apache License 2.0

nan in loss #113

Closed MichaelMonashev closed 3 years ago

MichaelMonashev commented 3 years ago

I am training EfficientDet and getting nan in the loss after some epochs. I restart training, train a few more epochs, and get nan in the loss again. Now I consistently get nan in the loss at epoch 86 and cannot get past it.

I am using gradient clipping:

torch.nn.utils.clip_grad_norm_(model.parameters(), 0.001)
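For reference, this is roughly where the clipping call sits in my training step (a minimal sketch of a standard PyTorch loop with placeholder names, not the actual training code):

```python
import torch

# Minimal sketch of one training step with gradient clipping.
# `model`, `optimizer`, `criterion`, and `loader` are placeholders.
for images, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    # Clip after backward(), before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.001)
    optimizer.step()
```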

My train log:

Epoch: 86 rank: 1 [278/528]  Time  787 ( 842)  Data  0 ( 0)  Loss 2.27e-01 (1.83e-01)  Class Loss 1.57e-01 (1.35e-01)  Bbox Loss 1.39e-03 (9.68e-04)
Epoch: 86 rank: 1 [279/528]  Time  796 ( 842)  Data  0 ( 0)  Loss 1.72e-01 (1.83e-01)  Class Loss 1.26e-01 (1.35e-01)  Bbox Loss 9.21e-04 (9.68e-04)
Epoch: 86 rank: 0 [279/528]  Time  799 ( 830)  Data  0 (11)  Loss 1.76e-01 (1.84e-01)  Class Loss 1.34e-01 (1.35e-01)  Bbox Loss 8.39e-04 (9.71e-04)
Epoch: 86 rank: 0 [280/528]  Time  809 ( 830)  Data  0 (11)  Loss 1.67e-01 (1.84e-01)  Class Loss 1.26e-01 (1.35e-01)  Bbox Loss 8.14e-04 (9.71e-04)
Epoch: 86 rank: 1 [280/528]  Time  812 ( 842)  Data  0 ( 0)  Loss 1.96e-01 (1.83e-01)  Class Loss 1.40e-01 (1.35e-01)  Bbox Loss 1.12e-03 (9.68e-04)
Epoch: 86 rank: 0 [281/528]  Time  796 ( 830)  Data  0 (11)  Loss 1.75e-01 (1.84e-01)  Class Loss 1.31e-01 (1.35e-01)  Bbox Loss 8.84e-04 (9.70e-04)
Epoch: 86 rank: 1 [281/528]  Time  859 ( 842)  Data  0 ( 0)  Loss 2.06e-01 (1.83e-01)  Class Loss 1.48e-01 (1.35e-01)  Bbox Loss 1.16e-03 (9.69e-04)
Epoch: 86 rank: 0 [282/528]  Time  950 ( 830)  Data  0 (11)  Loss 1.83e-01 (1.84e-01)  Class Loss 1.38e-01 (1.35e-01)  Bbox Loss 9.04e-04 (9.70e-04)
Epoch: 86 rank: 1 [282/528]  Time  891 ( 842)  Data  0 ( 0)  Loss 1.35e-01 (1.83e-01)  Class Loss 1.09e-01 (1.35e-01)  Bbox Loss 5.14e-04 (9.68e-04)
Epoch: 86 rank: 0 [283/528]  Time  864 ( 831)  Data  0 (11)  Loss 1.97e-01 (1.84e-01)  Class Loss 1.46e-01 (1.35e-01)  Bbox Loss 1.03e-03 (9.70e-04)
Epoch: 86 rank: 1 [283/528]  Time  869 ( 842)  Data  0 ( 0)  Loss 1.84e-01 (1.83e-01)  Class Loss 1.37e-01 (1.35e-01)  Bbox Loss 9.45e-04 (9.67e-04)
Epoch: 86 rank: 0 [284/528]  Time  883 ( 831)  Data  0 (11)  Loss 2.27e-01 (1.84e-01)  Class Loss 1.54e-01 (1.36e-01)  Bbox Loss 1.44e-03 (9.72e-04)
Epoch: 86 rank: 1 [284/528]  Time  899 ( 842)  Data  0 ( 0)  Loss 1.78e-01 (1.83e-01)  Class Loss 1.40e-01 (1.35e-01)  Bbox Loss 7.55e-04 (9.67e-04)
Epoch: 86 rank: 1 [285/528]  Time  672 ( 841)  Data  0 ( 0)  Loss 1.80e-01 (1.83e-01)  Class Loss 1.25e-01 (1.35e-01)  Bbox Loss 1.10e-03 (9.67e-04)
Epoch: 86 rank: 0 [285/528]  Time  750 ( 830)  Data  0 (11)  Loss 2.65e-01 (1.84e-01)  Class Loss 1.81e-01 (1.36e-01)  Bbox Loss 1.69e-03 (9.75e-04)
Epoch: 86 rank: 0 [286/528]  Time  738 ( 830)  Data  0 (11)  Loss 2.02e-01 (1.84e-01)  Class Loss 1.41e-01 (1.36e-01)  Bbox Loss 1.23e-03 (9.75e-04)
Epoch: 86 rank: 1 [286/528]  Time  791 ( 841)  Data  0 ( 0)  Loss 1.88e-01 (1.83e-01)  Class Loss 1.37e-01 (1.35e-01)  Bbox Loss 1.03e-03 (9.67e-04)
Epoch: 86 rank: 1 [287/528]  Time  634 ( 841)  Data  0 ( 0)  Loss 1.84e-01 (1.83e-01)  Class Loss 1.36e-01 (1.35e-01)  Bbox Loss 9.58e-04 (9.67e-04)
Epoch: 86 rank: 0 [287/528]  Time  634 ( 829)  Data  0 (11)  Loss 1.78e-01 (1.84e-01)  Class Loss 1.38e-01 (1.36e-01)  Bbox Loss 7.89e-04 (9.75e-04)
Epoch: 86 rank: 0 [288/528]  Time  830 ( 829)  Data  0 (11)  Loss 1.84e-01 (1.84e-01)  Class Loss 1.39e-01 (1.36e-01)  Bbox Loss 9.04e-04 (9.75e-04)
Epoch: 86 rank: 1 [288/528]  Time  831 ( 841)  Data  0 ( 0)  Loss 1.80e-01 (1.83e-01)  Class Loss 1.33e-01 (1.35e-01)  Bbox Loss 9.52e-04 (9.67e-04)
Epoch: 86 rank: 1 [289/528]  Time  663 ( 840)  Data  0 ( 0)  Loss nan (nan)  Class Loss nan (nan)  Bbox Loss nan (nan)
Epoch: 86 rank: 0 [289/528]  Time  667 ( 829)  Data  0 (10)  Loss nan (nan)  Class Loss nan (nan)  Bbox Loss nan (nan)
Epoch: 86 rank: 1 [290/528]  Time  700 ( 839)  Data  0 ( 0)  Loss nan (nan)  Class Loss nan (nan)  Bbox Loss nan (nan)
Epoch: 86 rank: 0 [290/528]  Time  698 ( 828)  Data  0 (10)  Loss nan (nan)  Class Loss nan (nan)  Bbox Loss nan (nan)
Epoch: 86 rank: 1 [291/528]  Time  660 ( 839)  Data  0 ( 0)  Loss nan (nan)  Class Loss nan (nan)  Bbox Loss nan (nan)
Epoch: 86 rank: 0 [291/528]  Time  663 ( 828)  Data  0 (10)  Loss nan (nan)  Class Loss nan (nan)  Bbox Loss nan (nan)
Epoch: 86 rank: 1 [292/528]  Time  671 ( 838)  Data  0 ( 0)  Loss nan (nan)  Class Loss nan (nan)  Bbox Loss nan (nan)
Epoch: 86 rank: 0 [292/528]  Time  686 ( 827)  Data  1 (10)  Loss nan (nan)  Class Loss nan (nan)  Bbox Loss nan (nan)

My summary log:

Epoch   BS  LR  accumulation_steps  train_loss  train_cls_loss      train_bbox_loss     val1_loss       val1_cls_loss       val1_bbox_loss      val2_loss       val2_cls_loss       val2_bbox_loss
Start training from scratch.
Set LR to 0.0003
0   30  0.0003          1   0.7391620874404907  0.5148860216140747  0.004485520999878645    0.6867721676826477  0.40660524368286133 0.0056033385917544365   1.3056379556655884  0.7089716196060181  0.011933325789868832
1   30  0.0003          1   0.5481282472610474  0.3798622488975525  0.0033653199207037687   0.6426258087158203  0.3766266703605652  0.0053199827671051025   1.3292980194091797  0.7338289022445679  0.011909381486475468
2   30  0.0003          1   0.5082456469535828  0.34988415241241455 0.0031672297045588493   0.6246873140335083  0.3702491521835327  0.005088763311505318    1.3154892921447754  0.7328091859817505  0.011653603054583073
3   30  0.0003          1   0.48170071840286255 0.33192750811576843 0.0029954644851386547   0.6350262761116028  0.37599408626556396 0.005180643871426582    1.337411642074585   0.7505775094032288  0.011736682616174221
4   30  0.0003          1   0.4610692262649536  0.31623196601867676 0.002896744990721345    0.6187378764152527  0.36493563652038574 0.005076044239103794    1.3538662195205688  0.7712782621383667  0.011651759967207909
5   30  0.0003          1   0.4470648169517517  0.3066878318786621  0.0028075396548956633   0.6098109483718872  0.3619186580181122  0.004957845434546471    1.3459428548812866  0.7724031805992126  0.011470792815089226
6   30  0.0003          1   0.4333857297897339  0.2963952124118805  0.0027398099191486835   0.6057169437408447  0.3597293496131897  0.004919751547276974    1.3794480562210083  0.810006856918335   0.011388825252652168
7   30  0.0003          1   0.427709698677063   0.29330408573150635 0.0026881122030317783   0.5969955921173096  0.3580610156059265  0.004778692498803139    1.3977634906768799  0.8209059238433838  0.011537151411175728
8   30  0.0003          1   0.41837823390960693 0.2860669493675232  0.00264622550457716 0.5881363749504089  0.3458014130592346  0.00484669953584671 1.3651037216186523  0.7924641370773315  0.011452790349721909
9   30  0.0003          1   0.4098720848560333  0.2819724380970001  0.0025579931680113077   0.587675154209137   0.3519308269023895  0.004714886657893658    1.417775273323059   0.8283770084381104  0.011787965893745422
10  30  0.0003          1   0.40982553362846375 0.280843585729599   0.002579638734459877    0.5595372319221497  0.3379554748535156  0.004431635141372681    1.366779088973999   0.7907446622848511  0.011520689353346825
11  30  0.0003          1   0.3992260694503784  0.27286234498023987 0.0025272746570408344   0.6061335802078247  0.3549347519874573  0.005023977253586054    1.4302551746368408  0.8462672233581543  0.011679759249091148
12  30  0.0003          1   0.38221779465675354 0.26070600748062134 0.002430235967040062    0.5497575998306274  0.32757341861724854 0.0044436827301979065   1.414415717124939   0.8409035205841064  0.011470243334770203
13  30  0.0003          1   0.37977200746536255 0.2584407925605774  0.002426624298095703    0.5415347218513489  0.32138586044311523 0.004402977414429188    1.4064550399780273  0.824271559715271   0.01164366863667965
14  30  0.0003          1   0.3757656216621399  0.25686323642730713 0.0023780479095876217   0.5533170700073242  0.3311866521835327  0.0044426084496080875   1.4327385425567627  0.8615032434463501  0.011424705386161804
Enable fine tuning.
Reset optimizer
Set LR to 0.0001
15  30  0.0001          1   0.4795773923397064  0.35832372307777405 0.0024250734131783247   0.5358351469039917  0.31826120615005493 0.004351478070020676    1.413381814956665   0.8462408781051636  0.011342821642756462
16  30  0.0001          1   0.6299121379852295  0.5210695266723633  0.0021768524311482906   0.47487348318099976 0.28786981105804443 0.0037400731816887856   1.3233580589294434  0.7700432538986206  0.011066293343901634
17  30  0.0001          1   0.6144391894340515  0.512069582939148   0.002047391841188073    0.4769752621650696  0.2955908477306366  0.0036276881583034992   1.3691893815994263  0.8130284547805786  0.011123216710984707
18  30  0.0001          1   0.4989803731441498  0.3984139561653137  0.0020113280043005943   0.4449610114097595  0.27712327241897583 0.003356754779815674    1.3166066408157349  0.7782672643661499  0.010766789317131042
19  30  0.0001          1   0.4655000865459442  0.3714534044265747  0.0018809335306286812   0.4717380404472351  0.2989097237586975  0.0034565662499517202   1.3422398567199707  0.7970119714736938  0.010904558934271336
Enable ReduceLROnPlateau sheduller.
20  30  0.0001          1   0.5363218784332275  0.4453100264072418  0.0018202371429651976   0.4626104235649109  0.2818910479545593  0.003614387707784772    1.4849011898040771  0.9170272350311279  0.01135747879743576
21  30  0.0001          1   0.281849205493927   0.1968272179365158  0.0017004397232085466   0.44213855266571045 0.27152156829833984 0.003412339836359024    1.3693543672561646  0.8299729824066162  0.010787628591060638
22  30  0.0001          1   0.2729610800743103  0.19033683836460114 0.0016524847596883774   0.4377482831478119  0.28067004680633545 0.003141565015539527    1.4153014421463013  0.8681168556213379  0.010943692177534103
23  30  0.0001          1   0.2624497413635254  0.18455162644386292 0.0015579622704535723   0.41277968883514404 0.2550632953643799  0.0031543271616101265   1.3907933235168457  0.836778998374939   0.011080287396907806
24  30  0.0001          1   0.25629687309265137 0.18024496734142303 0.0015210384735837579   0.4168449938297272  0.2559490203857422  0.0032179192639887333   1.4213714599609375  0.8677674531936646  0.011072080582380295
25  30  0.0001          1   0.25329867005348206 0.17836222052574158 0.0014987289905548096   0.4288163185119629  0.2658200263977051  0.003259925404563546    1.44132399559021    0.8860330581665039  0.011105818673968315
26  30  0.0001          1   0.24846401810646057 0.17477181553840637 0.0014738438185304403   0.4071950912475586  0.25658494234085083 0.0030122026801109314   1.4633606672286987  0.9079029560089111  0.011109152808785439
27  30  0.0001          1   0.2438085675239563  0.1722399890422821  0.001431371783837676    0.39068761467933655 0.24576705694198608 0.0028984113596379757   1.3962593078613281  0.8582543134689331  0.010760098695755005
28  30  0.0001          1   0.2439900040626526  0.17272409796714783 0.0014253179542720318   0.42904555797576904 0.2644585371017456  0.003291740547865629    1.4444527626037598  0.8932746648788452  0.011023562401533127
29  30  0.0001          1   0.2376411259174347  0.1679864227771759  0.0013930939603596926   0.40612727403640747 0.2632361650466919  0.002857822459191084    1.486081600189209   0.9195441007614136  0.011330748908221722
30  30  0.0001          1   0.2326730489730835  0.16453701257705688 0.0013627205044031143   0.38882362842559814 0.25098806619644165 0.0027567106299102306   1.4735137224197388  0.9105508327484131  0.011259258724749088
31  30  0.0001          1   0.23122538626194    0.16358232498168945 0.0013528612907975912   0.3767774701118469  0.24599818885326385 0.002615586156025529    1.4766323566436768  0.9195257425308228  0.011142131872475147
32  30  0.0001          1   0.22780099511146545 0.16204826533794403 0.0013150547165423632   0.401556134223938   0.2592620253562927  0.0028458822052925825   1.4210596084594727  0.8808301091194153  0.010804589837789536
33  30  0.0001          1   0.2252432107925415  0.16016073524951935 0.001301649259403348    0.44357579946517944 0.29039621353149414 0.0030635911971330643   1.482741355895996   0.9320613145828247  0.01101360097527504
34  30  0.0001          1   0.22264881432056427 0.15802402794361115 0.0012924957554787397   0.4255937933921814  0.2718261182308197  0.0030753538012504578   1.4795265197753906  0.927398681640625   0.011042557656764984
Enable EarlyStopping.
35  30  0.0001          1   0.22029781341552734 0.156030535697937   0.0012853458756580949   0.38960736989974976 0.24488675594329834 0.0028944117948412895   1.4586443901062012  0.9062024354934692  0.011048838496208191
36  30  9e-05           1   0.2191343605518341  0.1552455723285675  0.0012777757365256548   0.4114009439945221  0.26044589281082153 0.003019100520759821    1.4560773372650146  0.9133015871047974  0.010855517350137234
37  30  9e-05           1   0.21600446105003357 0.15394729375839233 0.0012411430943757296   0.4026799201965332  0.25782203674316406 0.0028971582651138306   1.4832677841186523  0.9204367995262146  0.011256620287895203
38  30  9e-05           1   0.2121611088514328  0.1509454995393753  0.0012243122328072786   0.4350014328956604  0.2779144048690796  0.0031417403370141983   1.5302469730377197  0.9682525992393494  0.011239887215197086
39  30  9e-05           1   0.20858639478683472 0.14899098873138428 0.0011919080279767513   0.41364163160324097 0.2610623836517334  0.003051585052162409    1.509250521659851   0.9475674629211426  0.011233661323785782
40  30  8.1e-05         1   0.20977042615413666 0.14929735660552979 0.0012094615958631039   0.4178823232650757  0.27387523651123047 0.002880142070353031    1.524194359779358   0.964811384677887   0.011187659576535225
41  30  8.1e-05         1   0.20953381061553955 0.1493002027273178  0.0012046722695231438   0.3944738507270813  0.2625211477279663  0.0026390543207526207   1.54041326045999    0.968309760093689   0.011442070826888084
42  30  8.1e-05         1   0.20757514238357544 0.14786458015441895 0.001194211421534419    0.42765507102012634 0.2829420566558838  0.002894259989261627    1.5071539878845215  0.951295018196106   0.011117178946733475
43  30  8.1e-05         1   0.20645642280578613 0.146955668926239   0.0011900151148438454   0.3827890157699585  0.24697105586528778 0.002716359216719866    1.5232229232788086  0.9606152176856995  0.011252151802182198
44  30  7.290000000000001e-05   1   0.20311015844345093 0.14486056566238403 0.0011649918742477894   0.39787060022354126 0.26592594385147095 0.0026388929691165686   1.547430157661438   0.9760066270828247  0.011428470723330975
45  30  7.290000000000001e-05   1   0.20135295391082764 0.1439100056886673  0.001148858806118369    0.3949892520904541  0.25512826442718506 0.0027972201351076365   1.5322182178497314  0.9669292569160461  0.011305777356028557
46  30  7.290000000000001e-05   1   0.20047995448112488 0.14363710582256317 0.001136857084929943    0.4053972363471985  0.2722339630126953  0.002663265448063612    1.5608159303665161  0.9872834086418152  0.011470651254057884
47  30  7.290000000000001e-05   1   0.19853650033473969 0.14200717210769653 0.0011305863736197352   0.37495923042297363 0.24560698866844177 0.0025870450772345066   1.5277628898620605  0.9650017023086548  0.011255222372710705
48  30  7.290000000000001e-05   1   0.19901308417320251 0.14273270964622498 0.0011256079887971282   0.38796257972717285 0.2609395384788513  0.0025404610205441713   1.5417020320892334  0.9759204983711243  0.011315631680190563
49  30  7.290000000000001e-05   1   0.1996999830007553  0.14253947138786316 0.0011432101018726826   0.3825112283229828  0.2523241639137268  0.0026037413626909256   1.5340099334716797  0.9751171469688416  0.011177854612469673
50  30  7.290000000000001e-05   1   0.19626972079277039 0.14043602347373962 0.0011166739277541637   0.3652791976928711  0.23652270436286926 0.0025751302018761635   1.5326642990112305  0.9633362293243408  0.011386560276150703
51  30  7.290000000000001e-05   1   0.19349460303783417 0.1387653946876526  0.0010945843532681465   0.39204931259155273 0.2554321885108948  0.0027323425747454166   1.5207853317260742  0.9554829001426697  0.01130604837089777
52  30  7.290000000000001e-05   1   0.19359028339385986 0.13905298709869385 0.0010907461401075125   0.371550589799881   0.24576491117477417 0.002515713684260845    1.5488594770431519  0.9790981411933899  0.011395227164030075
53  30  7.290000000000001e-05   1   0.19345125555992126 0.13833802938461304 0.0011022647377103567   0.3526538014411926  0.23115751147270203 0.002429925836622715    1.5580675601959229  0.995975911617279   0.011241831816732883
54  30  7.290000000000001e-05   1   0.19176039099693298 0.13732603192329407 0.0010886871023103595   0.3701578974723816  0.24271336197853088 0.002548890421167016    1.5414327383041382  0.9767463207244873  0.011293727904558182
55  30  7.290000000000001e-05   1   0.19322702288627625 0.13855600357055664 0.0010934207821264863   0.368161678314209   0.24193622171878815 0.0025245090946555138   1.5817300081253052  1.0103241205215454  0.01142811682075262
56  30  7.290000000000001e-05   1   0.19141548871994019 0.13745471835136414 0.0010792152024805546   0.3637664318084717  0.23634198307991028 0.0025484892539680004   1.5517998933792114  0.988541305065155   0.011265169829130173
57  30  7.290000000000001e-05   1   0.1904306262731552  0.13646692037582397 0.001079274108633399    0.3852692246437073  0.25353381037712097 0.0026347083039581776   1.573124647140503   0.9967824220657349  0.011526843532919884
58  30  6.561000000000002e-05   1   0.1871820092201233  0.13509798049926758 0.0010416810400784016   0.37520819902420044 0.2474292665719986  0.0025555784814059734   1.5476198196411133  0.9838130474090576  0.011276135221123695
59  30  6.561000000000002e-05   1   0.1847800463438034  0.13315635919570923 0.0010324737522751093   0.39045917987823486 0.26048392057418823 0.002599505241960287    1.5732523202896118  1.0057109594345093  0.011350827291607857
60  30  6.561000000000002e-05   1   0.18685294687747955 0.13430733978748322 0.0010509120766073465   0.39921700954437256 0.2658982276916504  0.0026663760654628277   1.5851812362670898  1.0117855072021484  0.011467915028333664
61  30  6.561000000000002e-05   1   0.18644125759601593 0.1339721381664276  0.001049382728524506    0.3954048156738281  0.2655892074108124  0.0025963117368519306   1.6084412336349487  1.0342886447906494  0.01148304995149374
62  30  5.904900000000002e-05   1   0.18621087074279785 0.13409340381622314 0.0010423494968563318   0.38124144077301025 0.24979856610298157 0.002628857269883156    1.5878326892852783  1.0134837627410889  0.011486977338790894
63  30  5.904900000000002e-05   1   0.1834608018398285  0.13197210431098938 0.0010297741973772645   0.36862853169441223 0.24548104405403137 0.0024629496037960052   1.6071679592132568  1.0346388816833496  0.01145058311522007
64  30  5.904900000000002e-05   1   0.18401990830898285 0.13211655616760254 0.0010380669264122844   0.3803935647010803  0.2539603114128113  0.002528664655983448    1.6067619323730469  1.0292704105377197  0.011549830436706543
65  30  5.904900000000002e-05   1   0.1801479607820511  0.1296975463628769  0.0010090083815157413   0.3926430344581604  0.25777149200439453 0.0026974305510520935   1.607120156288147   1.0243370532989502  0.011655662208795547
66  30  5.314410000000002e-05   1   0.18155881762504578 0.13082361221313477 0.0010147038847208023   0.3843800127506256  0.2576048672199249  0.002535502891987562    1.634810209274292   1.0468695163726807  0.01175881177186966
67  30  5.314410000000002e-05   1   0.18025223910808563 0.13036581873893738 0.0009977283189073205   0.37600213289260864 0.24873045086860657 0.0025454340502619743   1.6073455810546875  1.0307562351226807  0.011531788855791092
68  30  5.314410000000002e-05   1   0.17886173725128174 0.12943679094314575 0.000988499028608203    0.4111727476119995  0.27591991424560547 0.002705056220293045    1.6365418434143066  1.055818796157837   0.011614460498094559
69  30  5.314410000000002e-05   1   0.17877180874347687 0.12891322374343872 0.0009971719700843096   0.3828190565109253  0.2517228126525879  0.002621924504637718    1.6224792003631592  1.048477292060852   0.011480036191642284
70  30  4.782969000000002e-05   1   0.17890435457229614 0.12936143577098846 0.0009908585343509912   0.36745327711105347 0.24403075873851776 0.0024684504605829716   1.6078681945800781  1.033294916152954   0.01149146631360054
71  30  4.782969000000002e-05   1   0.17649614810943604 0.12765824794769287 0.0009767578449100256   0.38581573963165283 0.2528013288974762  0.0026602877769619226   1.6143182516098022  1.0433785915374756  0.011418793350458145
72  30  4.782969000000002e-05   1   0.17798292636871338 0.1286521553993225  0.0009866153122857213   0.3816826343536377  0.2575749158859253  0.0024821539409458637   1.6162062883377075  1.043137788772583   0.011461367830634117
73  30  4.782969000000002e-05   1   0.17776210606098175 0.12819987535476685 0.000991244800388813    0.3692198395729065  0.24155335128307343 0.002553330035880208    1.5938838720321655  1.0230413675308228  0.011416849680244923
74  30  4.304672100000002e-05   1   0.17558138072490692 0.12739460170269012 0.0009637356270104647   0.37902846932411194 0.24829795956611633 0.0026146103627979755   1.624485969543457   1.0433435440063477  0.011622847989201546
75  30  4.304672100000002e-05   1   0.1762307584285736  0.1275186538696289  0.0009742418769747019   0.360087513923645   0.23466652631759644 0.002508420031517744    1.619303584098816   1.0386667251586914  0.011612736620008945
76  30  4.304672100000002e-05   1   0.17535412311553955 0.1269785463809967  0.000967511790804565    0.3618285655975342  0.23789244890213013 0.0024787227157503366   1.6162554025650024  1.0366286039352417  0.011592534370720387
77  30  4.304672100000002e-05   1   0.17371511459350586 0.12560778856277466 0.000962146557867527    0.40583306550979614 0.27173560857772827 0.0026819496415555477   1.6362698078155518  1.0568699836730957  0.011587998829782009
78  30  3.874204890000002e-05   1   0.17146393656730652 0.1246747374534607  0.0009357838425785303   0.36760979890823364 0.24512331187725067 0.002449729945510626    1.6309565305709839  1.052272081375122   0.011573688127100468
79  30  3.874204890000002e-05   1   0.17498280107975006 0.1262698471546173  0.0009742592228576541   0.3842569589614868  0.2563364505767822  0.002558410167694092    1.630470633506775   1.0511462688446045  0.011586485430598259
80  30  3.874204890000002e-05   1   0.17245066165924072 0.1250864416360855  0.0009472845122218132   0.3557499647140503  0.23492515087127686 0.002416496630758047    1.6224180459976196  1.0472849607467651  0.011502662673592567
81  30  3.874204890000002e-05   1   0.17034471035003662 0.12369927763938904 0.0009329086751677096   0.3632718026638031  0.2403174340724945  0.0024590876419097185   1.627458095550537   1.046884298324585   0.01161147654056549
82  30  3.4867844010000016e-05  1   0.18225222826004028 0.13358429074287415 0.0009733590995892882   0.3568686842918396  0.23409464955329895 0.002455480396747589    1.4191677570343018  0.8532653450965881  0.01131804846227169
83  30  3.4867844010000016e-05  1   0.18599048256874084 0.1377549171447754  0.0009647110709920526   0.3496299684047699  0.22919687628746033 0.0024086618795990944   1.4392653703689575  0.8682156801223755  0.011420992203056812
Start training from snapshot snapshots/80_0.35574996471405030.234925150871276860.002416496630758047_checkpoint.pth
81  30  3.874204890000002e-05   1   0.1724991798400879  0.12507286667823792 0.0009485261980444193   0.34781113266944885 0.22930729389190674 0.002370076719671488    1.6278815269470215  1.0495078563690186  0.011567475274205208
82  30  3.874204890000002e-05   1   0.17397192120552063 0.12594695389270782 0.000960499164648354    0.35940518975257874 0.23898711800575256 0.0024083612952381372   1.6462254524230957  1.060730218887329   0.011709904298186302
83  30  3.874204890000002e-05   1   0.17461711168289185 0.12765100598335266 0.0009393221116624773   0.39857059717178345 0.26607546210289    0.0026499025989323854   1.3382909297943115  0.7800009250640869  0.01116579957306385
84  30  3.874204890000002e-05   1   0.18918484449386597 0.14046071469783783 0.0009744824492372572   0.3677120506763458  0.24088290333747864 0.002536582760512829    1.4385143518447876  0.8590496778488159  0.011589291505515575
Start training from snapshot snapshots/80_0.35574996471405030.234925150871276860.002416496630758047_checkpoint.pth
81  30  3.874204890000002e-05   1   0.17182175815105438 0.12470818310976028 0.0009422714356333017   0.38291144371032715 0.25593215227127075 0.00253958604298532 1.624972939491272   1.0430045127868652  0.011639367789030075
82  30  3.874204890000002e-05   1   0.17316262423992157 0.12534824013710022 0.0009562873747199774   0.38122129440307617 0.24563629925251007 0.002711699577048421    1.3504420518875122  0.8041160106658936  0.010926522314548492
83  30  3.4867844010000016e-05  1   0.17935985326766968 0.1306007206439972  0.000975182862021029    0.36995503306388855 0.23961582779884338 0.0026067837607115507   1.4804396629333496  0.9145945310592651  0.011316902004182339
84  30  3.4867844010000016e-05  1   0.17891967296600342 0.1303074061870575  0.0009722452377900481   0.3744434118270874  0.24323700368404388 0.0026241280138492584   1.487325668334961   0.9209854602813721  0.011326804757118225
85  30  3.4867844010000016e-05  1   0.18001699447631836 0.13056790828704834 0.0009889815701171756   0.3664088845252991  0.23736612498760223 0.0025808552745729685   1.5017907619476318  0.9313408136367798  0.01140899769961834
Start training from snapshot snapshots/81_0.347811132669448850.229307293891906740.002370076719671488_checkpoint.pth
82  30  3.874204890000002e-05   1   0.17035850882530212 0.12403380870819092 0.0009264942491427064   0.3499397337436676  0.22856155037879944 0.002427563536912203    1.6391862630844116  1.058560848236084   0.011612508445978165
83  30  3.874204890000002e-05   1   0.172819122672081   0.12525588274002075 0.0009512646938674152   0.3511059284210205  0.23205071687698364 0.0023811045102775097   1.6166735887527466  1.0386337041854858  0.011560797691345215
84  30  3.874204890000002e-05   1   0.17583176493644714 0.1268552839756012  0.0009795294608920813   0.3530122637748718  0.23116186261177063 0.002437007613480091    1.5328103303909302  0.9596449136734009  0.011463308706879616
85  30  3.874204890000002e-05   1   0.17785266041755676 0.12802061438560486 0.000996640883386135    0.350966215133667   0.23069368302822113 0.0024054506793618202   1.5213606357574463  0.9495564699172974  0.011436084285378456
Start training from snapshot snapshots/83_0.34962996840476990.229196876287460330.0024086618795990944_checkpoint.pth
84  30  3.4867844010000016e-05  1   0.17184239625930786 0.12444047629833221 0.0009480381850153208   0.3725314140319824  0.24531829357147217 0.002544262446463108    1.601953387260437   1.029545545578003   0.011448157951235771
85  30  3.4867844010000016e-05  1   0.17133831977844238 0.12490392476320267 0.0009286879212595522   0.3977296054363251  0.2585490643978119  0.0027836107183247805   1.401845932006836   0.8525246381759644  0.010986424051225185
86  30  3.4867844010000016e-05  1   0.1802639365196228  0.13258329033851624 0.0009536131983622909   0.3850559592247009  0.2493996024131775  0.002713126828894019    1.478780746459961   0.9177712202072144  0.011220188811421394
87  30  3.138105960900002e-05   1   0.1867193579673767  0.13581979274749756 0.0010179912205785513   0.3980981111526489  0.26101216673851013 0.0027417191304266453   1.5000394582748413  0.9343942403793335  0.011312903836369514
Start training from snapshot snapshots/83_0.34962996840476990.229196876287460330.0024086618795990944_checkpoint.pth
84  30  3.4867844010000016e-05  1   0.1744769811630249  0.1266784965991974  0.0009559695608913898   0.4133836030960083  0.2712627649307251  0.0028424165211617947   1.3031671047210693  0.7627624273300171  0.010808095335960388
85  30  3.4867844010000016e-05  1   0.18438966572284698 0.13491205871105194 0.0009895521216094494   0.3668936789035797  0.2392769455909729  0.002552334452047944    1.4190642833709717  0.8611618876457214  0.011158047243952751
rwightman commented 3 years ago

@MichaelMonashev I'm probably not going to be able to help you there; you're using your own training code (or a modified version) and not any of the public datasets. I've trained all of them with good results. The only nans tend to appear at the beginning if the LR is too aggressive, the wrong optimizer is used, or batch sizes are too small. Late-epoch nans are usually indicative of a different sort of problem.

You might want to try enabling torch.autograd.detect_anomaly() and start hunting for the first origin of the nan...
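For anyone following along, a minimal sketch of enabling anomaly detection around the backward pass (this uses the standard PyTorch API; the surrounding training-loop names are placeholders):

```python
import torch

# Option 1: enable globally (slows training, use only while debugging).
torch.autograd.set_detect_anomaly(True)

# Option 2: wrap just the forward/backward of a suspect step.
with torch.autograd.detect_anomaly():
    loss = criterion(model(images), targets)  # placeholder names
    loss.backward()  # errors out and points at the forward op that produced the nan
```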

rwightman commented 3 years ago

I don't think there is likely to be an actual issue with the model here. If a reproduction can be made with the training code in this repo, I'll look at it. For these models, the focal loss specifically isn't the most stable, and the official impl has plenty of open questions about nan issues as well. It's best to stay within the range of recommended learning rates, optimizers, etc. unless you know how to tune / debug those hparams.
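One pragmatic guard, if you keep tuning on your own data, is to skip the optimizer step when the loss goes non-finite so a single bad batch doesn't poison the weights. A minimal sketch with placeholder names (not the training code in this repo):

```python
import torch

loss = criterion(model(images), targets)  # placeholder names
if torch.isfinite(loss):
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
else:
    # Skip the update and log the offending batch for inspection.
    print("non-finite loss, skipping step")
```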

MichaelMonashev commented 3 years ago

@rwightman, it turned out I had hardware problems. I changed GPUs and am testing the training code now. After several hours I have not seen any nans in the loss.