Closed: binzhouu closed this issue 5 years ago
This is not a bug; your gradients exploded during training. You may need to check your data or tune the hyperparameters.
Hi, after stepping through with a debugger I found that it is not a gradient-explosion problem. The 300-dimensional embedding of the "unk" token should be all zeros, but during execution the data.build_pretrain_emb() method leaves part of its values as nan, which makes the subsequent loss impossible to compute. Here is the part of the log that shows the embedding:

```
unk embedding:
[4.94065646e-324 2.59032689e-318 2.23850157e-314 0.00000000e+000 1.72688295e-318 nan nan 4.61116051e-309 5.17682093e-309 5.89638970e-310
 3.23712580e-309 nan nan 6.29492174e-309 3.46572841e-309 1.89727644e-310 6.49591718e-309 nan 4.60891120e-309 4.89175202e-309
 nan 4.92778351e-309 7.06560939e-309 6.78703378e-309 4.78263899e-309 4.08013107e-309 1.12847736e-310 4.08853417e-309 nan 5.65514000e-309
 3.37045080e-309 1.04158164e-309 5.31999199e-309 2.12783128e-309 3.69723815e-309 7.37987696e-309 1.20196208e-309 nan nan 4.13647006e-309
 nan 5.15271506e-309 5.46870145e-309 nan nan nan 3.65337649e-309 6.56246296e-309 nan 5.12759063e-309
 nan 5.44644172e-309 4.79310043e-309 7.31318263e-309 nan 5.27211976e-309 3.86156550e-309 nan 5.72798812e-309 6.12730529e-309
 5.84066610e-309 4.07486852e-310 1.74597814e-310 nan nan nan 6.86030629e-309 nan nan 1.80467254e-309
 2.64905711e-309 nan nan nan 7.24305067e-309 2.74514108e-309 4.74121764e-309 nan 6.32356868e-309 2.76078019e-309
 7.04545043e-309 5.79362145e-309 nan 1.72845045e-309 2.02499936e-309 nan 4.52897562e-309 5.28338756e-309 1.81600401e-310 nan
 6.34971167e-309 8.99832315e-310 2.67725843e-309 nan nan 6.10986248e-310 2.22754386e-309 3.77481832e-310 3.02448061e-310 2.79369234e-309
 7.73679665e-311 3.59886242e-309 nan nan 4.41256293e-309 4.18366324e-309 3.81263228e-309 4.99214364e-309 nan 4.54194101e-309
 2.66079174e-309 1.23300688e-309 1.52683963e-309 nan 2.94006761e-309 4.33385810e-309 nan nan nan nan
 7.24290213e-309 5.21616273e-309 nan nan 4.87543387e-309 nan nan 6.37776445e-309 nan 5.62549572e-309
 nan 4.95539067e-309 nan 4.97226054e-310 nan 1.52251076e-309 nan 5.21501686e-310 4.63064044e-309 1.17340001e-309
 2.16698210e-310 1.31879917e-309 7.08387977e-309 nan 5.49745450e-310 4.41205365e-309 1.33473535e-309 nan nan 7.31135772e-309
 2.25998918e-309 nan 5.89554092e-310 2.18798987e-310 1.31892648e-309 2.27427021e-309 nan nan 2.99861347e-309 4.30277087e-309
 1.09159708e-309 5.58129455e-309 3.79043620e-309 6.94302169e-309 4.42737446e-309 5.63644522e-310 4.98079096e-309 2.25574519e-309 4.27522736e-309 4.38285499e-309
 3.08173205e-309 nan 1.52662743e-309 1.67701327e-310 4.62448665e-309 6.23202578e-309 nan 3.60590745e-309 6.88623708e-309 nan
 1.12419093e-309 7.12182105e-309 nan 2.88544744e-309 3.84263730e-309 6.30867227e-309 2.68413370e-309 nan 1.21772850e-309 nan
 1.52335956e-309 8.23589007e-310 2.94528772e-309 8.56946781e-310 nan nan 3.88569259e-309 nan nan 4.75585941e-309
 nan 5.76705406e-309 nan 7.04328599e-309 6.97606117e-310 4.13214118e-309 8.30485493e-310 nan 2.11499320e-310 4.77956210e-309
 1.48130160e-309 5.91279273e-309 nan 3.66207668e-309 5.57458904e-309 nan 4.20957281e-309 6.69396304e-309 nan 5.95875516e-309
 9.29582697e-310 nan 5.36351412e-309 5.61112981e-309 nan 3.70608687e-309 nan nan nan 3.58871928e-309
 nan 3.11744524e-309 6.30209408e-309 8.89752835e-310 2.11658470e-309 nan 4.16789681e-309 5.21109116e-309 nan 4.99912500e-309
 2.52975850e-309 5.48385250e-309 4.64746786e-309 nan 5.83892606e-309 1.03360293e-309 nan nan nan 3.08928635e-309
 2.82516154e-309 nan nan nan 4.84027253e-311 1.93439014e-309 nan nan 2.99564268e-309 3.31428157e-309
 nan 4.38302475e-309 4.83036268e-309 8.29403276e-310 7.39678927e-309 6.27416862e-309 1.71624898e-309 1.00215495e-309 nan nan
 3.32200563e-309 1.93360500e-309 6.91855508e-310 6.05800090e-309 1.95889920e-309 nan 1.33473535e-310 5.10687995e-309 nan 5.41473910e-309
 nan nan 4.68536682e-311 6.31007278e-309 1.38112218e-309 nan 4.57432267e-309 4.41544884e-309 5.18562721e-309 2.11775180e-310
 nan 6.77406838e-309 3.91692837e-309 4.84712645e-309 4.99078556e-309 1.11264727e-309 nan nan 2.16513597e-309 6.73523586e-309]
```

In addition, my number_normalized is set to True. I can patch this by hand, but what would you suggest?
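A quick way to confirm this kind of problem is to scan the embedding matrix for nan rows before training starts. A minimal sketch, assuming pretrain_emb is the numpy array built by data.build_pretrain_emb() (illustrative only, not code from this repository):

```python
import numpy as np

def find_nan_rows(pretrain_emb):
    """Return indices of embedding rows that contain at least one nan."""
    return np.where(np.isnan(pretrain_emb).any(axis=1))[0]

# Example: a 4x3 matrix where row 1 is corrupted.
emb = np.zeros([4, 3])
emb[1, 2] = np.nan
print(find_nan_rows(emb))  # -> [1]
```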
Verified it: this is the np.empty() pitfall. Because np.empty() leaves the allocated memory uninitialized rather than truly initializing it, the resulting array can contain nan values. That is, this line:

`pretrain_emb = np.empty([word_alphabet.size(), embedd_dim])`

produces a numpy array that can contain nan:

```
pretrain_emb:
[[4.94065646e-324 2.59032689e-318 2.22797229e-314 ... 5.17773339e-309 4.83815054e-312 6.10973516e-309]
 [5.93267585e-310 nan 3.65174256e-310 ... nan 3.27999012e-309 6.90690532e-309]
 [nan 4.87490337e-309 7.30675299e-309 ... 3.20964595e-309 1.44577939e-309 2.46639571e-309]
 ...
 [0.00000000e+000 0.00000000e+000 0.00000000e+000 ... 0.00000000e+000 0.00000000e+000 0.00000000e+000]
 [0.00000000e+000 0.00000000e+000 0.00000000e+000 ... 0.00000000e+000 0.00000000e+000 0.00000000e+000]
 [0.00000000e+000 0.00000000e+000 0.00000000e+000 ... 0.00000000e+000 0.00000000e+000 0.00000000e+000]]
```

Adding one conversion step makes the problem go away:

`pretrain_emb = np.nan_to_num(np.empty([word_alphabet.size(), embedd_dim]))`
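For anyone hitting the same thing, a minimal standalone sketch of the pitfall (not NCRF++ code; the shape is just the one from the log in this issue). np.empty() allocates memory without initializing it, so the contents are whatever bytes happen to be there, which may decode to nan; np.nan_to_num() repairs the nan entries, while np.zeros() avoids the problem entirely:

```python
import numpy as np

vocab_size, embedd_dim = 54322, 300  # sizes from the log in this issue

# np.empty allocates WITHOUT initializing: the contents are arbitrary
# leftover bytes, which may decode to nan.
risky = np.empty([vocab_size, embedd_dim])
print("nan entries after np.empty:", np.isnan(risky).sum())  # may be > 0

# Workaround from above: map nan -> 0. Note that non-nan garbage values
# (like the tiny denormals in the log) survive this conversion.
patched = np.nan_to_num(np.empty([vocab_size, embedd_dim]))
assert not np.isnan(patched).any()

# Alternative: request zero-initialized memory in the first place.
clean = np.zeros([vocab_size, embedd_dim])
assert not np.isnan(clean).any()
```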
How did you get this log? How is the log file generated, and does it require any special setup?
Hi author, thanks for sharing this project. Today, after adding char_embeddings and word_embeddings (the model is charCNN + wordLSTM + CRF), I hit what looks like a bug. Training aborts with "ERROR: LOSS EXPLOSION (>1e8) ! PLEASE SET PROPER PARAMETERS AND STRUCTURE! EXIT....", and looking back through the log I found the cause is sample_loss: nan. Is this a bug? I would appreciate an answer.

```
data.seg: True
status: train
Seed num: 42
MODEL: train
total_column: 2
tagScheme: BIO
tagScheme: BIO
tagScheme: BIO
train_texts_num: 23911
dev_texts_num: 2986
test_texts_num: 2987
Load pretrained word embedding, norm: False, dir: renmindata/embeddings/sgns.renmin.char
Embedding: pretrain word:355792, prefect match:44646, case_match:1, oov:9674, oov%:0.1780862265748684
Load pretrained char embedding, norm: False, dir: renmindata/embeddings/sgns.renmin.char
Embedding: pretrain word:355792, prefect match:4590, case_match:0, oov:88, oov%:0.018807437486642445
positive:negtive 3.896784763465083:1
Training model...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DATA SUMMARY START:
 I/O:
     Start Sequence Laebling task...
     Tag scheme: BIO
     Split token: |||
     MAX SENTENCE LENGTH: 250
     MAX WORD LENGTH: -1
     Number normalized: True
     Word alphabet size: 54322
     Char alphabet size: 4679
     Label alphabet size: 6
     Word embedding dir: renmindata/embeddings/sgns.renmin.char
     Char embedding dir: renmindata/embeddings/sgns.renmin.char
     Word embedding size: 300
     Char embedding size: 300
     Norm word emb: False
     Norm char emb: False
     Train file directory: renmindata/dataset/train.txt
     Dev file directory: renmindata/dataset/dev.txt
     Test file directory: renmindata/dataset/test.txt
     Raw file directory: None
     Dset file directory: renmindata/model/lstmcrf
     Model file directory: renmindata/model/lstmcrf
     Loadmodel directory: None
     Decode file directory: None
     Train instance number: 23911
     Dev instance number: 2986
     Test instance number: 2987
     Raw instance number: 0
     FEATURE num: 0
 ++++++++++++++++++++++++++++++++++++++++
 Model Network:
     Model use_crf: True
     Model word extractor: LSTM
     Model use_char: True
     Model char extractor: CNN
     Model char_hidden_dim: 50
 ++++++++++++++++++++++++++++++++++++++++
 Training:
     Optimizer: SGD
     Iteration: 1
     BatchSize: 10
     Average batch loss: False
 ++++++++++++++++++++++++++++++++++++++++
 Hyperparameters:
     Hyper lr: 0.015
     Hyper lr_decay: 0.05
     Hyper HP_clip: None
     Hyper momentum: 0.0
     Hyper l2: 1e-08
     Hyper hidden_dim: 200
     Hyper dropout: 0.5
     Hyper lstm_layer: 1
     Hyper bilstm: True
     Hyper GPU: False
DATA SUMMARY END.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
build sequence labeling network...
use_char: True
char feature extractor: CNN
word feature extractor: LSTM
use crf: True
build word sequence feature extractor: LSTM...
build word representation...
build char sequence feature extractor: CNN ...
lstm_hidden: 100
self.lstm: LSTM(350, 100, batch_first=True, bidirectional=True)
self.word_hidden: WordSequence(
  (droplstm): Dropout(p=0.5)
  (wordrep): WordRep(
    (char_feature): CharCNN(
      (char_drop): Dropout(p=0.5)
      (char_embeddings): Embedding(4679, 300)
      (char_cnn): Conv1d(300, 50, kernel_size=(3,), stride=(1,), padding=(1,))
    )
    (drop): Dropout(p=0.5)
    (word_embedding): Embedding(54322, 300)
    (feature_embeddings): ModuleList()
  )
  (lstm): LSTM(350, 100, batch_first=True, bidirectional=True)
  (hidden2tag): Linear(in_features=200, out_features=8, bias=True)
)
build CRF...
word2id has written down
char2id has written down
label2id has written down
Epoch: 0/1
Learning rate is set as: 0.015
Before Shuffle: first input word list: [2, 3, 4]
Shuffle: first input word list: [3571, 12526, 48391]
train_num: 23911
total_batch: 2392
running calculate loss..
loss: tensor(nan, grad_fn=)
tag_seq: tensor([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])
sample_loss: nan
...
sample_loss: nan
Instance: 500; Time: 27.81s; loss: nan; acc: 0.0/16339.0=0.0000
ERROR: LOSS EXPLOSION (>1e8) ! PLEASE SET PROPER PARAMETERS AND STRUCTURE! EXIT....
```
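For reference, the error above comes from a sanity check on the per-batch loss. A minimal sketch of that kind of guard, assuming a scalar PyTorch loss tensor (illustrative only, not necessarily this repository's exact code):

```python
import torch

def check_loss(sample_loss, threshold=1e8):
    """Stop training when the batch loss is nan or has exploded."""
    # The nan test must come first: nan > threshold is always False.
    if torch.isnan(sample_loss) or sample_loss.item() > threshold:
        raise SystemExit("ERROR: LOSS EXPLOSION (>%g)! "
                         "Check the data and hyperparameters." % threshold)

# usage after each forward pass:
# loss = model.calculate_loss(...)
# check_loss(loss)
```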