jiesutd / LatticeLSTM

Chinese NER using Lattice LSTM. Code for ACL 2018 paper.

MemoryError with my own dataset, log below #47

Closed. Veyronl closed this issue 5 years ago.

Veyronl commented 5 years ago

CuDNN: True
GPU available: False
Status: train
Seg: True
Train file: ./rd_data/train.txt
Dev file: ./rd_data/dev.txt
Test file: ./rd_data/test.txt
Raw file: None
Char emb: data/gigaword_chn.all.a2b.uni.ite50.vec
Bichar emb: None
Gaz file: data/ctb.50d.vec
Model saved to: ./rd_data/demo_test
Load gaz file: data/ctb.50d.vec total size: 704368
gaz alphabet size: 31572
gaz alphabet size: 33642
gaz alphabet size: 35512
build word pretrain emb...
Embedding: pretrain word:11327, prefect match:2497, case_match:0, oov:29, oov%:0.0114760585675
build biword pretrain emb...
Embedding: pretrain word:0, prefect match:0, case_match:0, oov:91271, oov%:0.999989043737
build gaz pretrain emb...
Embedding: pretrain word:704368, prefect match:35510, case_match:0, oov:1, oov%:2.81594953818e-05
Training model...
DATA SUMMARY START:
Tag scheme: BIO
MAX SENTENCE LENGTH: 250
MAX WORD LENGTH: -1
Number normalized: False
Use bigram: False
Word alphabet size: 2527
Biword alphabet size: 91272
Char alphabet size: 2527
Gaz alphabet size: 35512
Label alphabet size: 5
Word embedding size: 50
Biword embedding size: 50
Char embedding size: 30
Gaz embedding size: 50
Norm word emb: True
Norm biword emb: True
Norm gaz emb: False
Norm gaz dropout: 0.5
Train instance number: 28185
Dev instance number: 5885
Test instance number: 5977
Raw instance number: 0
Hyperpara iteration: 100
Hyperpara batch size: 1
Hyperpara lr: 0.015
Hyperpara lr_decay: 0.05
Hyperpara HP_clip: 5.0
Hyperpara momentum: 0
Hyperpara hidden_dim: 200
Hyperpara dropout: 0.5
Hyperpara lstm_layer: 1
Hyperpara bilstm: True
Hyperpara GPU: False
Hyperpara use_gaz: True
Hyperpara fix gaz emb: False
Hyperpara use_char: False
DATA SUMMARY END.
Traceback (most recent call last):
  File "main_test.py", line 444, in train(data, save_model_dir, seg)
  File "main_test.py", line 240, in train
    save_data_setting(data, save_data_name)
  File "main_test.py", line 90, in save_data_setting
    new_data = copy.deepcopy(data)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 298, in _deepcopy_inst
    state = deepcopy(state, memo)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 230, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 230, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 230, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/usr/lib/python2.7/copy.py", line 192, in deepcopy
    memo[d] = y
MemoryError

Veyronl commented 5 years ago

I increased the memory and it is solved for now, but I still have a question: which steps consume so much memory? I observed that during this stage:

gaz alphabet size: 31572
gaz alphabet size: 33642
gaz alphabet size: 35512
build word pretrain emb...
Embedding: pretrain word:11327, prefect match:2497, case_match:0, oov:29, oov%:0.0114760585675
build biword pretrain emb...
Embedding: pretrain word:0, prefect match:0, case_match:0, oov:91271, oov%:0.999989043737
build gaz pretrain emb...
Embedding: pretrain word:704368, prefect match:35510, case_match:0, oov:1, oov%:2.81594953818e-05

memory usage climbs sharply.

jiesutd commented 5 years ago

> I increased the memory and it is solved for now, but I still have a question: which steps consume so much memory? I observed that during this stage:
>
> gaz alphabet size: 31572
> gaz alphabet size: 33642
> gaz alphabet size: 35512
> build word pretrain emb...
> Embedding: pretrain word:11327, prefect match:2497, case_match:0, oov:29, oov%:0.0114760585675
> build biword pretrain emb...
> Embedding: pretrain word:0, prefect match:0, case_match:0, oov:91271, oov%:0.999989043737
> build gaz pretrain emb...
> Embedding: pretrain word:704368, prefect match:35510, case_match:0, oov:1, oov%:2.81594953818e-05
>
> memory usage climbs sharply.

The memory increase at this stage is normal, because this is when the pretrained embeddings are loaded.
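
For context, here is a minimal sketch of how such a pretrained-embedding table is typically built. This is not the repository's exact code (the function and variable names are assumptions); it only illustrates why memory climbs here: the whole pretrained file is first read into a dict (ctb.50d.vec has ~704k entries), and then an alphabet_size x embed_dim matrix is allocated on top of it.

```python
import numpy as np

def build_pretrain_embedding(embedding_path, alphabet, embed_dim=50):
    # Step 1: read the whole pretrained file into a dict (large chunk of RAM by itself).
    pretrained = {}
    with open(embedding_path, "r", encoding="utf-8") as fp:
        for line in fp:
            parts = line.rstrip().split()
            if len(parts) != embed_dim + 1:
                continue
            pretrained[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

    # Step 2: allocate an (alphabet_size x embed_dim) matrix and fill it,
    # random-initializing out-of-vocabulary entries.
    scale = np.sqrt(3.0 / embed_dim)
    matrix = np.empty((len(alphabet), embed_dim), dtype=np.float32)
    perfect, oov = 0, 0
    for word, idx in alphabet.items():       # alphabet assumed to map word -> index
        if word in pretrained:
            matrix[idx] = pretrained[word]
            perfect += 1
        else:
            matrix[idx] = np.random.uniform(-scale, scale, embed_dim)
            oov += 1
    print("Embedding: pretrain word:%d, perfect match:%d, oov:%d"
          % (len(pretrained), perfect, oov))
    return matrix
```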

As your earlier error shows, the memory increase inside the function `save_data_setting` should also be considerable, since a deep copy happens there. If you don't want to save the model, you can comment out that line, and memory consumption may drop quite a bit.
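
For reference, a hedged sketch of a lighter-weight save that avoids deep-copying the whole data object. The traceback only tells us that `save_data_setting` in main_test.py calls `copy.deepcopy(data)`; the field names below are assumptions about which attributes hold the large instance lists, so adjust them to your actual Data class.

```python
import pickle

def save_data_setting_light(data, save_file):
    # Assumed names of the memory-heavy instance lists on the data object.
    heavy_fields = ["train_texts", "dev_texts", "test_texts",
                    "train_Ids", "dev_Ids", "test_Ids"]
    backup = {name: getattr(data, name) for name in heavy_fields if hasattr(data, name)}
    try:
        # Temporarily drop the instances so only the settings get pickled.
        for name in backup:
            setattr(data, name, [])
        with open(save_file, "wb") as fp:
            pickle.dump(data, fp)
    finally:
        # Restore the instances so training can continue unchanged.
        for name, value in backup.items():
            setattr(data, name, value)
```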

Veyronl commented 5 years ago

thx.

1. The deep copy you mentioned refers to the model-saving stage, right?
2. About training speed: the processed data format is one `char label` pair per line. Are the instances in the figure below produced inside the model by splitting on line breaks, giving the 500 sentences? And is it slow because the lexicon-word-to-character matching is then done for every one of those 500 sentences? (I read the earlier issues; there is no batch processing at the moment.)
3. My training data is annotated in BIOES rather than BMES (earlier issues said no change is needed). Will the decoding stage then give me BIOES tags? (I will test this shortly.)

jiesutd commented 5 years ago

1. Yes.
2. I don't follow. Where is the figure?
3. The decoding output format is the same as that of the training corpus.

Veyronl commented 5 years ago

Epoch: 0/1
Learning rate is setted as: 0.015
Instance: 500; Time: 188.61s; loss: 5351.3859; acc: 26426.0/29645.0=0.8914
Instance: 1000; Time: 184.39s; loss: 2590.9531; acc: 52700.0/58342.0=0.9033
Instance: 1500; Time: 196.93s; loss: 3009.9114; acc: 80487.0/88716.0=0.9072
Instance: 2000; Time: 192.82s; loss: 2872.4978; acc: 108347.0/118665.0=0.9130
Instance: 2500; Time: 180.91s; loss: 1488.9217; acc: 135777.0/147559.0=0.9202
Instance: 3000; Time: 192.16s; loss: 1453.1604; acc: 164808.0/178088.0=0.9254
Instance: 3500; Time: 192.00s; loss: 1059.3942; acc: 193928.0/208391.0=0.9306
Instance: 4000; Time: 180.64s; loss: 892.0300; acc: 221946.0/237467.0=0.9346
Instance: 4500; Time: 182.93s; loss: 1006.5372; acc: 249981.0/266746.0=0.9371
Instance: 5000; Time: 181.15s; loss: 1002.5518; acc: 277711.0/295718.0=0.9391
Instance: 5500; Time: 185.72s; loss: 846.2271; acc: 306403.0/325531.0=0.9412
Instance: 6000; Time: 195.29s; loss: 878.5309; acc: 336225.0/356530.0=0.9430
Instance: 6500; Time: 186.83s; loss: 670.2538; acc: 365363.0/386562.0=0.9452
Instance: 7000; Time: 183.57s; loss: 646.9277; acc: 393546.0/415649.0=0.9468
Instance: 7500; Time: 196.29s; loss: 688.7295; acc: 422981.0/445980.0=0.9484
Instance: 8000; Time: 183.80s; loss: 867.9504; acc: 451560.0/475720.0=0.9492
Instance: 8500; Time: 183.41s; loss: 629.3206; acc: 480304.0/505325.0=0.9505
Instance: 9000; Time: 190.84s; loss: 768.0165; acc: 509903.0/535867.0=0.9515
Instance: 9500; Time: 188.83s; loss: 672.3151; acc: 539304.0/566222.0=0.9525
Instance: 10000; Time: 182.44s; loss: 564.7902; acc: 567700.0/595437.0=0.9534
Instance: 10500; Time: 182.84s; loss: 753.5829; acc: 596291.0/625042.0=0.9540
Instance: 11000; Time: 176.41s; loss: 635.1610; acc: 623895.0/653443.0=0.9548
Instance: 11500; Time: 175.80s; loss: 528.4501; acc: 651587.0/681912.0=0.9555
Instance: 12000; Time: 185.51s; loss: 518.7684; acc: 680669.0/711698.0=0.9564
Instance: 12500; Time: 173.05s; loss: 505.4261; acc: 707994.0/739726.0=0.9571
Instance: 13000; Time: 184.63s; loss: 511.4036; acc: 736783.0/769245.0=0.9578
Instance: 13500; Time: 198.48s; loss: 511.1784; acc: 767451.0/800611.0=0.9586
Instance: 14000; Time: 170.48s; loss: 538.9246; acc: 794375.0/828249.0=0.9591
Instance: 14500; Time: 170.52s; loss: 573.7534; acc: 821292.0/855948.0=0.9595
Instance: 15000; Time: 187.80s; loss: 954.0643; acc: 850920.0/886549.0=0.9598
Instance: 15500; Time: 187.41s; loss: 591.9061; acc: 880671.0/917047.0=0.9603
Instance: 16000; Time: 182.14s; loss: 472.7717; acc: 909166.0/946148.0=0.9609
Instance: 16500; Time: 190.94s; loss: 559.8140; acc: 939362.0/977069.0=0.9614

Regarding the second question: I see training runs in groups of 500, so I want to ask how these groups are derived from the training data (is it simply that different sentences are separated by blank lines?), and whether training is slow because, for every sentence in each group of 500, the lexicon-word-to-character matching has to be computed, and the lattice paths searched for each 500 sentences are all different and have to be cached in memory (I read the earlier issues; there is no batch processing at the moment).
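
For context, a minimal sketch of how "char label" data separated by blank lines is usually split into sentence instances; this is the convention the question assumes, and the code here is an illustration, not necessarily the repository's actual reader.

```python
def read_instances(path):
    """Read CoNLL-style data: one 'char label' pair per line,
    with a blank line separating sentences."""
    sentences, chars, labels = [], [], []
    with open(path, "r", encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if not line:                    # blank line ends the current sentence
                if chars:
                    sentences.append((chars, labels))
                    chars, labels = [], []
                continue
            parts = line.split()
            if len(parts) < 2:
                continue
            chars.append(parts[0])
            labels.append(parts[1])
    if chars:                               # last sentence without a trailing blank line
        sentences.append((chars, labels))
    return sentences
```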

jiesutd commented 5 years ago

@Veyronl The number of sentences trained together is the batch size, not 500; 500 is only the interval for printing intermediate results. Training is slow because each sentence has to be computed separately; there is no parallel computation.
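
A hedged sketch of the loop pattern this describes (the method names are hypothetical, not the repository's API): each sentence is forwarded and backpropagated on its own, and 500 is only the reporting interval, not a batch.

```python
import time

def train_one_epoch(model, instances, report_every=500):
    start = time.time()
    total_loss, correct, total = 0.0, 0, 0
    for idx, instance in enumerate(instances, start=1):
        # One sentence at a time: no batching, no parallel computation across sentences.
        loss, right, whole = model.train_step(instance)   # hypothetical per-sentence step
        total_loss += loss
        correct += right
        total += whole
        if idx % report_every == 0:
            # Print intermediate results every `report_every` instances.
            print("Instance: %d; Time: %.2fs; loss: %.4f; acc: %d/%d=%.4f"
                  % (idx, time.time() - start, total_loss,
                     correct, total, float(correct) / total))
            start, total_loss = time.time(), 0.0
```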

Veyronl commented 5 years ago

OK, got it now, thanks. This lattice is very much like dynamic programming for finding the optimal path.