swabhs / open-sesame

A frame-semantic parsing system based on a softmax-margin SegRNN.
Apache License 2.0

Error when Dynet is saving the model after training #8

Closed pinouchon closed 6 years ago

pinouchon commented 6 years ago

I have this line at globalconfig.py:20: VERSION="1.7", and I'm working with the fn1.7 data. I ran the preprocessing steps.

Now I'm trying to run the training with this command: python segrnn-argid.py

I'm getting this error:

Traceback (most recent call last):
  File "segrnn-argid.py", line 98, in <module>
    wvs = get_wvec_map()
  File "/Users/pinouchon/code/huggingface/open-sesame/src/dataio.py", line 276, in get_wvec_map
    raise Exception("word vector file not found!", FILTERED_WVECS_FILE)
Exception: ('word vector file not found!', '../data/glove.6B.100d.framenet.txt')

The error goes away if I replace FILTERED_WVECS_FILE = DATADIR + "glove.6B.100d.framenet.txt" with FILTERED_WVECS_FILE = DATADIR + "glove.6B.100d.txt" in globalconfig.py:85.
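For context, the missing ../data/glove.6B.100d.framenet.txt is presumably produced by a preprocessing step that filters the full GloVe file down to the FrameNet vocabulary, so pointing FILTERED_WVECS_FILE at the unfiltered file just skips that filtering. A minimal sketch of what such a filter would look like (the file names come from the error above; the function name and the vocabulary source are assumptions, not the repository's actual code):

```python
def filter_wvecs(glove_path, out_path, vocab):
    """Copy only the GloVe lines whose word appears in `vocab`.

    Each GloVe line is "<word> <d1> <d2> ...", so the word is the
    token before the first space. Returns the number of lines kept.
    """
    kept = 0
    with open(glove_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.split(" ", 1)[0] in vocab:
                dst.write(line)
                kept += 1
    return kept
```

Running this over glove.6B.100d.txt with the corpus vocabulary would yield a much smaller embeddings file, which is why the code expects a separate *.framenet.txt file rather than the full 400k-word GloVe dump.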

Now the training starts without errors (python segrnn-argid.py).

But after about 40min of training, I get this new error:

[dev epoch=0 after=2001] lprec = 0.40382 lrec = 0.14518 lf1 = 0.21358 -- savinglibc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Could not write model to tmp/1.7model.sra-1527520332.05

This looks like a low-level error inside DyNet. I can't find the cause via Google/Stack Overflow. I installed DyNet with pip install dynet. I'm running OS X High Sierra and Python 2.7.10 inside a virtualenv. I ran the training twice and hit the same error both times (with a different tmp file name each time). The error doesn't look related to my fix at globalconfig.py:85. Any pointers?

Full output of the training:

[dynet] random seed: 1594657864
[dynet] allocating memory: 512MB
[dynet] memory allocation done.

COMMAND: segrnn-argid.py

PARSER SETTINGS
_____________________
PARSING MODE:       train
USING EXEMPLAR?     False
USING SPAN CLIP?    True
LOSS TYPE:          softmaxm
COST TYPE:          recall
R-O COST VALUE:     2
USING DROPOUT?      True
USING WORDVECS?     True
USING HIERARCHY?    False
USING D-SYNTAX?     False
USING C-SYNTAX?     False
USING PTB-CLOSS?    False
MODEL WILL BE SAVED TO  tmp/1.7model.sra-1527520332.05
_____________________
reading ../data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll...
# examples in ../data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll : 19391 in 3413 sents
# examples with missing arguments : 526

reading the frame-element - frame map from ../data/fndata-1.7/frame/...
# max FEs for frame: 32 in Frame(Traversing)

reading the word vectors file from ../data/glove.6B.100d.txt...
using pretrained embeddings of dimension 100
# words in vocab:       400575
# POS tags:             45
# lexical units:        9441
# LU POS tags:          14
# frames:               1223
# FEs:                  1287
# dependency relations: 1
# constituency labels:  1

clipping spans longer than 20...
longest span size: 102
longest FE span size: 89
# train examples before filter: 19391
# train examples after filter: 19391

reading ../data/neural/fn1.7/fn1.7.dev.syntaxnet.conll...
# examples in ../data/neural/fn1.7/fn1.7.dev.syntaxnet.conll : 2272 in 326 sents
# examples with missing arguments : 73

unknowns in dev

_____________________
# unseen, unlearnt test words in vocab: (45, 390570)
# unseen, unlearnt test POS tags:       (0, 1)
# unseen, unlearnt test lexical units:  (0, 6444)
# unseen, unlearnt test LU pos tags:    (0, 3)
# unseen, unlearnt test frames:         (0, 469)
# unseen, unlearnt test FEs:            (0, 521)
# unseen, unlearnt test deprels:        (0, 1)
# unseen, unlearnt test constit labels: (0, 1)

[lr=0.0005 clips=99 updates=100] 100 loss = 38.650128 [took 46.383 s]
[lr=0.0005 clips=100 updates=100] 200 loss = 20.779121 [took 50.716 s]
[lr=0.0005 clips=100 updates=100] 300 loss = 17.716823 [took 46.031 s]
[lr=0.0005 clips=99 updates=100] 400 loss = 18.769036 [took 41.463 s]
[lr=0.0005 clips=100 updates=100] 500 loss = 18.951144 [took 49.424 s]
[lr=0.0005 clips=100 updates=100] 600 loss = 20.763794 [took 51.008 s]
[lr=0.0005 clips=100 updates=100] 700 loss = 17.897359 [took 45.175 s]
[lr=0.0005 clips=100 updates=100] 800 loss = 17.369590 [took 42.235 s]
[lr=0.0005 clips=98 updates=100] 900 loss = 16.837128 [took 49.753 s]
[lr=0.0005 clips=100 updates=100] 1000 loss = 17.795842 [took 51.235 s]
[dev epoch=0 after=1001] wprec = 0.00000 wrec = 0.00000 wf1 = 0.00000
[dev epoch=0 after=1001] uprec = 0.00000 urec = 0.00000 uf1 = 0.00000
[dev epoch=0 after=1001] lprec = 0.00000 lrec = 0.00000 lf1 = 0.00000 [took 621.073 s]
[lr=0.0005 clips=100 updates=100] 1100 loss = 16.862659 [took 50.687 s]
[lr=0.0005 clips=100 updates=100] 1200 loss = 14.759756 [took 40.827 s]
[lr=0.0005 clips=100 updates=100] 1300 loss = 14.575772 [took 39.446 s]
[lr=0.0005 clips=100 updates=100] 1400 loss = 14.491017 [took 42.966 s]
[lr=0.0005 clips=100 updates=100] 1500 loss = 15.175744 [took 55.345 s]
[lr=0.0005 clips=100 updates=100] 1600 loss = 14.648142 [took 42.464 s]
[lr=0.0005 clips=100 updates=100] 1700 loss = 13.749653 [took 50.359 s]
[lr=0.0005 clips=100 updates=100] 1800 loss = 13.874129 [took 46.874 s]
[lr=0.0005 clips=100 updates=100] 1900 loss = 14.471691 [took 42.907 s]
[lr=0.0005 clips=100 updates=100] 2000 loss = 13.668519 [took 49.962 s]
[dev epoch=0 after=2001] wprec = 0.41848 wrec = 0.06883 wf1 = 0.11822
[dev epoch=0 after=2001] uprec = 0.55100 urec = 0.18472 uf1 = 0.27668
[dev epoch=0 after=2001] lprec = 0.40382 lrec = 0.14518 lf1 = 0.21358 -- savinglibc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Could not write model to tmp/1.7model.sra-1527520332.05
[1]    54727 abort      python segrnn-argid.py
pinouchon commented 6 years ago

Looks like I was simply missing the src/tmp folder.
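That matches the symptom: DyNet's model save raises a C++ runtime_error (rather than a Python exception) when the target directory does not exist, which aborts the process after training has already run for a while. One defensive option is to create the output directory before training starts. A minimal sketch (the tmp/ path and model name are taken from the log above; os.makedirs usage is standard library, not project code):

```python
import os

def ensure_parent_dir(model_path):
    """Create the directory that `model_path` will be written into,
    if it does not already exist, and return that directory."""
    parent = os.path.dirname(model_path)
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)
    return parent

# Called once before training, this would have avoided the abort:
ensure_parent_dir("tmp/1.7model.sra-1527520332.05")
```

Checking (or creating) the save path up front is cheap insurance compared to losing 40 minutes of training to a failed write at the first checkpoint.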