marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Error in running Marian with a config file #449

Closed afarajian closed 4 years ago

afarajian commented 5 years ago

Hi, I am trying to run Marian with its parameters supplied in a config file (using the --config option), but it fails with an Aborted (core dumped) error. The strange part is that if I pass the same options directly on the command line (as in the Marian-examples), it works without any problem. This suggests that the config parser breaks while parsing the options.

Here is the part of the log where I get the error:

[2019-05-24 11:44:09] [config] Model is being created with Marian v1.7.8 b13ee2c9 2019-05-10 12:16:26 +0200
[2019-05-24 11:44:09] Using synchronous training
[2019-05-24 11:44:09] [data] Loading vocabulary from JSON/Yaml file /home/amin/projects/wmt/vocab.ende.yml
[2019-05-24 11:44:09] [data] Setting vocabulary size for input 0 to 36000
[2019-05-24 11:44:09] [data] Loading vocabulary from JSON/Yaml file /home/amin/projects/wmt/vocab.ende.yml
[2019-05-24 11:44:10] [data] Setting vocabulary size for input 1 to 36000
[2019-05-24 11:44:10] Compiled without MPI support. Falling back to FakeMPIWrapper
[2019-05-24 11:44:10] Error: Unhandled exception of type 'N4YAML18TypedBadConversionIfEE': yaml-cpp: error at line 1, column 1: bad conversion
[2019-05-24 11:44:10] Error: Aborted from void unhandledException() in /home/amin/NLP/tools/NMT/marian-dev/src/common/logging.cpp:107

[CALL STACK]
[0x561a51611326] + 0x16e326
[0x7f0f135cfab6] + 0x92ab6
[0x7f0f135cfaf1] + 0x92af1
[0x7f0f135cfd24] + 0x92d24
[0x561a515374b8] + 0x944b8
[0x561a515da9bb] + 0x1379bb
[0x561a5187f269] + 0x3dc269
[0x561a5159b339] + 0xf8339
[0x561a515f56a2] + 0x1526a2
[0x561a51525002] + 0x82002
[0x561a51502645] + 0x5f645
[0x7f0f1298db97] __libc_start_main + 0xe7
[0x561a5152369a] + 0x8069a

Aborted (core dumped)
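
The exception type in the log is printed in its mangled form, which hides a useful hint. As a minimal sketch, assuming a GCC or Clang toolchain (which provide `abi::__cxa_demangle` in `<cxxabi.h>`), the name can be decoded as shown here. The helper name `demangle` is hypothetical and is not part of Marian or yaml-cpp.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>  // GCC/Clang only: declares abi::__cxa_demangle
#include <memory>
#include <string>

// Hypothetical helper: demangle an Itanium-ABI type name such as the one
// printed in the log above. Returns the input unchanged if demangling fails.
std::string demangle(const char* mangled) {
  assert(mangled != nullptr);
  int status = 0;
  // __cxa_demangle returns a malloc'ed buffer, so it is released with free().
  std::unique_ptr<char, void (*)(void*)> result(
      abi::__cxa_demangle(mangled, nullptr, nullptr, &status), std::free);
  if (status == 0 && result) {
    return std::string(result.get());
  }
  return std::string(mangled);
}
```

Calling `demangle("N4YAML18TypedBadConversionIfEE")` yields `YAML::TypedBadConversion<float>`, so the failing step was a conversion of some YAML value to `float`. The log does not say which option that value belongs to.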

frankseide commented 5 years ago

Which parameter and what was its value?

Any chance you can recompile Marian with debug symbols enabled?

(@Marcin, do we have debug symbols enabled by default for Release builds?)

emjotde commented 4 years ago

Closing due to inactivity.

stribizhev commented 11 months ago

@emjotde I got the same issue. It happened when I tried to re-train a model after adding a few more sentence pairs to the training set, launching training with the previous configuration file and an increased number of epochs. I think it is related to the fact that the original model was trained on 8 GPUs, while this training was launched on just 2 GPUs. I am not sure, though; I will update once I find the root cause.