marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

Unhandled St13runtime_error while saving model #321

Closed sullivan510 closed 4 years ago

sullivan510 commented 4 years ago

Hi,

I'm seeing this error while training a Transformer model with Marian. There's plenty of disk space in the target directory. What would resolve this issue?

[2020-03-14 00:11:06] Saving model weights and runtime parameters to /home/susan/FluencyTrainStudent/PreTrainingEpoch8/model.npz.orig.npz
[2020-03-14 00:11:07] Error: Unhandled exception of type 'St13runtime_error': npz_save: error saving to file: /home/susan/FluencyTrainStudent/PreTrainingEpoch8/model.npz.orig.npz
[2020-03-14 00:11:07] Error: Aborted from void unhandledException() in /home/susan/marian-dev/src/common/logging.cpp:113

[CALL STACK]
[0x55ef838b0538] + 0x394538
[0x7f457e065ae6] + 0x92ae6
[0x7f457e065b21] + 0x92b21
[0x7f457e065d54] + 0x92d54
[0x55ef83920bb0] marian::io::saveItemsNpz(std::__cxx11::basic_string<char,std::char_traits,std::allocator> const&, std::vector<marian::io::Item,std::allocator> const&) + 0x3210
[0x55ef83921153] marian::io::saveItems(std::__cxx11::basic_string<char,std::char_traits,std::allocator> const&, std::vector<marian::io::Item,std::allocator> const&) + 0x373
[0x55ef83bdea88] marian::EncoderDecoder::save(std::shared_ptr, std::__cxx11::basic_string<char,std::char_traits,std::allocator> const&, bool) + 0x168
[0x55ef83b3f07c] marian::models::Trainer::save(std::shared_ptr, std::__cxx11::basic_string<char,std::char_traits,std::allocator> const&, bool) + 0x4c
[0x55ef83c33b57] marian::SyncGraphGroup::save(bool) + 0xbb7
[0x55ef83c30bf4] marian::SyncGraphGroup::update(std::vector<std::shared_ptr,std::allocator<std::shared_ptr>>, unsigned long) + 0x604
[0x55ef83c32d93] marian::SyncGraphGroup::update(std::shared_ptr) + 0x283
[0x55ef8387d24f] marian::Train::run() + 0x6ff
[0x55ef8379ac91] mainTrainer(int, char**) + 0x221
[0x55ef8375c825] main + 0x35
[0x7f457d64db97] __libc_start_main + 0xe7
[0x55ef83798ffa] _start + 0x2a

Aborted (core dumped)

emjotde commented 4 years ago

Hi, do you have a larger log? Did saving succeed earlier? If yes, make sure your disk isn't full :)

emjotde commented 4 years ago

Oh, you just said there is space. Then it's very odd. Do you have a log?

sullivan510 commented 4 years ago

Here it is. Thank you!

emjotde commented 4 years ago

Not sure the attachment worked here.

sullivan510 commented 4 years ago

train.log

emjotde commented 4 years ago

Can you post the output of df -h?

sullivan510 commented 4 years ago

Filesystem      Size  Used Avail Use% Mounted on
udev            111G     0  111G   0% /dev
tmpfs            23G  896K   23G   1% /run
/dev/sda1       146G  111G   35G  77% /
tmpfs           111G   73M  111G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           111G     0  111G   0% /sys/fs/cgroup
/dev/sda15      105M  3.6M  101M   4% /boot/efi
/dev/sdb1       1.4T   77M  1.4T   1% /mnt
tmpfs            23G  8.0K   23G   1% /run/user/1000
Susan@127.0.0.1:/cygdrive/C/Users/Susan/X2GO~1/S-C4CE~1/spool  2.8T  638G  2.2T  23% /tmp/.x2go-susan/spool/C-susan-50-1584136326_stDXFCE_dp32

emjotde commented 4 years ago

Here's a hunch: shuffling writes the corpus to a temporary file, and that is filling up your disk. You have over 200M sentences and only about 35G free, so that could well be the case. Can you try a quick training run with, say, 1M sentences? Or try --shuffle-in-ram?
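
A rough way to test this hunch before a long run is to compare the size of the training files with the free space on the filesystem that holds the temporary directory. The sketch below is only an illustration in Python: the helper names `corpus_bytes` and `fits_in_tempdir` are hypothetical, `/tmp` is assumed as the temporary directory (check your `--tempdir` setting), and the assumption that the shuffled temporary copy is about as large as the input files is a guess, not something taken from Marian's source.

```python
import os
import shutil


def corpus_bytes(paths):
    """Total on-disk size of the given training files, in bytes."""
    return sum(os.path.getsize(p) for p in paths)


def fits_in_tempdir(paths, tempdir="/tmp", safety_factor=1.5):
    """Return (fits, needed, free) for a shuffled temporary copy of the corpus.

    Assumption: the temporary copy is roughly as large as the input files.
    safety_factor adds headroom for checkpoints and other writes that may
    land on the same filesystem.
    """
    needed = int(corpus_bytes(paths) * safety_factor)
    # disk_usage reports on the filesystem that contains `tempdir`.
    free = shutil.disk_usage(tempdir).free
    return free >= needed, needed, free
```

Calling `fits_in_tempdir(["train.src", "train.trg"])` (file names are placeholders) returns whether the estimate fits, along with the two byte counts. If it does not fit, the options are the ones suggested above: free up space, or avoid the temporary file with --shuffle-in-ram if there is enough memory.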

sullivan510 commented 4 years ago

I moved some files onto another disk to make room, and I no longer see the error. Thank you so much!

emjotde commented 4 years ago

Great. Good luck with the rest. Closing.