sullivan510 closed this issue 4 years ago
Hi, do you have a larger log? Did saving succeed earlier? If yes, make sure your disk isn't full :)
Oh, you just said there is space. Then it's very odd. Do you have a log?
Here it is. Thank you!
Not sure the attachment worked here.
Can you post the output of df -h?
Filesystem      Size  Used Avail Use% Mounted on
udev            111G     0  111G   0% /dev
tmpfs            23G  896K   23G   1% /run
/dev/sda1       146G  111G   35G  77% /
tmpfs           111G   73M  111G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           111G     0  111G   0% /sys/fs/cgroup
/dev/sda15      105M  3.6M  101M   4% /boot/efi
/dev/sdb1       1.4T   77M  1.4T   1% /mnt
tmpfs            23G  8.0K   23G   1% /run/user/1000
Susan@127.0.0.1:/cygdrive/C/Users/Susan/X2GO~1/S-C4CE~1/spool  2.8T  638G  2.2T  23% /tmp/.x2go-susan/spool/C-susan-50-1584136326_stDXFCE_dp32
Here's a hunch: the shuffling step saves the corpus to a temporary file, and that is filling up your disk. You have over 200M sentences and only about 35G free, so that could be the case. Can you try a quick training run with, say, 1M sentences? Or try --shuffle-in-ram?
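A rough way to test this hunch before starting a long run is to compare the size of the training corpus with the free space where temporary files are written. The following is only a sketch: the function name, the `headroom` factor, and the assumption that a shuffled temporary copy needs about as many bytes as the corpus files themselves are illustrative and not taken from Marian. It also assumes the temporary directory is the system default unless you pass one explicitly.

```python
import os
import shutil
import sys
import tempfile


def shuffle_space_report(corpus_paths, temp_dir=None, headroom=1.2):
    """Compare total corpus size with free space in the temp directory.

    Assumption (not from Marian): a shuffled temporary copy takes roughly
    as many bytes as the corpus files; `headroom` adds a safety margin.
    """
    temp_dir = temp_dir or tempfile.gettempdir()
    corpus_bytes = sum(os.path.getsize(p) for p in corpus_paths)
    needed = int(corpus_bytes * headroom)
    free = shutil.disk_usage(temp_dir).free
    return {
        "temp_dir": temp_dir,
        "needed_bytes": needed,
        "free_bytes": free,
        "fits": free >= needed,
    }


if __name__ == "__main__":
    # Usage: python check_space.py corpus.src corpus.trg
    report = shuffle_space_report(sys.argv[1:])
    print(report)
    if not report["fits"]:
        print("Temp directory may fill up; free space or shuffle in RAM.")
```

If `fits` comes back false, that points in the same direction as the suggestions above: make room on the disk holding the temporary files, or avoid the temporary file altogether by shuffling in RAM.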
I moved some files onto another disk to make room, and I no longer see the error. Thank you so much!
Great. Good luck with the rest. Closing.
Hi,
I'm seeing this error while training a Transformer model with Marian. There's plenty of disk space in the target directory. What would resolve this issue?
[2020-03-14 00:11:06] Saving model weights and runtime parameters to /home/susan/FluencyTrainStudent/PreTrainingEpoch8/model.npz.orig.npz
[2020-03-14 00:11:07] Error: Unhandled exception of type 'St13runtime_error': npz_save: error saving to file: /home/susan/FluencyTrainStudent/PreTrainingEpoch8/model.npz.orig.npz
[2020-03-14 00:11:07] Error: Aborted from void unhandledException() in /home/susan/marian-dev/src/common/logging.cpp:113
[CALL STACK]
[0x55ef838b0538] + 0x394538
[0x7f457e065ae6] + 0x92ae6
[0x7f457e065b21] + 0x92b21
[0x7f457e065d54] + 0x92d54
[0x55ef83920bb0] marian::io:: saveItemsNpz (std::__cxx11::basic_string<char,std::char_traits,std::allocator> const&, std::vector<marian::io::Item,std::allocator> const&) + 0x3210
[0x55ef83921153] marian::io:: saveItems (std::__cxx11::basic_string<char,std::char_traits,std::allocator> const&, std::vector<marian::io::Item,std::allocator> const&) + 0x373
[0x55ef83bdea88] marian::EncoderDecoder:: save (std::shared_ptr, std::__cxx11::basic_string<char,std::char_traits,std::allocator> const&, bool) + 0x168
[0x55ef83b3f07c] marian::models::Trainer:: save (std::shared_ptr, std::__cxx11::basic_string<char,std::char_traits,std::allocator> const&, bool) + 0x4c
[0x55ef83c33b57] marian::SyncGraphGroup:: save (bool) + 0xbb7
[0x55ef83c30bf4] marian::SyncGraphGroup:: update (std::vector<std::shared_ptr,std::allocator<std::shared_ptr>>, unsigned long) + 0x604
[0x55ef83c32d93] marian::SyncGraphGroup:: update (std::shared_ptr) + 0x283
[0x55ef8387d24f] marian::Train:: run () + 0x6ff
[0x55ef8379ac91] mainTrainer (int, char**) + 0x221
[0x55ef8375c825] main + 0x35
[0x7f457d64db97] __libc_start_main + 0xe7
[0x55ef83798ffa] _start + 0x2a
Aborted (core dumped)