And another question: when training, I set `--tempdir /media/wangxiuwan/tmp` and I see the log info: `[2019-08-14 15:00:36] [data] Done shuffling 183177554 sentences to temp files`. But when I check the directory /media/wangxiuwan/tmp, there is nothing there. Why? I used to think the tempdir is where the temp files are saved, and that the required disk space is equal to the corpus size. It is clear that I have misunderstood. Can you explain this to me? Thank you very much.
Hi, the temporary files are invisible because they get deleted as soon as they are opened. That way they still exist, but are automatically removed by the OS once the process finishes. That is a relatively fail-proof way to make sure temporary files are not left behind after irregular process termination. The directory is still being used.
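To illustrate, here is a minimal sketch of that unlink-after-open pattern (plain POSIX; the path prefix is made up and this is not the exact Marian implementation):

```cpp
// Minimal sketch of the unlink-after-open trick, assuming POSIX.
#include <cstdlib>   // mkstemp
#include <unistd.h>  // unlink, write, close

int main() {
  char name[] = "/tmp/marian.XXXXXX";  // illustrative template
  int fd = mkstemp(name);              // create and open a unique temp file
  if (fd < 0) return 1;
  unlink(name);                        // drop the directory entry right away
  // The file no longer shows up in /tmp, but fd stays fully usable:
  (void)write(fd, "still writable\n", 15);
  // The OS reclaims the space once fd is closed or the process exits,
  // even after an irregular termination.
  close(fd);
  return 0;
}
```

After the `unlink`, `ls /tmp` shows nothing, yet the process keeps reading and writing the file through its descriptor.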
I would indeed guess that the error is connected to temporary space, so changing to a different folder would be my suggestion. Also, maybe update to the current master of marian-dev; that should be version 1.7.8. There might be better error reporting.
@emjotde Thank you very much. I have updated my Marian version to marian-dev 1.7.8. The training started normally and has lasted for 6 hours. If the error "Error reading from file '.'" is thrown again, I will contact you again.
Great. I am closing this issue then. Feel free to re-open if you still have problems.
But do we know where this comes from? Are we failing to detect errors while writing to the temp file? Then that's a bug.
We changed quite a lot about error reporting, error bits and stream handling between the version that was used and current master. I would suspect a mix of user error, like too little temp space, and bad reporting behavior of the older Marian version in that case. I would consider this closed unless we get information that there is a bug. The version used was from December last year: v1.7.6 02f4af4 2018-12-12.
@emjotde Hi, after 6 hours the error was thrown again. tempdir: /tmp. I think the tempdir has enough space, so I am confused. Error reading from file? Is my server's cache insufficient? If my corpus size is 38G, how can I tell from `free -h` and `df -h` whether the server has enough space to complete the training? Thank you very much! If you need more information, please feel free to contact me.

df -h:

```
(base) work@dbcloud-Super-Server:/tmp$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             63G     0   63G   0% /dev
tmpfs            13G   91M   13G   1% /run
/dev/sda3       3.5T  2.5T  822G  76% /
tmpfs            63G  216K   63G   1% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/sda2       512M  3.7M  509M   1% /boot/efi
tmpfs            13G   28K   13G   1% /run/user/108
tmpfs            13G     0   13G   0% /run/user/1001
tmpfs            13G     0   13G   0% /run/user/0
```
```
(base) work@dbcloud-Super-Server:/tmp$ free -h
              total        used        free      shared  buff/cache   available
Mem:           125G        3.5G         87G         15M         35G        120G
Swap:          3.8G        2.9G        913M
```
train.log `[2019-08-14 15:08:05] [marian] Marian v1.7.8 c65c26d 2019-08-11 18:27:00 +0100 [2019-08-14 15:08:05] [marian] Running on dbcloud-Super-Server as process 18137 with command line: [2019-08-14 15:08:05] [marian] /media/wangxiuwan/marian-dev/build/marian --model /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz --type transformer --train-sets /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.zh /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.en --max-length 100 --vocabs /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_vocab_big/vocab.zh.yml /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_vocab_big/vocab.en.yml --mini-batch-fit -w 6000 --maxi-batch 1000 --early-stopping 40 --cost-type=ce-mean-words --valid-freq 5000 --save-freq 5000 --disp-freq 1000 --valid-metrics ce-mean-words perplexity translation --valid-sets /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.zh /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.en --valid-script-path 'bash /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/scripts/validate_zhen.sh' --valid-translation-output /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/tmxmall_valid_data/valid.en.output --quiet-translation --valid-mini-batch 16 --beam-size 6 --normalize 0.6 --overwrite --keep-best --log /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/train.log --valid-log /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/valid.log --enc-depth 6 --dec-depth 6 --transformer-heads 8 --transformer-postprocess-emb d --transformer-postprocess dan --transformer-dropout 0.1 --label-smoothing 0.1 --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 --tied-embeddings --devices 0 1 2 3 --sync-sgd --seed 1111 --exponential-smoothing [2019-08-14 15:08:05] [config] after-batches: 0 [2019-08-14 15:08:05] [config] after-epochs: 0 [2019-08-14 15:08:05] [config] allow-unk: false [2019-08-14 15:08:05] [config] beam-size: 6 [2019-08-14 15:08:05] [config] bert-class-symbol: "[CLS]" [2019-08-14 15:08:05] [config] bert-mask-symbol: "[MASK]" [2019-08-14 15:08:05] [config] bert-masking-fraction: 0.15 [2019-08-14 15:08:05] [config] bert-sep-symbol: "[SEP]" [2019-08-14 15:08:05] [config] bert-train-type-embeddings: true [2019-08-14 15:08:05] [config] bert-type-vocab-size: 2 [2019-08-14 15:08:05] [config] clip-gemm: 0 [2019-08-14 15:08:05] [config] clip-norm: 5 [2019-08-14 15:08:05] [config] cost-type: ce-mean-words [2019-08-14 15:08:05] [config] cpu-threads: 0 [2019-08-14 15:08:05] [config] data-weighting: "" [2019-08-14 15:08:05] [config] data-weighting-type: sentence [2019-08-14 15:08:05] [config] dec-cell: gru [2019-08-14 15:08:05] [config] dec-cell-base-depth: 2 [2019-08-14 15:08:05] [config] dec-cell-high-depth: 1 [2019-08-14 15:08:05] [config] dec-depth: 6 [2019-08-14 15:08:05] [config] devices: [2019-08-14 15:08:05] [config] - 0 [2019-08-14 15:08:05] [config] - 1 [2019-08-14 15:08:05] [config] - 2 [2019-08-14 15:08:05] [config] - 3 [2019-08-14 15:08:05] [config] dim-emb: 512 [2019-08-14 15:08:05] [config] dim-rnn: 1024 [2019-08-14 15:08:05] [config] dim-vocabs: [2019-08-14 15:08:05] [config] - 0 [2019-08-14 15:08:05] [config] - 0 [2019-08-14 15:08:05] [config] disp-first: 0 [2019-08-14 15:08:05] [config] disp-freq: 1000 [2019-08-14 15:08:05] [config] disp-label-counts: false [2019-08-14 15:08:05] 
[config] dropout-rnn: 0 [2019-08-14 15:08:05] [config] dropout-src: 0 [2019-08-14 15:08:05] [config] dropout-trg: 0 [2019-08-14 15:08:05] [config] dump-config: "" [2019-08-14 15:08:05] [config] early-stopping: 40 [2019-08-14 15:08:05] [config] embedding-fix-src: false [2019-08-14 15:08:05] [config] embedding-fix-trg: false [2019-08-14 15:08:05] [config] embedding-normalization: false [2019-08-14 15:08:05] [config] embedding-vectors: [2019-08-14 15:08:05] [config] [] [2019-08-14 15:08:05] [config] enc-cell: gru [2019-08-14 15:08:05] [config] enc-cell-depth: 1 [2019-08-14 15:08:05] [config] enc-depth: 6 [2019-08-14 15:08:05] [config] enc-type: bidirectional [2019-08-14 15:08:05] [config] exponential-smoothing: 0.0001 [2019-08-14 15:08:05] [config] grad-dropping-momentum: 0 [2019-08-14 15:08:05] [config] grad-dropping-rate: 0 [2019-08-14 15:08:05] [config] grad-dropping-warmup: 100 [2019-08-14 15:08:05] [config] guided-alignment: none [2019-08-14 15:08:05] [config] guided-alignment-cost: mse [2019-08-14 15:08:05] [config] guided-alignment-weight: 0.1 [2019-08-14 15:08:05] [config] ignore-model-config: false [2019-08-14 15:08:05] [config] input-types: [2019-08-14 15:08:05] [config] [] [2019-08-14 15:08:05] [config] interpolate-env-vars: false [2019-08-14 15:08:05] [config] keep-best: true [2019-08-14 15:08:05] [config] label-smoothing: 0.1 [2019-08-14 15:08:05] [config] layer-normalization: false [2019-08-14 15:08:05] [config] learn-rate: 0.0003 [2019-08-14 15:08:05] [config] log: /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/train.log [2019-08-14 15:08:05] [config] log-level: info [2019-08-14 15:08:05] [config] log-time-zone: "" [2019-08-14 15:08:05] [config] lr-decay: 0 [2019-08-14 15:08:05] [config] lr-decay-freq: 50000 [2019-08-14 15:08:05] [config] lr-decay-inv-sqrt: [2019-08-14 15:08:05] [config] - 16000 [2019-08-14 15:08:05] [config] lr-decay-repeat-warmup: false [2019-08-14 15:08:05] [config] lr-decay-reset-optimizer: false [2019-08-14 15:08:05] [config] lr-decay-start: [2019-08-14 15:08:05] [config] - 10 [2019-08-14 15:08:05] [config] - 1 [2019-08-14 15:08:05] [config] lr-decay-strategy: epoch+stalled [2019-08-14 15:08:05] [config] lr-report: true [2019-08-14 15:08:05] [config] lr-warmup: 16000 [2019-08-14 15:08:05] [config] lr-warmup-at-reload: false [2019-08-14 15:08:05] [config] lr-warmup-cycle: false [2019-08-14 15:08:05] [config] lr-warmup-start-rate: 0 [2019-08-14 15:08:05] [config] max-length: 100 [2019-08-14 15:08:05] [config] max-length-crop: false [2019-08-14 15:08:05] [config] max-length-factor: 3 [2019-08-14 15:08:05] [config] maxi-batch: 1000 [2019-08-14 15:08:05] [config] maxi-batch-sort: trg [2019-08-14 15:08:05] [config] mini-batch: 64 [2019-08-14 15:08:05] [config] mini-batch-fit: true [2019-08-14 15:08:05] [config] mini-batch-fit-step: 10 [2019-08-14 15:08:05] [config] mini-batch-overstuff: 1 [2019-08-14 15:08:05] [config] mini-batch-track-lr: false [2019-08-14 15:08:05] [config] mini-batch-understuff: 1 [2019-08-14 15:08:05] [config] mini-batch-warmup: 0 [2019-08-14 15:08:05] [config] mini-batch-words: 0 [2019-08-14 15:08:05] [config] mini-batch-words-ref: 0 [2019-08-14 15:08:05] [config] model: /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz [2019-08-14 15:08:05] [config] multi-loss-type: sum [2019-08-14 15:08:05] [config] multi-node: false [2019-08-14 15:08:05] [config] multi-node-overlap: true [2019-08-14 15:08:05] [config] n-best: false [2019-08-14 15:08:05] [config] no-nccl: false [2019-08-14 
15:08:05] [config] no-reload: false [2019-08-14 15:08:05] [config] no-restore-corpus: false [2019-08-14 15:08:05] [config] no-shuffle: false [2019-08-14 15:08:05] [config] normalize: 0.6 [2019-08-14 15:08:05] [config] num-devices: 0 [2019-08-14 15:08:05] [config] optimizer: adam [2019-08-14 15:08:05] [config] optimizer-delay: 1 [2019-08-14 15:08:05] [config] optimizer-params: [2019-08-14 15:08:05] [config] - 0.9 [2019-08-14 15:08:05] [config] - 0.98 [2019-08-14 15:08:05] [config] - 1e-09 [2019-08-14 15:08:05] [config] overwrite: true [2019-08-14 15:08:05] [config] pretrained-model: "" [2019-08-14 15:08:05] [config] quiet: false [2019-08-14 15:08:05] [config] quiet-translation: true [2019-08-14 15:08:05] [config] relative-paths: false [2019-08-14 15:08:05] [config] right-left: false [2019-08-14 15:08:05] [config] save-freq: 5000 [2019-08-14 15:08:05] [config] seed: 1111 [2019-08-14 15:08:05] [config] shuffle-in-ram: false [2019-08-14 15:08:05] [config] skip: false [2019-08-14 15:08:05] [config] sqlite: "" [2019-08-14 15:08:05] [config] sqlite-drop: false [2019-08-14 15:08:05] [config] sync-sgd: true [2019-08-14 15:08:05] [config] tempdir: /tmp [2019-08-14 15:08:05] [config] tied-embeddings: true [2019-08-14 15:08:05] [config] tied-embeddings-all: false [2019-08-14 15:08:05] [config] tied-embeddings-src: false [2019-08-14 15:08:05] [config] train-sets: [2019-08-14 15:08:05] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.zh [2019-08-14 15:08:05] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.en [2019-08-14 15:08:05] [config] transformer-aan-activation: swish [2019-08-14 15:08:05] [config] transformer-aan-depth: 2 [2019-08-14 15:08:05] [config] transformer-aan-nogate: false [2019-08-14 15:08:05] [config] transformer-decoder-autoreg: self-attention [2019-08-14 15:08:05] [config] transformer-dim-aan: 2048 [2019-08-14 15:08:05] [config] transformer-dim-ffn: 2048 [2019-08-14 15:08:05] [config] transformer-dropout: 0.1 [2019-08-14 15:08:05] [config] transformer-dropout-attention: 0 [2019-08-14 15:08:05] [config] transformer-dropout-ffn: 0 [2019-08-14 15:08:05] [config] transformer-ffn-activation: swish [2019-08-14 15:08:05] [config] transformer-ffn-depth: 2 [2019-08-14 15:08:05] [config] transformer-guided-alignment-layer: last [2019-08-14 15:08:05] [config] transformer-heads: 8 [2019-08-14 15:08:05] [config] transformer-no-projection: false [2019-08-14 15:08:05] [config] transformer-postprocess: dan [2019-08-14 15:08:05] [config] transformer-postprocess-emb: d [2019-08-14 15:08:05] [config] transformer-preprocess: "" [2019-08-14 15:08:05] [config] transformer-tied-layers: [2019-08-14 15:08:05] [config] [] [2019-08-14 15:08:05] [config] transformer-train-position-embeddings: false [2019-08-14 15:08:05] [config] type: transformer [2019-08-14 15:08:05] [config] ulr: false [2019-08-14 15:08:05] [config] ulr-dim-emb: 0 [2019-08-14 15:08:05] [config] ulr-dropout: 0 [2019-08-14 15:08:05] [config] ulr-keys-vectors: "" [2019-08-14 15:08:05] [config] ulr-query-vectors: "" [2019-08-14 15:08:05] [config] ulr-softmax-temperature: 1 [2019-08-14 15:08:05] [config] ulr-trainable-transformation: false [2019-08-14 15:08:05] [config] valid-freq: 5000 [2019-08-14 15:08:05] [config] valid-log: /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/valid.log [2019-08-14 15:08:05] [config] valid-max-length: 1000 [2019-08-14 15:08:05] [config] valid-metrics: [2019-08-14 15:08:05] [config] - ce-mean-words [2019-08-14 15:08:05] [config] - 
perplexity [2019-08-14 15:08:05] [config] - translation [2019-08-14 15:08:05] [config] valid-mini-batch: 16 [2019-08-14 15:08:05] [config] valid-script-path: bash /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/scripts/validate_zhen.sh [2019-08-14 15:08:05] [config] valid-sets: [2019-08-14 15:08:05] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.zh [2019-08-14 15:08:05] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.en [2019-08-14 15:08:05] [config] valid-translation-output: /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/tmxmall_valid_data/valid.en.output [2019-08-14 15:08:05] [config] vocabs: [2019-08-14 15:08:05] [config] - /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_vocab_big/vocab.zh.yml [2019-08-14 15:08:05] [config] - /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_vocab_big/vocab.en.yml [2019-08-14 15:08:05] [config] word-penalty: 0 [2019-08-14 15:08:05] [config] workspace: 6000 [2019-08-14 15:08:05] [config] Model is being created with Marian v1.7.8 c65c26d 2019-08-11 18:27:00 +0100 [2019-08-14 15:08:05] Using synchronous training [2019-08-14 15:08:05] [data] Loading vocabulary from JSON/Yaml file /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_vocab_big/vocab.zh.yml [2019-08-14 15:08:06] [data] Setting vocabulary size for input 0 to 36000 [2019-08-14 15:08:06] [data] Loading vocabulary from JSON/Yaml file /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_vocab_big/vocab.en.yml [2019-08-14 15:08:06] [data] Setting vocabulary size for input 1 to 34366 [2019-08-14 15:08:06] Compiled without MPI support. Falling back to FakeMPIWrapper [2019-08-14 15:08:06] [batching] Collecting statistics for batch fitting with step size 10 [2019-08-14 15:08:08] [memory] Extending reserved space to 6016 MB (device gpu0) [2019-08-14 15:08:09] [memory] Extending reserved space to 6016 MB (device gpu1) [2019-08-14 15:08:10] [memory] Extending reserved space to 6016 MB (device gpu2) [2019-08-14 15:08:10] [memory] Extending reserved space to 6016 MB (device gpu3) [2019-08-14 15:08:10] [comm] Using NCCL 2.4.2 for GPU communication [2019-08-14 15:08:10] [comm] NCCLCommunicator constructed successfully. [2019-08-14 15:08:10] [training] Using 4 GPUs [2019-08-14 15:08:10] [memory] Reserving 305 MB, device gpu0 [2019-08-14 15:08:10] [gpu] 16-bit TensorCores enabled for float32 matrix operations [2019-08-14 15:08:11] [memory] Reserving 305 MB, device gpu0 [2019-08-14 15:08:19] [batching] Done. Typical MB size is 16312 target words [2019-08-14 15:08:20] [memory] Extending reserved space to 6016 MB (device gpu0) [2019-08-14 15:08:20] [memory] Extending reserved space to 6016 MB (device gpu1) [2019-08-14 15:08:20] [memory] Extending reserved space to 6016 MB (device gpu2) [2019-08-14 15:08:20] [memory] Extending reserved space to 6016 MB (device gpu3) [2019-08-14 15:08:20] [comm] Using NCCL 2.4.2 for GPU communication [2019-08-14 15:08:20] [comm] NCCLCommunicator constructed successfully. 
[2019-08-14 15:08:20] [training] Using 4 GPUs [2019-08-14 15:08:20] Training started [2019-08-14 15:08:20] [data] Shuffling data [2019-08-14 15:10:29] [data] Done reading 183177554 sentences [2019-08-14 15:23:10] [data] Done shuffling 183177554 sentences to temp files [2019-08-14 15:23:59] [training] Batches are processed as 1 process(es) x 4 devices/process [2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu2 [2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu1 [2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu0 [2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu3 [2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu3 [2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu2 [2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu1 [2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu0 [2019-08-14 15:23:59] [memory] Reserving 76 MB, device gpu0 [2019-08-14 15:23:59] [memory] Reserving 76 MB, device gpu1 [2019-08-14 15:23:59] [memory] Reserving 76 MB, device gpu2 [2019-08-14 15:23:59] [memory] Reserving 76 MB, device gpu3 [2019-08-14 15:23:59] [memory] Reserving 152 MB, device gpu2 [2019-08-14 15:23:59] [memory] Reserving 152 MB, device gpu1 [2019-08-14 15:23:59] [memory] Reserving 152 MB, device gpu0 [2019-08-14 15:23:59] [memory] Reserving 152 MB, device gpu3 [2019-08-14 15:29:25] Ep. 1 : Up. 1000 : Sen. 430,572 : Cost 8.71003532 : Time 1279.01s : 7530.88 words/s : L.r. 1.8750e-05 [2019-08-14 15:34:51] Ep. 1 : Up. 2000 : Sen. 860,866 : Cost 7.37228441 : Time 326.23s : 29710.28 words/s : L.r. 3.7500e-05 [2019-08-14 15:40:17] Ep. 1 : Up. 3000 : Sen. 1,294,446 : Cost 6.85132647 : Time 325.73s : 29540.07 words/s : L.r. 5.6250e-05 [2019-08-14 15:45:45] Ep. 1 : Up. 4000 : Sen. 1,725,695 : Cost 6.48512745 : Time 327.94s : 29417.02 words/s : L.r. 7.5000e-05 [2019-08-14 15:51:12] Ep. 1 : Up. 5000 : Sen. 2,154,290 : Cost 6.19205332 : Time 326.97s : 29413.35 words/s : L.r. 9.3750e-05 [2019-08-14 15:51:12] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz [2019-08-14 15:51:14] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz [2019-08-14 15:51:15] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz [2019-08-14 15:51:25] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz [2019-08-14 15:51:26] [valid] Ep. 1 : Up. 5000 : ce-mean-words : 5.18464 : new best [2019-08-14 15:51:32] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz [2019-08-14 15:51:33] [valid] Ep. 1 : Up. 5000 : perplexity : 178.508 : new best [2019-08-14 16:01:44] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz [2019-08-14 16:01:45] [valid] Ep. 1 : Up. 5000 : translation : 2.82 : new best [2019-08-14 16:07:14] Ep. 1 : Up. 6000 : Sen. 2,587,084 : Cost 5.97804403 : Time 962.52s : 10096.13 words/s : L.r. 1.1250e-04 [2019-08-14 16:12:44] Ep. 1 : Up. 7000 : Sen. 3,021,933 : Cost 5.73995399 : Time 329.30s : 29539.79 words/s : L.r. 1.3125e-04 [2019-08-14 16:18:10] Ep. 1 : Up. 8000 : Sen. 3,453,915 : Cost 5.47970676 : Time 326.02s : 29519.30 words/s : L.r. 
1.5000e-04 [2019-08-14 16:23:38] Ep. 1 : Up. 9000 : Sen. 3,885,778 : Cost 5.15550852 : Time 328.07s : 29405.33 words/s : L.r. 1.6875e-04 [2019-08-14 16:29:05] Ep. 1 : Up. 10000 : Sen. 4,312,859 : Cost 4.82228327 : Time 327.86s : 29455.02 words/s : L.r. 1.8750e-04 [2019-08-14 16:29:05] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz [2019-08-14 16:29:09] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz [2019-08-14 16:29:13] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz [2019-08-14 16:29:27] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz [2019-08-14 16:29:30] [valid] Ep. 1 : Up. 10000 : ce-mean-words : 3.72119 : new best [2019-08-14 16:29:36] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz [2019-08-14 16:29:38] [valid] Ep. 1 : Up. 10000 : perplexity : 41.3133 : new best [2019-08-14 16:37:04] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz [2019-08-14 16:37:07] [valid] Ep. 1 : Up. 10000 : translation : 11.21 : new best [2019-08-14 16:42:39] Ep. 1 : Up. 11000 : Sen. 4,753,050 : Cost 4.46836233 : Time 813.71s : 11942.40 words/s : L.r. 2.0625e-04 [2019-08-14 16:48:12] Ep. 1 : Up. 12000 : Sen. 5,187,010 : Cost 4.21248579 : Time 332.70s : 29257.87 words/s : L.r. 2.2500e-04 [2019-08-14 16:53:44] Ep. 1 : Up. 13000 : Sen. 5,619,950 : Cost 4.04143047 : Time 332.22s : 29239.11 words/s : L.r. 2.4375e-04 [2019-08-14 16:59:17] Ep. 1 : Up. 14000 : Sen. 6,052,294 : Cost 3.91244173 : Time 332.71s : 29292.93 words/s : L.r. 2.6250e-04 [2019-08-14 17:04:46] Ep. 1 : Up. 15000 : Sen. 6,487,970 : Cost 3.82174230 : Time 329.43s : 29337.17 words/s : L.r. 2.8125e-04 [2019-08-14 17:04:46] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz [2019-08-14 17:04:49] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz [2019-08-14 17:04:52] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz [2019-08-14 17:05:05] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz [2019-08-14 17:05:08] [valid] Ep. 1 : Up. 15000 : ce-mean-words : 2.63367 : new best [2019-08-14 17:05:14] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz [2019-08-14 17:05:17] [valid] Ep. 1 : Up. 15000 : perplexity : 13.9248 : new best [2019-08-14 17:09:30] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz [2019-08-14 17:09:33] [valid] Ep. 1 : Up. 15000 : translation : 20.84 : new best [2019-08-14 17:15:10] Ep. 1 : Up. 16000 : Sen. 6,925,326 : Cost 3.74980569 : Time 623.42s : 15682.17 words/s : L.r. 3.0000e-04 [2019-08-14 17:20:38] Ep. 1 : Up. 17000 : Sen. 
7,357,788 : Cost 3.68259025 : Time 328.55s : 29373.51 words/s : L.r. 2.9104e-04 [2019-08-14 17:26:05] Ep. 1 : Up. 18000 : Sen. 7,788,773 : Cost 3.61811399 : Time 326.90s : 29333.35 words/s : L.r. 2.8284e-04 [2019-08-14 17:31:32] Ep. 1 : Up. 19000 : Sen. 8,215,891 : Cost 3.56085038 : Time 327.35s : 29362.99 words/s : L.r. 2.7530e-04 [2019-08-14 17:37:01] Ep. 1 : Up. 20000 : Sen. 8,641,960 : Cost 3.51233172 : Time 328.51s : 29077.22 words/s : L.r. 2.6833e-04 [2019-08-14 17:37:01] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz [2019-08-14 17:37:04] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz [2019-08-14 17:37:07] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz [2019-08-14 17:37:21] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz [2019-08-14 17:37:23] [valid] Ep. 1 : Up. 20000 : ce-mean-words : 2.31458 : new best [2019-08-14 17:37:29] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz [2019-08-14 17:37:32] [valid] Ep. 1 : Up. 20000 : perplexity : 10.1206 : new best [2019-08-14 17:41:41] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz [2019-08-14 17:41:44] [valid] Ep. 1 : Up. 20000 : translation : 23.69 : new best [2019-08-14 17:47:14] Ep. 1 : Up. 21000 : Sen. 9,068,108 : Cost 3.47109151 : Time 613.25s : 15683.43 words/s : L.r. 2.6186e-04 [2019-08-14 17:52:46] Ep. 1 : Up. 22000 : Sen. 9,508,053 : Cost 3.43533826 : Time 332.24s : 29343.18 words/s : L.r. 2.5584e-04 [2019-08-14 17:58:16] Ep. 1 : Up. 23000 : Sen. 9,945,064 : Cost 3.40507197 : Time 329.95s : 29401.35 words/s : L.r. 2.5022e-04 [2019-08-14 18:03:45] Ep. 1 : Up. 24000 : Sen. 10,368,000 : Cost 3.38107204 : Time 328.61s : 29104.86 words/s : L.r. 2.4495e-04 [2019-08-14 18:09:16] Ep. 1 : Up. 25000 : Sen. 10,800,209 : Cost 3.35002422 : Time 330.99s : 29296.81 words/s : L.r. 2.4000e-04 [2019-08-14 18:09:16] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz [2019-08-14 18:09:20] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz [2019-08-14 18:09:22] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz [2019-08-14 18:09:36] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz [2019-08-14 18:09:38] [valid] Ep. 1 : Up. 25000 : ce-mean-words : 2.14763 : new best [2019-08-14 18:09:44] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz [2019-08-14 18:09:47] [valid] Ep. 1 : Up. 25000 : perplexity : 8.5645 : new best [2019-08-14 18:13:37] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz [2019-08-14 18:13:40] [valid] Ep. 1 : Up. 
25000 : translation : 25.34 : new best [2019-08-14 18:19:13] Ep. 1 : Up. 26000 : Sen. 11,236,544 : Cost 3.32940030 : Time 596.65s : 16186.05 words/s : L.r. 2.3534e-04 [2019-08-14 18:24:45] Ep. 1 : Up. 27000 : Sen. 11,663,576 : Cost 3.30576849 : Time 332.28s : 29219.47 words/s : L.r. 2.3094e-04 [2019-08-14 18:30:17] Ep. 1 : Up. 28000 : Sen. 12,099,503 : Cost 3.28457904 : Time 332.27s : 29206.43 words/s : L.r. 2.2678e-04 [2019-08-14 18:35:50] Ep. 1 : Up. 29000 : Sen. 12,535,422 : Cost 3.26705837 : Time 333.12s : 29155.51 words/s : L.r. 2.2283e-04 [2019-08-14 18:41:24] Ep. 1 : Up. 30000 : Sen. 12,970,064 : Cost 3.25107145 : Time 333.26s : 29033.30 words/s : L.r. 2.1909e-04 [2019-08-14 18:41:24] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz [2019-08-14 18:41:27] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz [2019-08-14 18:41:30] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz [2019-08-14 18:41:43] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz [2019-08-14 18:41:46] [valid] Ep. 1 : Up. 30000 : ce-mean-words : 2.04728 : new best [2019-08-14 18:41:53] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz [2019-08-14 18:41:55] [valid] Ep. 1 : Up. 30000 : perplexity : 7.74683 : new best [2019-08-14 18:45:39] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz [2019-08-14 18:45:42] [valid] Ep. 1 : Up. 30000 : translation : 26.25 : new best [2019-08-14 18:51:14] Ep. 1 : Up. 31000 : Sen. 13,403,037 : Cost 3.23336005 : Time 590.54s : 16521.07 words/s : L.r. 2.1553e-04 [2019-08-14 18:56:47] Ep. 1 : Up. 32000 : Sen. 13,837,498 : Cost 3.22317600 : Time 332.36s : 29167.47 words/s : L.r. 2.1213e-04 [2019-08-14 19:02:20] Ep. 1 : Up. 33000 : Sen. 14,272,712 : Cost 3.21346664 : Time 333.69s : 29228.31 words/s : L.r. 2.0889e-04 [2019-08-14 19:07:52] Ep. 1 : Up. 34000 : Sen. 14,702,637 : Cost 3.20045567 : Time 332.20s : 29039.14 words/s : L.r. 2.0580e-04 [2019-08-14 19:13:26] Ep. 1 : Up. 35000 : Sen. 15,141,594 : Cost 3.18303347 : Time 333.22s : 29171.15 words/s : L.r. 2.0284e-04 [2019-08-14 19:13:26] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz [2019-08-14 19:13:29] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz [2019-08-14 19:13:32] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz [2019-08-14 19:13:46] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz [2019-08-14 19:13:48] [valid] Ep. 1 : Up. 35000 : ce-mean-words : 1.97774 : new best [2019-08-14 19:13:54] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz [2019-08-14 19:13:57] [valid] Ep. 1 : Up. 
35000 : perplexity : 7.22638 : new best [2019-08-14 19:17:51] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz [2019-08-14 19:17:54] [valid] Ep. 1 : Up. 35000 : translation : 26.98 : new best [2019-08-14 19:23:26] Ep. 1 : Up. 36000 : Sen. 15,567,649 : Cost 3.17901897 : Time 600.46s : 16035.94 words/s : L.r. 2.0000e-04 [2019-08-14 19:28:59] Ep. 1 : Up. 37000 : Sen. 16,005,163 : Cost 3.16255951 : Time 333.06s : 29264.98 words/s : L.r. 1.9728e-04 [2019-08-14 19:34:33] Ep. 1 : Up. 38000 : Sen. 16,434,421 : Cost 3.15619040 : Time 333.69s : 28981.12 words/s : L.r. 1.9467e-04 [2019-08-14 19:40:06] Ep. 1 : Up. 39000 : Sen. 16,869,501 : Cost 3.14585876 : Time 333.45s : 28927.82 words/s : L.r. 1.9215e-04 [2019-08-14 19:45:39] Ep. 1 : Up. 40000 : Sen. 17,299,517 : Cost 3.14065385 : Time 333.17s : 28857.12 words/s : L.r. 1.8974e-04 [2019-08-14 19:45:39] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz [2019-08-14 19:45:43] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz [2019-08-14 19:45:47] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz [2019-08-14 19:46:00] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz [2019-08-14 19:46:03] [valid] Ep. 1 : Up. 40000 : ce-mean-words : 1.92773 : new best [2019-08-14 19:46:09] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz [2019-08-14 19:46:12] [valid] Ep. 1 : Up. 40000 : perplexity : 6.87392 : new best [2019-08-14 19:50:01] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz [2019-08-14 19:50:04] [valid] Ep. 1 : Up. 40000 : translation : 27.47 : new best [2019-08-14 19:55:38] Ep. 1 : Up. 41000 : Sen. 17,735,126 : Cost 3.12336731 : Time 598.70s : 16245.60 words/s : L.r. 1.8741e-04 [2019-08-14 20:01:12] Ep. 1 : Up. 42000 : Sen. 18,167,700 : Cost 3.12053299 : Time 334.24s : 28995.10 words/s : L.r. 1.8516e-04 [2019-08-14 20:06:46] Ep. 1 : Up. 43000 : Sen. 18,595,935 : Cost 3.11462998 : Time 333.61s : 29069.59 words/s : L.r. 1.8300e-04 [2019-08-14 20:12:18] Ep. 1 : Up. 44000 : Sen. 19,030,271 : Cost 3.10356069 : Time 332.21s : 29121.34 words/s : L.r. 1.8091e-04 [2019-08-14 20:17:47] Ep. 1 : Up. 45000 : Sen. 19,464,326 : Cost 3.10118794 : Time 328.89s : 29393.89 words/s : L.r. 1.7889e-04 [2019-08-14 20:17:47] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz [2019-08-14 20:17:51] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz [2019-08-14 20:17:54] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz [2019-08-14 20:18:08] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz [2019-08-14 20:18:11] [valid] Ep. 1 : Up. 
45000 : ce-mean-words : 1.88996 : new best [2019-08-14 20:18:17] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz [2019-08-14 20:18:21] [valid] Ep. 1 : Up. 45000 : perplexity : 6.61913 : new best [2019-08-14 20:22:03] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz [2019-08-14 20:22:07] [valid] Ep. 1 : Up. 45000 : translation : 27.86 : new best [2019-08-14 20:27:38] Ep. 1 : Up. 46000 : Sen. 19,896,865 : Cost 3.09044385 : Time 590.47s : 16412.72 words/s : L.r. 1.7693e-04 [2019-08-14 20:33:07] Ep. 1 : Up. 47000 : Sen. 20,326,121 : Cost 3.08517575 : Time 329.21s : 29284.41 words/s : L.r. 1.7504e-04 [2019-08-14 20:38:39] Ep. 1 : Up. 48000 : Sen. 20,757,964 : Cost 3.08068657 : Time 332.40s : 28818.26 words/s : L.r. 1.7321e-04 [2019-08-14 20:44:12] Ep. 1 : Up. 49000 : Sen. 21,189,484 : Cost 3.07408929 : Time 333.21s : 29174.55 words/s : L.r. 1.7143e-04 [2019-08-14 20:49:45] Ep. 1 : Up. 50000 : Sen. 21,623,640 : Cost 3.07062435 : Time 332.32s : 29025.05 words/s : L.r. 1.6971e-04 [2019-08-14 20:49:45] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz [2019-08-14 20:49:49] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz [2019-08-14 20:49:52] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz [2019-08-14 20:50:06] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz [2019-08-14 20:50:09] [valid] Ep. 1 : Up. 50000 : ce-mean-words : 1.85912 : new best [2019-08-14 20:50:15] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz [2019-08-14 20:50:18] [valid] Ep. 1 : Up. 50000 : perplexity : 6.41809 : new best [2019-08-14 20:53:59] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz [2019-08-14 20:54:02] [valid] Ep. 1 : Up. 50000 : translation : 28.23 : new best [2019-08-14 20:59:35] Ep. 1 : Up. 51000 : Sen. 22,055,283 : Cost 3.05950332 : Time 590.78s : 16367.92 words/s : L.r. 1.6803e-04 [2019-08-14 21:05:09] Ep. 1 : Up. 52000 : Sen. 22,487,320 : Cost 3.05860353 : Time 333.43s : 29001.64 words/s : L.r. 1.6641e-04 [2019-08-14 21:10:40] Ep. 1 : Up. 53000 : Sen. 22,916,589 : Cost 3.05387068 : Time 331.30s : 28959.10 words/s : L.r. 1.6483e-04 [2019-08-14 21:16:13] Ep. 1 : Up. 54000 : Sen. 23,348,763 : Cost 3.04580259 : Time 332.33s : 29114.50 words/s : L.r. 1.6330e-04 [2019-08-14 21:21:44] Ep. 1 : Up. 55000 : Sen. 23,778,911 : Cost 3.04406929 : Time 331.78s : 29113.74 words/s : L.r. 
1.6181e-04 [2019-08-14 21:21:44] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz [2019-08-14 21:21:48] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz [2019-08-14 21:21:51] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz [2019-08-14 21:22:05] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz [2019-08-14 21:22:08] [valid] Ep. 1 : Up. 55000 : ce-mean-words : 1.83448 : new best [2019-08-14 21:22:14] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz [2019-08-14 21:22:17] [valid] Ep. 1 : Up. 55000 : perplexity : 6.26187 : new best [2019-08-14 21:26:05] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz [2019-08-14 21:26:09] [valid] Ep. 1 : Up. 55000 : translation : 28.47 : new best [2019-08-14 21:29:53] Error: Error reading from file '.' [2019-08-14 21:29:53] Error: Aborted from marian::io::InputFileStream& marian::io::getline(marian::io::InputFileStream&, std::__cxx11::string&) in /media/wangxiuwan/marian-dev/src/common/file_stream.h:216
[CALL STACK]
[0x5cd0e2]
[0x5ce1c8]
[0x5bc0cf]
[0x51e62d]
[0x51f67b]
[0x52005e]
[0x441459]
[0x7f5dff4c1a99] + 0xea99
[0x43ad72]
[0x442881]
[0x46b684]
[0x7f5e0c686678] + 0xb8678
[0x7f5dff4ba6ba] + 0x76ba
[0x7f5dfecdf41d] clone + 0x6d
Can we first establish where the “error reading from” comes from? I suspect it is reading a chopped file where the last line does not end in a newline character, and (hopefully) the code keeps reading until it finds a newline. We may need to temporarily change the code to not delete the tmp file, so that we can inspect it.
Then we should establish why the file is truncated. Is the disk full during write, but we don’t catch that, or is it some strange OS-level caching that messes things up (Windows used to cache data written to network and sometimes failed to flush the cache, causing corrupt files; maybe Linux or your version of drivers does some nasty things like this as well).
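For reference, generic std::getline already copes with a missing final newline: the partial last line is still returned and only eofbit is set. A minimal demo (generic C++ behavior, not Marian-specific):

```cpp
// Demo: std::getline on input whose last line lacks a trailing newline.
// The chopped last line is still returned; afterwards only eofbit is set,
// and badbit stays clear.
#include <iostream>
#include <sstream>
#include <string>

int main() {
  std::istringstream in("complete line\ntruncated line");  // no final LF
  std::string line;
  while (std::getline(in, line))
    std::cout << '[' << line << "]\n";  // prints both lines
  std::cout << "eof=" << in.eof() << " bad=" << in.bad() << '\n';  // eof=1 bad=0
}
```

So a mere truncation mid-line would silently yield a chopped sentence; a hard abort like the one above presumably means Marian's wrapper saw the stream in a failed state, not just a short last line.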
@frankseide I think what you said is very reasonable. Let me try. Can you tell me where the code that deletes the temp files is, and what change I need to make? I have searched globally in Marian's code, but I am not sure where to make changes.
@frankseide Maybe you can tell me what the rule for the line terminator is: how do I check whether the last line ends in a newline character? Then I can check my corpus to see whether there is a sentence pair with no terminator.
The file should end in a LF character, ASCII code 10, hex 0x0a.
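A quick way to verify is to look at the file's very last byte; an illustrative standalone check (any equivalent tool works):

```cpp
// Check whether a file ends in LF (ASCII 10, hex 0x0a).
#include <fstream>
#include <iostream>

int main(int argc, char** argv) {
  if (argc < 2) { std::cerr << "usage: checklf <file>\n"; return 2; }
  std::ifstream in(argv[1], std::ios::binary);
  in.seekg(-1, std::ios::end);  // jump to the very last byte
  char last = 0;
  in.get(last);
  std::cout << (last == '\n' ? "ends in LF\n" : "missing trailing LF!\n");
  return last == '\n' ? 0 : 1;
}
```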
@frankseide Thank you very much. You said that we may need to temporarily change the code to not delete the tmp file, so that we can inspect it. Where would I make that change? Is it the training parameter `--shuffle-in-ram` ("Keep shuffled corpus in RAM, do not write to temp file")?
Hi again. Interesting. I have a couple of questions:

1. Can you try training with `--shuffle-in-ram`? That will not use the temporary file, and if the error does not occur again we have a hint that it is indeed the temporary file.
2. How did you compile Marian?
3. Can you run `df -h /tmp` before and during training and post the results?

A comment to the first question: the `--sqlite` option should also skip using temporary files, and use a SQLite DB file for storing and shuffling the data.
Let's ignore `--sqlite` for the moment. By default the SQLite database is also created in the temporary folder, and it is larger than the raw text, so if there are some weird hidden space problems it would not really help.
@emjotde @snukky Thank you both for your replies. Currently I have restarted the training with `--shuffle-in-ram`. It has seemed normal for three hours; I'm not sure whether the error will appear again. About emjotde's two other questions:
```
mkdir build
cd build
cmake .. -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.2 \
    -DOPENSSL_ROOT_DIR=/usr/local/ssl -DOPENSSL_LIBRARIES=/usr/local/ssl/lib \
    -DBOOST_ROOT=/media/wangxiuwan/boost_1_65_1
make -j
```
```
(base) work@dbcloud-Super-Server:~$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       3.5T  2.5T  822G  76% /
```
OK, this is a lot of space. So it should not be a space problem.
Can you please add `-DCMAKE_BUILD_TYPE=Release` to your cmake command, like this:

```
cmake .. -DCMAKE_BUILD_TYPE=Release -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.2 \
    -DOPENSSL_ROOT_DIR=/usr/local/ssl -DOPENSSL_LIBRARIES=/usr/local/ssl/lib \
    -DBOOST_ROOT=/media/wangxiuwan/boost_1_65_1
```
We made this the default yesterday, but you might not have that version yet. This will compile with function names, so the stack trace should be more informative.
In any case, running out of space should have triggered ENOSPC on write, assuming proper error checking.
"assuming proper error checking", well, that's a strong assumption :)
One bug is that we do not set the file path in the input stream when handing in a temporary file. That at least explains why the error message says '.' (the default Pathie path) instead of the proper temporary filename from `tempnam`. So this seems to make Frank's ideas more likely.
I'll add an option later today to keep temporary files and fix the name issue. Will let you know when it's ready to try.
Branch `tempfile` now has an option `--keep-temp` which will keep the temporary files inside the folder instead of unlinking them. I also fixed the name handling, so the error should now tell you which of the temporary files failed. We should probably also add a line counter and include it in the error message; something like the sketch below.
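As a sketch of that line counter (the class name and exact message are illustrative, not Marian's actual io API):

```cpp
// A getline wrapper that carries the file name and line number into the
// error message, distinguishing a hard I/O error from a clean EOF.
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

class CountingReader {
  std::ifstream in_;
  std::string path_;
  size_t lineNo_ = 0;

public:
  explicit CountingReader(const std::string& path) : in_(path), path_(path) {}

  bool getline(std::string& line) {
    if (std::getline(in_, line)) { ++lineNo_; return true; }
    if (in_.bad()) {  // a real I/O error, not a clean end of file
      std::ostringstream msg;
      msg << "Error reading from file '" << path_ << "' near line " << (lineNo_ + 1);
      throw std::runtime_error(msg.str());
    }
    return false;  // clean EOF
  }
};
```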
@emjotde Hi, the training is still going on and no error has been reported so far. It seems that the error is really related to the temp files. When the training is completed, I will rebuild with `-DCMAKE_BUILD_TYPE=Release` and try branch `tempfile`, and I will post the result.
Hi, the training is still going on, but I think it's a bit strange: the training has only reached epoch 2, and the BLEU won't rise any more. I planned to train for 7 days, but this happened after only two days. Can you give me some advice on the reasons?
Looks quite normal to me. That's a lot of iterations; I would not expect the score to keep improving. Do you have reason to believe that the results are bad?
@emjotde Hi emjotde, I am sorry to tell you that our server's hard drive broke some time ago and we lost a lot of corpora, including the corpus related to this issue. So the work with branch `tempfile` can only start after we regenerate the corpus, which may take a long time. When there is a result, I will upload it.
And about the training results: we trained transformer models with both Marian and tensor2tensor on the same corpus. However, the max BLEU of Marian is 32 while the max BLEU of tensor2tensor is 41. That is why I believe the Marian training results are bad.
The following is the training curve of tensor2tensor.
Closing this now due to inactivity (ours). Feel free to reopen. Usually we have no problems matching T2T performance, so no idea where that would come from. On the other hand, there were a few bugs in the marian-dev code around that time. Maybe it has resolved itself.
Hi, I had no problems with Marian training before (Chinese-English), but recently I switched to a larger training corpus (train.bpe.zh: 18G, train.bpe.en: 20G, 38G in total), and it keeps irregularly throwing out this error during training. Why? And what should I do to train on this corpus normally? Thank you very much. During training, free -h:

```
              total        used        free      shared  buff/cache   available
Mem:           125G         53G        2.6G         48M         69G         71G
Swap:          3.8G        2.9G        925M
```
nvidia-smi:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54                 Driver Version: 396.54                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 59%   80C    P2   268W / 250W |  7205MiB / 11178MiB  |     78%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 64%   83C    P2   227W / 250W |  7205MiB / 11178MiB  |     70%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 60%   81C    P2   217W / 250W |  7205MiB / 11178MiB  |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 66%   83C    P2   294W / 250W |  7205MiB / 11178MiB  |     79%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     10283      C   /media/wangxiuwan/marian/build/marian       7195MiB |
|    1     10283      C   /media/wangxiuwan/marian/build/marian       7195MiB |
|    2     10283      C   /media/wangxiuwan/marian/build/marian       7195MiB |
|    3     10283      C   /media/wangxiuwan/marian/build/marian       7195MiB |
+-----------------------------------------------------------------------------+
```
train.log: [2019-08-14 10:55:34] [marian] Marian v1.7.6 02f4af4 2018-12-12 18:51:10 -0800 [2019-08-14 10:55:34] [marian] Running on dbcloud-Super-Server as process 31002 with command line: [2019-08-14 10:55:34] [marian] /media/wangxiuwan/marian/build/marian --model /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz --type transformer --pretrained-model /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz --train-sets /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.zh /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.en --max-length 100 --vocabs /media/wangxiuwan/marian/examples/transformer/back_dataset/model_vocab_big/vocab.zh.yml /media/wangxiuwan/marian/examples/transformer/back_dataset/model_vocab_big/vocab.en.yml --mini-batch-fit -w 6000 --maxi-batch 1000 --early-stopping 40 --cost-type=ce-mean-words --valid-freq 5000 --save-freq 5000 --disp-freq 1000 --valid-metrics ce-mean-words perplexity translation --valid-sets /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.zh /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.en --valid-script-path 'bash /media/wangxiuwan/marian/examples/transformer/back_dataset/scripts/validate_zhen.sh' --valid-translation-output /media/wangxiuwan/marian/examples/transformer/back_dataset/tmxmall_valid_data/valid.en.output --quiet-translation --valid-mini-batch 16 --beam-size 6 --normalize 0.6 --overwrite --keep-best --log /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/train.log --valid-log /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/valid.log --enc-depth 6 --dec-depth 6 --transformer-heads 8 --transformer-postprocess-emb d --transformer-postprocess dan --transformer-dropout 0.1 --label-smoothing 0.1 --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 --tied-embeddings --devices 0 1 2 3 --sync-sgd --seed 1111 --exponential-smoothing [2019-08-14 10:55:34] [config] after-batches: 0 [2019-08-14 10:55:34] [config] after-epochs: 0 [2019-08-14 10:55:34] [config] allow-unk: false [2019-08-14 10:55:34] [config] beam-size: 6 [2019-08-14 10:55:34] [config] best-deep: false [2019-08-14 10:55:34] [config] clip-gemm: 0 [2019-08-14 10:55:34] [config] clip-norm: 5 [2019-08-14 10:55:34] [config] cost-type: ce-mean-words [2019-08-14 10:55:34] [config] cpu-threads: 0 [2019-08-14 10:55:34] [config] data-weighting-type: sentence [2019-08-14 10:55:34] [config] dec-cell: gru [2019-08-14 10:55:34] [config] dec-cell-base-depth: 2 [2019-08-14 10:55:34] [config] dec-cell-high-depth: 1 [2019-08-14 10:55:34] [config] dec-depth: 6 [2019-08-14 10:55:34] [config] devices: [2019-08-14 10:55:34] [config] - 0 [2019-08-14 10:55:34] [config] - 1 [2019-08-14 10:55:34] [config] - 2 [2019-08-14 10:55:34] [config] - 3 [2019-08-14 10:55:34] [config] dim-emb: 512 [2019-08-14 10:55:34] [config] dim-rnn: 1024 [2019-08-14 10:55:34] [config] dim-vocabs: [2019-08-14 10:55:34] [config] - 36000 [2019-08-14 10:55:34] [config] - 34366 [2019-08-14 10:55:34] [config] disp-first: 0 [2019-08-14 10:55:34] [config] disp-freq: 1000 [2019-08-14 10:55:34] [config] disp-label-counts: false [2019-08-14 10:55:34] [config] dropout-rnn: 0 [2019-08-14 10:55:34] [config] dropout-src: 0 [2019-08-14 10:55:34] [config] dropout-trg: 0 [2019-08-14 10:55:34] [config] early-stopping: 40 [2019-08-14 10:55:34] [config] embedding-fix-src: false [2019-08-14 10:55:34] [config] embedding-fix-trg: false 
[2019-08-14 10:55:34] [config] embedding-normalization: false
[2019-08-14 10:55:34] [config] enc-cell: gru
[2019-08-14 10:55:34] [config] enc-cell-depth: 1
[2019-08-14 10:55:34] [config] enc-depth: 6
[2019-08-14 10:55:34] [config] enc-type: bidirectional
[2019-08-14 10:55:34] [config] exponential-smoothing: 0.0001
[2019-08-14 10:55:34] [config] grad-dropping-momentum: 0
[2019-08-14 10:55:34] [config] grad-dropping-rate: 0
[2019-08-14 10:55:34] [config] grad-dropping-warmup: 100
[2019-08-14 10:55:34] [config] guided-alignment: none
[2019-08-14 10:55:34] [config] guided-alignment-cost: mse
[2019-08-14 10:55:34] [config] guided-alignment-weight: 0.1
[2019-08-14 10:55:34] [config] ignore-model-config: false
[2019-08-14 10:55:34] [config] interpolate-env-vars: false
[2019-08-14 10:55:34] [config] keep-best: true
[2019-08-14 10:55:34] [config] label-smoothing: 0.1
[2019-08-14 10:55:34] [config] layer-normalization: false
[2019-08-14 10:55:34] [config] learn-rate: 0.0003
[2019-08-14 10:55:34] [config] log: /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/train.log
[2019-08-14 10:55:34] [config] log-level: info
[2019-08-14 10:55:34] [config] lr-decay: 0
[2019-08-14 10:55:34] [config] lr-decay-freq: 50000
[2019-08-14 10:55:34] [config] lr-decay-inv-sqrt: 16000
[2019-08-14 10:55:34] [config] lr-decay-repeat-warmup: false
[2019-08-14 10:55:34] [config] lr-decay-reset-optimizer: false
[2019-08-14 10:55:34] [config] lr-decay-start:
[2019-08-14 10:55:34] [config] - 10
[2019-08-14 10:55:34] [config] - 1
[2019-08-14 10:55:34] [config] lr-decay-strategy: epoch+stalled
[2019-08-14 10:55:34] [config] lr-report: true
[2019-08-14 10:55:34] [config] lr-warmup: 16000
[2019-08-14 10:55:34] [config] lr-warmup-at-reload: false
[2019-08-14 10:55:34] [config] lr-warmup-cycle: false
[2019-08-14 10:55:34] [config] lr-warmup-start-rate: 0
[2019-08-14 10:55:34] [config] max-length: 100
[2019-08-14 10:55:34] [config] max-length-crop: false
[2019-08-14 10:55:34] [config] max-length-factor: 3
[2019-08-14 10:55:34] [config] maxi-batch: 1000
[2019-08-14 10:55:34] [config] maxi-batch-sort: trg
[2019-08-14 10:55:34] [config] mini-batch: 64
[2019-08-14 10:55:34] [config] mini-batch-fit: true
[2019-08-14 10:55:34] [config] mini-batch-fit-step: 10
[2019-08-14 10:55:34] [config] mini-batch-words: 0
[2019-08-14 10:55:34] [config] model: /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 10:55:34] [config] multi-node: false
[2019-08-14 10:55:34] [config] multi-node-overlap: true
[2019-08-14 10:55:34] [config] n-best: false
[2019-08-14 10:55:34] [config] no-nccl: false
[2019-08-14 10:55:34] [config] no-reload: false
[2019-08-14 10:55:34] [config] no-restore-corpus: false
[2019-08-14 10:55:34] [config] no-shuffle: false
[2019-08-14 10:55:34] [config] normalize: 0.6
[2019-08-14 10:55:34] [config] optimizer: adam
[2019-08-14 10:55:34] [config] optimizer-delay: 1
[2019-08-14 10:55:34] [config] optimizer-params:
[2019-08-14 10:55:34] [config] - 0.9
[2019-08-14 10:55:34] [config] - 0.98
[2019-08-14 10:55:34] [config] - 1e-09
[2019-08-14 10:55:34] [config] overwrite: true
[2019-08-14 10:55:34] [config] pretrained-model: /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 10:55:34] [config] quiet: false
[2019-08-14 10:55:34] [config] quiet-translation: true
[2019-08-14 10:55:34] [config] relative-paths: false
[2019-08-14 10:55:34] [config] right-left: false
[2019-08-14 10:55:34] [config] save-freq: 5000
[2019-08-14 10:55:34] [config] seed: 1111
[2019-08-14 10:55:34] [config] shuffle-in-ram: false
[2019-08-14 10:55:34] [config] skip: false
[2019-08-14 10:55:34] [config] sqlite: ""
[2019-08-14 10:55:34] [config] sqlite-drop: false
[2019-08-14 10:55:34] [config] sync-sgd: true
[2019-08-14 10:55:34] [config] tempdir: /tmp
[2019-08-14 10:55:34] [config] tied-embeddings: true
[2019-08-14 10:55:34] [config] tied-embeddings-all: false
[2019-08-14 10:55:34] [config] tied-embeddings-src: false
[2019-08-14 10:55:34] [config] train-sets:
[2019-08-14 10:55:34] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.zh
[2019-08-14 10:55:34] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.en
[2019-08-14 10:55:34] [config] transformer-aan-activation: swish
[2019-08-14 10:55:34] [config] transformer-aan-depth: 2
[2019-08-14 10:55:34] [config] transformer-aan-nogate: false
[2019-08-14 10:55:34] [config] transformer-decoder-autoreg: self-attention
[2019-08-14 10:55:34] [config] transformer-dim-aan: 2048
[2019-08-14 10:55:34] [config] transformer-dim-ffn: 2048
[2019-08-14 10:55:34] [config] transformer-dropout: 0.1
[2019-08-14 10:55:34] [config] transformer-dropout-attention: 0
[2019-08-14 10:55:34] [config] transformer-dropout-ffn: 0
[2019-08-14 10:55:34] [config] transformer-ffn-activation: swish
[2019-08-14 10:55:34] [config] transformer-ffn-depth: 2
[2019-08-14 10:55:34] [config] transformer-guided-alignment-layer: last
[2019-08-14 10:55:34] [config] transformer-heads: 8
[2019-08-14 10:55:34] [config] transformer-no-projection: false
[2019-08-14 10:55:34] [config] transformer-postprocess: dan
[2019-08-14 10:55:34] [config] transformer-postprocess-emb: d
[2019-08-14 10:55:34] [config] transformer-preprocess: ""
[2019-08-14 10:55:34] [config] transformer-tied-layers:
[2019-08-14 10:55:34] [config] []
[2019-08-14 10:55:34] [config] type: transformer
[2019-08-14 10:55:34] [config] ulr: false
[2019-08-14 10:55:34] [config] ulr-dim-emb: 0
[2019-08-14 10:55:34] [config] ulr-dropout: 0
[2019-08-14 10:55:34] [config] ulr-keys-vectors: ""
[2019-08-14 10:55:34] [config] ulr-query-vectors: ""
[2019-08-14 10:55:34] [config] ulr-softmax-temperature: 1
[2019-08-14 10:55:34] [config] ulr-trainable-transformation: false
[2019-08-14 10:55:34] [config] valid-freq: 5000
[2019-08-14 10:55:34] [config] valid-log: /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/valid.log
[2019-08-14 10:55:34] [config] valid-max-length: 1000
[2019-08-14 10:55:34] [config] valid-metrics:
[2019-08-14 10:55:34] [config] - ce-mean-words
[2019-08-14 10:55:34] [config] - perplexity
[2019-08-14 10:55:34] [config] - translation
[2019-08-14 10:55:34] [config] valid-mini-batch: 16
[2019-08-14 10:55:34] [config] valid-script-path: bash /media/wangxiuwan/marian/examples/transformer/back_dataset/scripts/validate_zhen.sh
[2019-08-14 10:55:34] [config] valid-sets:
[2019-08-14 10:55:34] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.zh
[2019-08-14 10:55:34] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.en
[2019-08-14 10:55:34] [config] valid-translation-output: /media/wangxiuwan/marian/examples/transformer/back_dataset/tmxmall_valid_data/valid.en.output
[2019-08-14 10:55:34] [config] version: v1.7.6 02f4af4 2018-12-12 18:51:10 -0800
[2019-08-14 10:55:34] [config] vocabs:
[2019-08-14 10:55:34] [config] - /media/wangxiuwan/marian/examples/transformer/back_dataset/model_vocab_big/vocab.zh.yml
[2019-08-14 10:55:34] [config] - /media/wangxiuwan/marian/examples/transformer/back_dataset/model_vocab_big/vocab.en.yml
[2019-08-14 10:55:34] [config] word-penalty: 0
[2019-08-14 10:55:34] [config] workspace: 6000
[2019-08-14 10:55:34] [config] Loaded model has been created with Marian v1.7.6 02f4af4 2018-12-12 18:51:10 -0800
[2019-08-14 10:55:34] Using synchronous training
[2019-08-14 10:55:34] [data] Loading vocabulary from JSON/Yaml file /media/wangxiuwan/marian/examples/transformer/back_dataset/model_vocab_big/vocab.zh.yml
[2019-08-14 10:55:34] [data] Setting vocabulary size for input 0 to 36000
[2019-08-14 10:55:34] [data] Loading vocabulary from JSON/Yaml file /media/wangxiuwan/marian/examples/transformer/back_dataset/model_vocab_big/vocab.en.yml
[2019-08-14 10:55:34] [data] Setting vocabulary size for input 1 to 34366
[2019-08-14 10:55:34] [batching] Collecting statistics for batch fitting with step size 10
[2019-08-14 10:55:34] Compiled without MPI support. Falling back to FakeMPIWrapper
[2019-08-14 10:55:36] [memory] Extending reserved space to 6016 MB (device gpu0)
[2019-08-14 10:55:36] [memory] Extending reserved space to 6016 MB (device gpu1)
[2019-08-14 10:55:37] [memory] Extending reserved space to 6016 MB (device gpu2)
[2019-08-14 10:55:37] [memory] Extending reserved space to 6016 MB (device gpu3)
[2019-08-14 10:55:37] [comm] Using NCCL 2.3.7 for GPU communication
[2019-08-14 10:55:37] [memory] Reserving 305 MB, device gpu0
[2019-08-14 10:55:38] [memory] Reserving 305 MB, device gpu0
[2019-08-14 10:55:46] [batching] Done
[2019-08-14 10:55:47] [memory] Extending reserved space to 6016 MB (device gpu0)
[2019-08-14 10:55:47] [memory] Extending reserved space to 6016 MB (device gpu1)
[2019-08-14 10:55:47] [memory] Extending reserved space to 6016 MB (device gpu2)
[2019-08-14 10:55:47] [memory] Extending reserved space to 6016 MB (device gpu3)
[2019-08-14 10:55:47] [comm] Using NCCL 2.3.7 for GPU communication
[2019-08-14 10:55:47] Loading model from /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 10:55:47] Loading model from /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 10:55:48] Loading model from /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 10:55:48] Loading model from /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 10:55:49] Loading Adam parameters from /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 10:55:50] [memory] Reserving 152 MB, device gpu0
[2019-08-14 10:55:50] [memory] Reserving 152 MB, device gpu1
[2019-08-14 10:55:50] [memory] Reserving 152 MB, device gpu2
[2019-08-14 10:55:50] [memory] Reserving 152 MB, device gpu3
[2019-08-14 10:55:50] [data] Restoring the corpus state to epoch 1, batch 65000
[2019-08-14 10:55:50] [data] Shuffling files
[2019-08-14 11:00:34] [data] Done reading 183177554 sentences
[2019-08-14 11:13:07] [data] Done shuffling 183177554 sentences to temp files
[2019-08-14 11:22:06] Training started
[2019-08-14 11:22:06] [memory] Reserving 305 MB, device gpu0
[2019-08-14 11:22:07] [memory] Reserving 305 MB, device gpu2
[2019-08-14 11:22:07] [memory] Reserving 305 MB, device gpu1
[2019-08-14 11:22:07] [memory] Reserving 305 MB, device gpu3
[2019-08-14 11:22:07] Loading model from /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 11:22:10] [memory] Reserving 305 MB, device cpu0
[2019-08-14 11:22:10] [memory] Reserving 76 MB, device gpu0
[2019-08-14 11:22:10] [memory] Reserving 76 MB, device gpu1
[2019-08-14 11:22:10] [memory] Reserving 76 MB, device gpu2
[2019-08-14 11:22:10] [memory] Reserving 76 MB, device gpu3
[2019-08-14 11:22:10] [memory] Reserving 305 MB, device gpu3
[2019-08-14 11:22:10] [memory] Reserving 305 MB, device gpu2
[2019-08-14 11:22:10] [memory] Reserving 305 MB, device gpu1
[2019-08-14 11:22:10] [memory] Reserving 305 MB, device gpu0
[2019-08-14 11:27:43] Ep. 1 : Up. 66000 : Sen. 16,738,027 : Cost 3.13637638 : Time 1916.10s : 5001.61 words/s : L.r. 1.4771e-04
[2019-08-14 11:33:18] Ep. 1 : Up. 67000 : Sen. 17,173,219 : Cost 3.11160016 : Time 335.53s : 28835.85 words/s : L.r. 1.4660e-04
[2019-08-14 11:38:55] Ep. 1 : Up. 68000 : Sen. 17,606,919 : Cost 3.10447025 : Time 337.14s : 28981.41 words/s : L.r. 1.4552e-04
[2019-08-14 11:44:31] Ep. 1 : Up. 69000 : Sen. 18,040,301 : Cost 3.09355903 : Time 336.26s : 28664.33 words/s : L.r. 1.4446e-04
[2019-08-14 11:50:07] Ep. 1 : Up. 70000 : Sen. 18,465,808 : Cost 3.09279132 : Time 335.46s : 28678.63 words/s : L.r. 1.4343e-04
[2019-08-14 11:50:07] Saving model weights and runtime parameters to /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 11:50:11] Saving model weights and runtime parameters to /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 11:50:14] Saving Adam parameters to /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 11:50:28] Saving model weights and runtime parameters to /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 11:50:31] [valid] Ep. 1 : Up. 70000 : ce-mean-words : 1.93185 : new best
[2019-08-14 11:50:37] Saving model weights and runtime parameters to /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 11:50:40] [valid] Ep. 1 : Up. 70000 : perplexity : 6.90226 : new best
[2019-08-14 11:53:29] Saving model weights and runtime parameters to /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 11:53:32] [valid] Ep. 1 : Up. 70000 : translation : 26.8 : new best
[2019-08-14 11:59:09] Ep. 1 : Up. 71000 : Sen. 18,895,607 : Cost 3.08209920 : Time 542.52s : 17812.32 words/s : L.r. 1.4241e-04
[2019-08-14 12:04:48] Ep. 1 : Up. 72000 : Sen. 19,330,863 : Cost 3.07790542 : Time 338.44s : 28752.45 words/s : L.r. 1.4142e-04
[2019-08-14 12:08:34] Error: Error reading from file '.'
[2019-08-14 12:08:34] Error: Aborted from marian::io::InputFileStream& marian::io::getline(marian::io::InputFileStream&, std::__cxx11::string&) in /media/wangxiuwan/marian/src/common/file_stream.h:218
[CALL STACK] [0x5b3f82]
[0x5b49f5]
[0x5a58cf]
[0x51638d]
[0x5171cb]
[0x517bae]
[0x43fab9]
[0x7f57f98d9a99] + 0xea99
[0x439142]
[0x440ee1]
[0x468d04]
[0x7f57f93f9c80] + 0xb8c80
[0x7f57f98d26ba] + 0x76ba
[0x7f57f8b5f41d] clone + 0x6d
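For anyone hitting the same abort: it is raised when Marian's `io::getline` wrapper finds the underlying stream in a bad state mid-read, i.e. a hard I/O error rather than ordinary end-of-file. Below is a minimal sketch of that failure mode, not Marian's actual code; the name `checkedGetline` and the file path are illustrative only.

```cpp
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Sketch of a getline wrapper that aborts on a hard I/O error, roughly
// the class of check that fires at file_stream.h:218. std::getline sets
// badbit on a low-level read failure (e.g. the device backing a temp
// file goes away or its writes failed earlier); that is distinct from
// the failbit/eofbit set on a normal end-of-file.
std::istream& checkedGetline(std::istream& in, std::string& line,
                             const std::string& path) {
  std::getline(in, line);
  if (in.bad()) {  // unrecoverable I/O error, not plain EOF
    std::cerr << "Error reading from file '" << path << "'\n";
    std::abort();  // Marian raises its own ABORT with a call stack here
  }
  return in;
}

int main() {
  const std::string path = "/tmp/example.txt";  // hypothetical input
  std::ifstream f(path);
  std::string line;
  while (checkedGetline(f, line, path))  // loop ends at EOF via failbit
    std::cout << line << '\n';
}
```

Consistent with the discussion above, a plausible chain is that writing the shuffled temp files failed silently (e.g. temp space exhausted) and the later read then hit badbit; so it is worth checking `df -h` on the `--tempdir` mount while the shuffling and training are running, not only beforehand, since the unlinked temp files do not show up as directory entries.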