marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

cost equals nan when training model with marian #120

Closed 520jefferson closed 7 years ago

520jefferson commented 7 years ago

    [2017-10-14 02:03:19] Ep. 2 : Up. 116760 : Sen. 6962176 : Cost 111.24 : Time 3.57s : 11717.49 words/s
    [2017-10-14 02:03:22] Ep. 2 : Up. 116770 : Sen. 6964736 : Cost 145.96 : Time 3.45s : 14980.89 words/s
    [2017-10-14 02:03:26] Ep. 2 : Up. 116780 : Sen. 6967296 : Cost 139.78 : Time 4.01s : 12567.44 words/s
    [2017-10-14 02:03:30] Ep. 2 : Up. 116790 : Sen. 6969856 : Cost 87.32 : Time 3.71s : 9522.98 words/s
    [2017-10-14 02:03:33] Ep. 2 : Up. 116800 : Sen. 6972416 : Cost nan : Time 2.74s : 19413.05 words/s
    [2017-10-14 02:03:36] Ep. 2 : Up. 116810 : Sen. 6974976 : Cost nan : Time 3.23s : 13540.76 words/s
    [2017-10-14 02:03:40] Ep. 2 : Up. 116820 : Sen. 6977536 : Cost nan : Time 3.95s : 13095.72 words/s
    [2017-10-14 02:03:43] Ep. 2 : Up. 116830 : Sen. 6980096 : Cost nan : Time 3.20s : 11438.62 words/s

When training reaches around update 116800, the cost becomes nan and the model is invalid. Does anyone else see the same problem? PS: the corpus size is about 24M and I train with multiple GPUs. @emjotde

emjotde commented 7 years ago

Hi, interesting, can you post your configuration or command line?

I believe this might already be fixed in one of our experimental branches (I had some problems with an unstable softmax there). Since the command line options are changing a bit as we approach the release of version 1.0, I could provide you with an updated config or command line invocation.

520jefferson commented 7 years ago

Hi, sorry for the late reply. My configuration is as follows.

C2E configuration:

    ../../build/marian \
        --type amun \
        --model models_amun/512-1024-ch_en.npz \
        --devices $@ \
        --seed 0 \
        --dim-emb 512 \
        --dim-rnn 1024 \
        --train-set train.src.ch.bpe train.tar.en.bpe \
        --vocabs train.src.ch.bpe.pkl.json train.tar.en.bpe.pkl.json \
        --dim-vocabs 48550 31800 \
        --disp-freq 10 \
        --save-freq 2000 \
        --learn-rate 0.001 \
        --max-length 50 \
        --optimizer adam \
        --mini-batch 256 \
        --maxi-batch 20 \
        --dropout-rnn 0.2 \
        --dropout-src 0 \
        --dropout-trg 0 \
        --tempdir tmp \
        --after-batches 20000000

E2C configuration:

    ../../build/marian \
        --type s2s \
        --model models_s2s/512-1024-en_ch.npz \
        --devices $@ \
        --seed 0 \
        --dim-emb 512 \
        --dim-rnn 1024 \
        --train-set train.src.en.bpe train.tar.ch.bpe \
        --vocabs train.src.en.bpe.pkl.json train.tar.ch.bpe.pkl.json \
        --dim-vocabs 31800 48550 \
        --dec-cell-base-depth 2 \
        --dec-cell-high-depth 2 \
        --disp-freq 10 \
        --save-freq 2000 \
        --learn-rate 0.001 \
        --max-length 50 \
        --optimizer adam \
        --mini-batch 256 \
        --maxi-batch 20 \
        --dropout-rnn 0.2 \
        --dropout-src 0 \
        --dropout-trg 0 \
        --tempdir tmp \
        --after-batches 20000000

@emjotde I can try with your updated config or command line.

emjotde commented 7 years ago

OK, can you try master branch from http://github.com/marian-nmt/marian-dev with that config? It should not have the NaN problem any more.

However, I notice your learning rate is really high, which is probably a secondary reason for the NaN to appear. Shortly before that happens, the costs start fluctuating, which probably results in overflows in the softmax. With a lower learning rate (the default is 0.0001) this is a lot less likely to happen.

You may also want to try --mini-batch-fit (previously --dynamic-batching). With this option Marian tries to adapt the mini-batch size to the available workspace memory, which you can fix, for instance, with --workspace 5000 or a larger number.
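For illustration only, a sketch of how the C2E command above might look with those two options and the default learning rate added (the remaining options stay as they are; 5000 MB is just an example value):

    # Sketch only: lower learning rate plus mini-batch fitting with a fixed workspace.
    ../../build/marian \
        --type amun \
        --model models_amun/512-1024-ch_en.npz \
        --devices $@ \
        --train-set train.src.ch.bpe train.tar.en.bpe \
        --vocabs train.src.ch.bpe.pkl.json train.tar.en.bpe.pkl.json \
        --dim-vocabs 48550 31800 \
        --learn-rate 0.0001 \
        --mini-batch-fit \
        --workspace 5000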

520jefferson commented 7 years ago

Hi @emjotde, my configuration is as follows.

C2E:

    ../../build/marian \
        --type amun \
        --model models_amun/512-1024-ch_en.npz \
        --devices $@ \
        --seed 0 \
        --dim-emb 512 \
        --dim-rnn 1024 \
        --train-set train.src.ch.bpe train.tar.en.bpe \
        --vocabs train.src.ch.bpe.pkl.json train.tar.en.bpe.pkl.json \
        --dim-vocabs 48550 31800 \
        --disp-freq 10 \
        --save-freq 2000 \
        --learn-rate 0.0001 \
        --max-length 50 \
        --optimizer adam \
        --mini-batch 256 \
        --maxi-batch 20 \
        --dropout-rnn 0.2 \
        --dropout-src 0 \
        --dropout-trg 0 \
        --tempdir tmp \
        --mini-batch-fit \
        --workspace 10000 \
        --after-batches 20000000

E2C:

    ../../build/marian \
        --type s2s \
        --model models_s2s/512-1024-en_ch.npz \
        --devices $@ \
        --seed 0 \
        --dim-emb 512 \
        --dim-rnn 1024 \
        --train-set train.src.en.bpe train.tar.ch.bpe \
        --vocabs train.src.en.bpe.pkl.json train.tar.ch.bpe.pkl.json \
        --dim-vocabs 31800 48550 \
        --dec-cell-base-depth 2 \
        --dec-cell-high-depth 2 \
        --disp-freq 10 \
        --save-freq 2000 \
        --learn-rate 0.0001 \
        --max-length 50 \
        --optimizer adam \
        --mini-batch 256 \
        --maxi-batch 20 \
        --dropout-rnn 0.2 \
        --dropout-src 0 \
        --dropout-trg 0 \
        --tempdir tmp \
        --mini-batch-fit \
        --workspace 10000 \
        --after-batches 20000000

When I start the training process I hit critical errors. nvidia-smi (the total memory of each GPU is 22912 MiB) shows:

    |    4     17442      C   Unknown Error                              11609MiB |
    |    4     17570      C   ../../build/marian                         11077MiB |
    |    5     17442      C   Unknown Error                              11609MiB |
    |    5     17570      C   ../../build/marian                         11077MiB |
    |    6     17442      C   Unknown Error                              11609MiB |

C2E training:

    [2017-10-16 10:07:41] [memory] Reserving 232 MB, device 6
    [2017-10-16 10:07:41] [memory] Reserving 232 MB, device 5
    [2017-10-16 10:07:44] Ep. 1 : Up. 10 : Sen. 4121 : Cost 214.60 : Time 79.92s : 1038.37 words/s
    [2017-10-16 10:07:48] Ep. 1 : Up. 20 : Sen. 10240 : Cost 172.57 : Time 4.32s : 23102.98 words/s
    [2017-10-16 10:07:52] Ep. 1 : Up. 30 : Sen. 15360 : Cost 182.24 : Time 3.54s : 25759.34 words/s
    [2017-10-16 10:07:55] Ep. 1 : Up. 40 : Sen. 20050 : Cost 156.31 : Time 3.85s : 21480.16 words/s
    [2017-10-16 10:08:00] Ep. 1 : Up. 50 : Sen. 24880 : Cost 160.90 : Time 4.01s : 23223.76 words/s
    [2017-10-16 10:08:03] Ep. 1 : Up. 60 : Sen. 29804 : Cost 112.70 : Time 3.62s : 20172.80 words/s
    [2017-10-16 10:08:07] Ep. 1 : Up. 70 : Sen. 34260 : Cost 159.92 : Time 3.67s : 25135.64 words/s
    [2017-10-16 10:08:10] Ep. 1 : Up. 80 : Sen. 38950 : Cost 107.33 : Time 3.19s : 21998.28 words/s
    [2017-10-16 10:08:14] Ep. 1 : Up. 90 : Sen. 43960 : Cost 146.55 : Time 4.18s : 23579.54 words/s
    [2017-10-16 10:08:18] Ep. 1 : Up. 100 : Sen. 49080 : Cost 134.35 : Time 3.58s : 26416.84 words/s
    [2017-10-16 10:08:22] Ep. 1 : Up. 110 : Sen. 53817 : Cost 139.70 : Time 4.50s : 20401.84 words/s
    [2017-10-16 10:08:27] Ep. 1 : Up. 120 : Sen. 59700 : Cost 121.67 : Time 4.37s : 22746.39 words/s
    [2017-10-16 10:08:36] Ep. 1 : Up. 130 : Sen. 65410 : Cost 122.33 : Time 9.22s : 10515.02 words/s
    [2017-10-16 10:08:45] Ep. 1 : Up. 140 : Sen. 70710 : Cost 131.00 : Time 9.23s : 10338.31 words/s
    terminate called recursively
    terminate called after throwing an instance of 'terminate called recursively
    util::Exception'
    Aborted (core dumped)

Then E2C training:

    [2017-10-16 10:12:26] Ep. 1 : Up. 610 : Sen. 266800 : Cost 138.25 : Time 3.72s : 23543.23 words/s
    [2017-10-16 10:12:30] Ep. 1 : Up. 620 : Sen. 271020 : Cost 115.07 : Time 3.27s : 21315.96 words/s
    [2017-10-16 10:12:33] Ep. 1 : Up. 630 : Sen. 275300 : Cost 146.81 : Time 3.38s : 25377.85 words/s
    [2017-10-16 10:12:37] Ep. 1 : Up. 640 : Sen. 279380 : Cost 133.66 : Time 3.87s : 19772.79 words/s
    [2017-10-16 10:12:41] Ep. 1 : Up. 650 : Sen. 283600 : Cost 148.91 : Time 3.75s : 23125.39 words/s
    [2017-10-16 10:12:44] Ep. 1 : Up. 660 : Sen. 288470 : Cost 110.06 : Time 3.66s : 20935.18 words/s
    [2017-10-16 10:12:48] Ep. 1 : Up. 670 : Sen. 293440 : Cost 112.88 : Time 3.33s : 23985.58 words/s
    [2017-10-16 10:12:52] Ep. 1 : Up. 680 : Sen. 297520 : Cost 171.12 : Time 4.08s : 23287.03 words/s
    [2017-10-16 10:12:55] Ep. 1 : Up. 690 : Sen. 301740 : Cost 120.48 : Time 3.56s : 20345.47 words/s
    [2017-10-16 10:12:59] Ep. 1 : Up. 700 : Sen. 306100 : Cost 118.58 : Time 3.47s : 21118.39 words/s
    [2017-10-16 10:13:02] Ep. 1 : Up. 710 : Sen. 310660 : Cost 148.38 : Time 3.66s : 25289.84 words/s
    [2017-10-16 10:13:06] Ep. 1 : Up. 720 : Sen. 314800 : Cost 138.74 : Time 3.95s : 20179.91 words/s
    [2017-10-16 10:13:09] Ep. 1 : Up. 730 : Sen. 319720 : Cost 101.57 : Time 3.19s : 22914.85 words/s
    [2017-10-16 10:13:13] Ep. 1 : Up. 740 : Sen. 323880 : Cost 156.47 : Time 3.71s : 24179.30 words/s
    [2017-10-16 10:13:17] Ep. 1 : Up. 750 : Sen. 328020 : Cost 139.17 : Time 3.83s : 20829.00 words/s
    [2017-10-16 10:13:20] Ep. 1 : Up. 760 : Sen. 332890 : Cost 114.52 : Time 3.46s : 23058.13 words/s
    [2017-10-16 10:13:24] Ep. 1 : Up. 770 : Sen. 336820 : Cost 137.83 : Time 3.40s : 22112.68 words/s
    [2017-10-16 10:13:27] Ep. 1 : Up. 780 : Sen. 340750 : Cost 139.18 : Time 3.59s : 21062.06 words/s
    terminate called after throwing an instance of 'util::Exception'
      what():  marian-dev/src/kernels/cuda_helpers.h:13 in void gpuAssert(cudaError_t, const char*, int, bool) threw util::Exception.
    GPUassert: an illegal memory access was encountered marian-dev/src/tensors/tensor.cu 25
    Aborted (core dumped)

When I change the workspace to 8000, the error is the same. When I drop --workspace and --mini-batch-fit, the same errors appear. Why? Is this a GPU memory error?

emjotde commented 7 years ago

Do I understand correctly that both processes die at the same time? I would still guess it might be an insufficient-memory issue caused by running both processes. Maybe one tried to reallocate something and then ran out of memory.

I do not think it is a good idea to run two processes on one GPU anyway; maybe use a larger batch instead and keep one process per GPU?
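For example (a sketch only; the wrapper script names, log paths, and device IDs are hypothetical, assuming each script forwards its arguments to --devices via $@ as your configs do):

    # Sketch: give each training run its own GPUs instead of sharing cards.
    nohup ./train_c2e.sh 0 1 2 3 > c2e.log 2>&1 &   # C2E on GPUs 0-3
    nohup ./train_e2c.sh 4 5 6 7 > e2c.log 2>&1 &   # E2C on GPUs 4-7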

520jefferson commented 7 years ago

I can try what you suggest, but when one process dies, the other also dies because of memory. I think that if it were caused by memory, the latter would keep training normally after the former died, right?

Second, I put the two training processes on the same GPUs using marian-nmt/marian and both ran normally. I set the learning rate to 0.0001 just as you say, and both use dynamic batching and a workspace. If one is running and the other's workspace is not enough, the other just logs this info in a loop and does not crash:

    [2017-10-16 14:26:58] [memory] Reserving 348 MB, device 4
    [2017-10-16 14:26:59] [memory] Reserving 348 MB, device 5
    [2017-10-16 14:26:59] [memory] Reserving 348 MB, device 6

If the other's workspace is enough, both run normally.

emjotde commented 7 years ago

1) Not necessarily; if they both happen to re-allocate at the same time they can choke each other. In general, though, I would say training two models on one GPU is not recommended. It is certainly untested for Marian.

2) I still recommend trying the current Marian version from https://github.com/marian-nmt/marian-dev; it has the more stable softmax. The --dynamic-batching option has been renamed to --mini-batch-fit there.

520jefferson commented 7 years ago

1. I set --dec-cell-base-depth 2 --dec-cell-high-depth 2 when the training type is s2s, and s2s decodes slower than amun. Is that because of the network structure?

2. When translating with amun -d 3, it actually occupies GPUs 0 and 3. I am confused why GPU 0 is used.

3. When I set --mini-batch-fit --workspace 18000, nvidia-smi shows:

    +-------------------------------+----------------------+----------------------+
    |   7  Tesla P40           Off  | 0000:0F:00.0     Off |                    0 |
    | N/A   53C    P0   190W / 250W |  19333MiB / 22912MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+

and the logs show:

    [2017-10-18 10:34:22] [memory] Extending reserved space to 18432 MB (device 7)
    [2017-10-18 10:34:22] [memory] Reserving 348 MB, device 7
    [2017-10-18 10:34:23] [memory] Reserving 348 MB, device 7

Then it waits for about 20 minutes doing nothing. Is that normal?

emjotde commented 7 years ago

1) Multiple reasons: s2s is a bit slower than amun, about 10% with comparable settings. Amun is hand-coded for one architecture, s2s is general. Amun also has batched decoding, so it can translate multiple sentences at once (setting --mini-batch > 1 for Amun activates this; see the sketch after this list); we are working on adding this to s2s. And of course the deeper architecture itself also plays a role.

2) Bug in amun. It is difficult for us to fix as we don't see it on our machines, but it has been reported multiple times.

3) I think it is collecting statistics for batch-fitting here. With that large an amount of memory this may take a while, as it increases the batch size for different sentence lengths and checks whether they fit. What did you set --disp-freq to?
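Regarding 1), a sketch of what batched decoding with Amun could look like (the model and vocabulary paths, and the mini-batch value of 16, are placeholders rather than recommendations):

    # Sketch: --mini-batch > 1 activates batched translation in amun.
    marian-master/build/amun \
        -d 3 \
        -m model.npz \
        -s source-vocab.json \
        -t target-vocab.json \
        --mini-batch 16 \
        -i input.bpe > output.txt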

hieuhoang commented 7 years ago
If you can provide the model, I'll look into it. If you can provide access to the same machine where you're seeing the bug, even better. I can't seem to replicate it on a 2-GPU machine.

520jefferson commented 7 years ago

@emjotde 3. I set --disp-freq 10. Do I need to give it such a big workspace? Although the training process reserves that fixed space, it won't actually use that much, will it? (I set --max-length 50.)

emjotde commented 7 years ago

It will fill up the space with a larger number of sentences. So it might use an average mini-batch size of 500 or more.

520jefferson commented 7 years ago

@hieuhoang Sorry, I didn't save the model; after a new round of training I will share it with you when I hit the problem again. There is a security policy in the company, so I can't provide access, sorry. Would increasing the learning rate, adding more corpus, and running more iterations help?

emjotde commented 7 years ago

Sorry, I did not get that last question?

520jefferson commented 7 years ago

You mean the mini-batch equals max-length * disp-freq? I always thought disp-freq was only useful for seeing the cost.

The last question is about the relationship between workspace and mini-batch: if the mini-batch (I set 256) is low, the utilization ratio will be low and the workspace won't be used sufficiently?

I am also training with Nematus (https://github.com/EdinburghNLP/nematus) on the same corpus with the same settings. The BLEU | METEOR values I get are:

    nematus c2e:   0.4058 | 0.610273
    nematus e2c:   0.5152 | 0.60306
    marian-master: 0.3845 | 0.560889
    marian-master: 0.4549 | 0.537081

Marian trains very fast compared to Nematus, but why are the BLEU and METEOR scores lower? What can I do to improve them?

emjotde commented 7 years ago

1) No, there is no relation between disp-freq and mini-batch size, as you said it is only for display.

2) Can you post your complete Nematus configs and marian configs for both experiments? Otherwise I cannot tell where the differences are. The models are mathematically basically the same, there should be no such differences for the same settings.

520jefferson commented 7 years ago

@emjotde As you asked, the configs are as follows, but note that we changed Nematus to support multiple GPUs. C2E Nematus:

    train(train_len=100,
          dim_word=512,  # word vector dimensionality
          dim=1024,  # the number of LSTM units
          factors=1, # input factors
          dim_per_factor=None, # list of word vector dimensionalities (one per factor): [250,200,50] for total dimensionality of 500
          encoder='gru',
          decoder='gru_cond',
          patience=10,  # early stopping patience
          max_epochs=8000,
          finish_after=20000000,  # finish after this many updates
          dispFreq=10,
          decay_c=0.,  # L2 regularization penalty
          map_decay_c=0., # L2 regularization penalty towards original weights
          clip_c=5.,  # gradient clipping threshold
          lrate=0.001,  # learning rate
          n_words_src=48550,  # source vocabulary size
          n_words=31800,  # target vocabulary size
          maxlen=50,  # maximum length of the description
          optimizer='adam',
          batch_size=128,
          valid_batch_size=16,
          saveto='models_1010/512-1024-ch_en',
          validFreq=0,
          saveFreq=2000,   # save the parameters after every saveFreq updates
          sampleFreq=2000,   # generate some samples after every sampleFreq
          datasets=[
              home + 'train0825.src.ch.0926.filter.bpe_integrate.shuf',
              home + 'train0825.tar.en.0926.filter.bpe_integrate.shuf'],
          valid_datasets=['../data/dev/newstest2011.en.tok',
                          '../data/dev/newstest2011.fr.tok'],
          dictionaries=[
              home + 'train0825.src.ch.0926.filter.bpe_integrate.pkl',
              home + 'train0825.tar.en.0926.filter.bpe_integrate.pkl'],
          use_dropout=True,
          dropout_embedding=0.2, # dropout for input embeddings (0: no dropout)
          dropout_hidden=0.2, # dropout for hidden layers (0: no dropout)
          dropout_source=0, # dropout source words (0: no dropout)
          dropout_target=0, # dropout target words (0: no dropout)
          reload_=True,
          reload_training_progress=True, # reload trainig progress (only used if reload_ is True)
          overwrite=False,
          external_validation_script=None,
          shuffle_each_epoch=True,
          sort_by_length=True,
          use_domain_interpolation=False, # interpolate between an out-domain training corpus and an in-domain training corpus
          domain_interpolation_min=0.1, # minimum (initial) fraction of in-domain training data
          domain_interpolation_max=1.0, # maximum fraction of in-domain training data
          domain_interpolation_inc=0.1, # interpolation increment to be applied each time patience runs out, until maximum amount of interpolation is reached
          domain_interpolation_indomain_datasets=['indomain.en', 'indomain.fr'], # in-domain parallel training corpus
          maxibatch_size=20, #How many minibatches to load at one time
          objective="CE", #CE: cross-entropy; MRT: minimum risk training (see https://www.aclweb.org/anthology/P/P16/P16-1159.pdf)
          mrt_alpha=0.005,
          mrt_samples=100,
          mrt_samples_meanloss=10,
          mrt_reference=True,
          mrt_loss="SENTENCEBLEU n=4", # loss function for minimum risk training
          mrt_ml_mix=0.5, # interpolate mrt loss with ML loss
          model_version=0.1, #store version used for training for compatibility
          prior_model=None, # Prior model file, used for MAP
          tie_encoder_decoder_embeddings=False, # Tie the input embeddings of the encoder and the decoder (first factor only)
          tie_decoder_embeddings=False, # Tie the input embeddings of the decoder with the softmax output embeddings
          encoder_truncate_gradient=-1, # Truncate BPTT gradients in the encoder to this value. Use -1 for no truncation
          decoder_truncate_gradient=-1, # Truncate BPTT gradients in the decoder to this value. Use -1 for no truncation
    )

c2e marian master:

../../build/marian \
        --type amun \
        --model models_amun/512-1024-ch_en.npz \
        --devices $@ \
        --seed 0 \
        --dim-emb 512 \
        --dim-rnn 1024 \
        --train-set train0825.src.ch.0926.filter.bpe_integrate   train0825.tar.en.0926.filter.bpe_integrate \
        --vocabs train0825.src.ch.0926.filter.bpe_integrate.pkl.json train0825.tar.en.0926.filter.bpe_integrate.pkl.json \
        --dim-vocabs 48550 31800 \
        --disp-freq  10 \
        --save-freq  2000 \
        --learn-rate 0.0001 \
        --max-length 50 \
        --optimizer adam \
        --mini-batch 256 \
        --maxi-batch 20 \
        --dropout-rnn 0.2 \
        --dropout-src 0 \
        --dropout-trg 0 \
        --tempdir tmp \
        --dynamic-batching \
        --workspace 9000 \
        --after-batches 20000000

emjotde commented 7 years ago

Hm, the main differences I am seeing are:

clip_c=5., # gradient clipping threshold
lrate=0.001, # learning rate

So, you would get the same by setting --learn-rate 0.001 --clip-norm 5 for marian. If the large learning rate worked for Nematus it should also be fine for Marian with the fixed softmax in marian-dev.
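As a sketch, the relevant part of the Marian command line with those two settings applied (paths copied from your config above; the other options stay as they are):

    # Sketch: match Nematus's lrate=0.001 and clip_c=5. in Marian.
    ../../build/marian \
        --type amun \
        --model models_amun/512-1024-ch_en.npz \
        --devices $@ \
        --train-set train0825.src.ch.0926.filter.bpe_integrate train0825.tar.en.0926.filter.bpe_integrate \
        --vocabs train0825.src.ch.0926.filter.bpe_integrate.pkl.json train0825.tar.en.0926.filter.bpe_integrate.pkl.json \
        --dim-vocabs 48550 31800 \
        --learn-rate 0.001 \
        --clip-norm 5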

How long did you train (how many epochs, updates)? In a multi-gpu setting convergence can be a little slower in the beginning than using one GPU.

520jefferson commented 7 years ago

I trained with Nematus for about 8 days and with marian-master for 2 days and 4 hours. The BLEU of both is hard to push any higher.

Now the display info for Marian is:

    [2017-10-18 18:14:31] Ep. 7 : Up. 306350 : Sen. 5630845 : Cost 34.46 : Time 6.91s : 13596.73 words/s
    [2017-10-18 18:14:37] Ep. 7 : Up. 306360 : Sen. 5635375 : Cost 41.04 : Time 6.20s : 14970.67 words/s
    [2017-10-18 18:14:43] Ep. 7 : Up. 306370 : Sen. 5640755 : Cost 30.26 : Time 5.84s : 14396.02 words/s
    [2017-10-18 18:14:48] Ep. 7 : Up. 306380 : Sen. 5645115 : Cost 31.29 : Time 5.46s : 13058.30 words/s
    [2017-10-18 18:14:54] Ep. 7 : Up. 306390 : Sen. 5650185 : Cost 34.51 : Time 6.16s : 14634.79 words/s
    [2017-10-18 18:15:01] Ep. 7 : Up. 306400 : Sen. 5654715 : Cost 39.17 : Time 6.18s : 14526.57 words/s
    [2017-10-18 18:15:07] Ep. 7 : Up. 306410 : Sen. 5659455 : Cost 31.46 : Time 6.43s : 11930.19 words/s
    [2017-10-18 18:15:13] Ep. 7 : Up. 306420 : Sen. 5664195 : Cost 33.37 : Time 5.54s : 14815.46 words/s

emjotde commented 7 years ago

That should be long enough. I cannot really infer anything from that. How do you call the translation processes for both Nematus and Marian?

520jefferson commented 7 years ago

Yes, that's what I meant. Now I'm trying to train with marian-dev and to change to lrate=0.001 and clip_c=5. as you say.

emjotde commented 7 years ago

What I meant: Could you also post the commands you use to translate for both?

520jefferson commented 7 years ago

Sorry for missing that. Nematus:

Decoder=${HOME}/work/nmt/tools/amunmt-master0330/build/bin/amun
cat c2e.yml > tmp.yml
echo "    path: "${ModelPath}".iter"${iter}.npz >> tmp.yml
echo "input-file: "${INPUT} >> tmp.yml
${Decoder} -c tmp.yml > ${Testout}.at

c2e.yml:

# Paths are relative to config file location
relative-paths: yes
# performance settings
beam-size: 11
devices: [0]
normalize: yes
gpu-threads: 4
cpu-threads: 0
# scorer weights
weights:
  F0: 1.0
bpe: c2e_0926/train0825.src.ch.0926.filter.codes
debpe: yes
return-alignment: no
#wipo: true
# vocabularies
source-vocab: train0825.src.ch.0926.filter.bpe_integrate.pkl.json
target-vocab: train0825.tar.en.0926.filter.bpe_integrate.pkl.json
scorers:
  F0:
    type: Nematus

Marian C2E:

marian-master/build/amun -d  3  -m  $model   -s  train0825.src.ch.0926.filter.bpe_integrate.pkl.json  -t  train0825.tar.en.0926.filter.bpe_integrate.pkl.json  -i $INPUT > ${Testout}.at

e2c:

marian-master/build/s2s -m $model -d 2   -v train0825.src.en.0926.filter.bpe_integrate.pkl.json   train0825.tar.ch.0926.filter.bpe_integrate.pkl.json -i  $INPUT > ${Testout}.at
emjotde commented 7 years ago

That looks fine. OK, so you are using amun both for models trained with Nematus and for (shallow) models trained with Marian. So the difference is not coming from there, right?

Hm. Let's see what the new experiments do. I suppose your data is not public and you cannot share for testing.

With my own experiments I usually have near-identical results to Nematus; for instance, we had no problems reproducing Edinburgh's WMT 2017 systems with similar or slightly higher results.

520jefferson commented 7 years ago

I do not completely understand "using amun for models trained with Nematus and trained with Marian". For Nematus, I use Nematus for training and amun for decoding. For Marian, I use the tools you provide to train and decode.

The data is not public and I cannot take it outside for certain reasons.

I can try some new experiments with your help.

emjotde commented 7 years ago

OK, what is your command for translating with Nematus?

520jefferson commented 7 years ago

Nematus:

    Decoder=${HOME}/work/nmt/tools/amunmt-master0330/build/bin/amun
    cat c2e.yml > tmp.yml
    echo "    path: "${ModelPath}".iter"${iter}.npz >> tmp.yml
    echo "input-file: "${INPUT} >> tmp.yml
    ${Decoder} -c tmp.yml > ${Testout}.at

Is that not what you want?

emjotde commented 7 years ago

I was confused, because amun is something we provide too :)

You can use amun for the marian-trained models as well as long as you use --type amun, but not for the deep models.
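To make that concrete, a sketch based on the decode commands you already posted (model and vocabulary paths here are placeholders):

    # Shallow model trained with --type amun: amun can decode it.
    marian-master/build/amun -d 3 -m model.amun.npz \
        -s src-vocab.json -t trg-vocab.json -i input.bpe > output.amun.txt

    # Deep model trained with --type s2s: use the s2s decoder instead.
    marian-master/build/s2s -d 3 -m model.s2s.npz \
        -v src-vocab.json trg-vocab.json -i input.bpe > output.s2s.txt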

emjotde commented 7 years ago

Are the worse results from a model that later resulted in NaN?

520jefferson commented 7 years ago

marian-master is so fast that we want to switch from training with Nematus to training with marian-master, so I tried to use it. The cost equal to NaN happened the first time I trained with marian-master, with a config like the Nematus one. For translating I use marian-master's tools to decode, as provided in the project's README.

I use Nematus to train, but use amun (the predecessor of marian-master) for decoding because that decoder is much faster than Nematus's.

emjotde commented 7 years ago

If the problem with the worse quality persists, it would be good to try and confirm it with a public data set so that I can repeat the experiments.

emjotde commented 7 years ago

And for both, Nematus and Marian, it is usually very beneficial to enable layer-normalization. Both have that option. It should result in better convergence and better translation quality.
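For reference, a sketch of a C2E training command combining the suggestions from this thread so far (paths copied from the configs above; the values are examples rather than tuned recommendations):

    # Sketch: layer normalization plus the earlier suggestions
    # (Nematus-matching learning rate and clipping, and mini-batch fitting).
    ../../build/marian \
        --type amun \
        --model models_amun/512-1024-ch_en.npz \
        --devices $@ \
        --train-set train0825.src.ch.0926.filter.bpe_integrate train0825.tar.en.0926.filter.bpe_integrate \
        --vocabs train0825.src.ch.0926.filter.bpe_integrate.pkl.json train0825.tar.en.0926.filter.bpe_integrate.pkl.json \
        --dim-vocabs 48550 31800 \
        --learn-rate 0.001 \
        --clip-norm 5 \
        --layer-normalization \
        --mini-batch-fit \
        --workspace 9000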

520jefferson commented 7 years ago

I will try what you say about layer normalization. And I will run the same experiments to check the difference between Nematus and Marian when the learning rate and clip_c are the same.

emjotde commented 7 years ago

Actually, I just checked a larger example training and I am experiencing trouble training with multiple GPUs, mentioned in https://github.com/marian-nmt/marian-dev/issues/119 . Maybe let me fix this first before you try new experiments.

520jefferson commented 7 years ago

OK, thanks. I will just keep training, and if I find some problem I can share it with you.

PS: When the 4-GPU training begins, I hit the problem:

    [2017-10-18 20:02:54] Ep. 1 : Up. 4010 : Sen. 7140 : Cost 120.74 : Time 83.27s : 1386.23 words/s
    [2017-10-18 20:02:58] Ep. 1 : Up. 4020 : Sen. 13750 : Cost 156.24 : Time 4.25s : 29302.56 words/s
    [2017-10-18 20:03:02] Ep. 1 : Up. 4030 : Sen. 20480 : Cost 155.16 : Time 3.62s : 35186.56 words/s
    [2017-10-18 20:03:06] Ep. 1 : Up. 4040 : Sen. 27760 : Cost 130.64 : Time 3.99s : 29615.64 words/s
    terminate called after throwing an instance of 'util::Exception'
      what():  marian-dev/src/kernels/cuda_helpers.h:13 in void gpuAssert(cudaError_t, const char*, int, bool) threw util::Exception.
    GPUassert: an illegal memory access was encountered marian-dev/src/tensors/tensor.cu 70
    terminate called recursively
    Aborted (core dumped)

emjotde commented 7 years ago

I believe I fixed that problem a couple of hours ago in marian-dev, can you try again? It might also fix the problem you had above with running more than one process of marian-dev on a single GPU.

520jefferson commented 7 years ago

@emjotde When I updated the code to commit fce8a3b6ffbdc75e603a0a65f56c5f4fec7cbac9 and started the training process, I got this:

    [2017-10-19 22:52:23] [memory] Reserving 174 MB, device 1
    [2017-10-19 22:52:23] [memory] Reserving 174 MB, device 3
    [2017-10-19 22:52:25] Ep. 1 : Up. 10 : Sen. 4460 : Cost 167.19 : Time 91.82s : 780.61 words/s
    [2017-10-19 22:52:28] Ep. 1 : Up. 20 : Sen. 8155 : Cost 157.17 : Time 2.36s : 29833.05 words/s
    [2017-10-19 22:52:30] Ep. 1 : Up. 30 : Sen. 12670 : Cost 127.62 : Time 2.76s : 28199.68 words/s
    [2017-10-19 22:52:33] Ep. 1 : Up. 40 : Sen. 16560 : Cost 151.11 : Time 2.65s : 29615.65 words/s
    [2017-10-19 22:52:36] Ep. 1 : Up. 50 : Sen. 20480 : Cost 115.27 : Time 2.62s : 24470.53 words/s
    [2017-10-19 22:52:38] Ep. 1 : Up. 60 : Sen. 24940 : Cost 113.70 : Time 2.08s : 35117.11 words/s
    [2017-10-19 22:52:41] Ep. 1 : Up. 70 : Sen. 28881 : Cost 160.09 : Time 3.26s : 26811.33 words/s
    [2017-10-19 22:52:43] Ep. 1 : Up. 80 : Sen. 33160 : Cost 94.02 : Time 1.97s : 30898.42 words/s
    [2017-10-19 22:52:46] Ep. 1 : Up. 90 : Sen. 37001 : Cost 134.16 : Time 2.56s : 29507.47 words/s
    [2017-10-19 22:52:48] Ep. 1 : Up. 100 : Sen. 41500 : Cost 118.34 : Time 2.77s : 28487.18 words/s
    [2017-10-19 22:52:51] Ep. 1 : Up. 110 : Sen. 45760 : Cost 95.78 : Time 2.68s : 23694.72 words/s
    [2017-10-19 22:52:54] Ep. 1 : Up. 120 : Sen. 49590 : Cost 133.20 : Time 2.64s : 29071.47 words/s
    [2017-10-19 22:52:56] Ep. 1 : Up. 130 : Sen. 53792 : Cost 108.59 : Time 2.20s : 32078.57 words/s
    [2017-10-19 22:52:59] Ep. 1 : Up. 140 : Sen. 57557 : Cost 129.48 : Time 2.87s : 25614.82 words/s
    [2017-10-19 22:53:02] Ep. 1 : Up. 150 : Sen. 61770 : Cost 120.47 : Time 2.94s : 26475.86 words/s
    [2017-10-19 22:53:04] Ep. 1 : Up. 160 : Sen. 65855 : Cost 130.29 : Time 2.33s : 35191.87 words/s
    [2017-10-19 22:53:06] Ep. 1 : Up. 170 : Sen. 70690 : Cost 86.19 : Time 2.40s : 28417.73 words/s
    [2017-10-19 22:53:10] Ep. 1 : Up. 180 : Sen. 74830 : Cost 119.75 : Time 3.07s : 25329.12 words/s
    terminate called after throwing an instance of 'util::Exception'
      what():  marian-dev/src/kernels/cuda_helpers.h:13 in void gpuAssert(cudaError_t, const char*, int, bool) threw util::Exception.
    GPUassert: unspecified launch failure marian-dev/src/tensors/tensor.cu 70
    Aborted (core dumped)

My configuration:

    ../../build/marian \
        --type amun \
        --model models_amun/512-1024-ch_en.npz \
        --devices $@ \
        --seed 0 \
        --dim-emb 512 \
        --dim-rnn 1024 \
        --train-set corpus/train0825_CE/c2e_0926/train0825.src.ch.0926.filter.bpe_integrate corpus/train0825_CE/c2e_0926/train0825.tar.en.0926.filter.bpe_integrate \
        --vocabs corpus/train0825_CE/c2e_0926/train0825.src.ch.0926.filter.bpe_integrate.pkl.json corpus/train0825_CE/c2e_0926/train0825.tar.en.0926.filter.bpe_integrate.pkl.json \
        --dim-vocabs 48550 31800 \
        --disp-freq 10 \
        --save-freq 2000 \
        --learn-rate 0.001 \
        --max-length 50 \
        --optimizer adam \
        --mini-batch 256 \
        --maxi-batch 20 \
        --dropout-rnn 0.2 \
        --dropout-src 0 \
        --dropout-trg 0 \
        --tempdir tmp \
        --mini-batch-fit \
        --workspace 9500 \
        --clip-norm 5. \
        --layer-normalization \
        --after-batches 20000000

520jefferson commented 7 years ago

@emjotde Does the problem still exist? As soon as I start the training process it is interrupted by the errors mentioned above.

emjotde commented 7 years ago

Give me a few more days. I am still trying to figure out what is actually going on.

520jefferson commented 7 years ago

OK, thanks. Also, are there any design documents that would help others learn the code base more quickly?

emjotde commented 7 years ago

I've added a couple of new guards against NaNs and I think we have a bit better error handling/logging now. Might be worth it to try the newest master in marian-dev.

520jefferson commented 7 years ago

@emjotde Now I updated the marian-dev code and started the training program, but I hit the error again:

    [2017-10-24 10:08:52] Ep. 1 : Up. 590 : Sen. 308701 : Cost 85.70 : Time 6.02s : 15248.24 words/s
    [2017-10-24 10:08:59] Ep. 1 : Up. 600 : Sen. 313980 : Cost 92.22 : Time 6.55s : 15170.45 words/s
    [2017-10-24 10:09:05] Ep. 1 : Up. 610 : Sen. 319450 : Cost 78.50 : Time 5.56s : 16362.45 words/s
    [2017-10-24 10:09:11] Ep. 1 : Up. 620 : Sen. 324580 : Cost 88.74 : Time 6.43s : 14500.57 words/s
    [2017-10-24 10:09:17] Ep. 1 : Up. 630 : Sen. 329316 : Cost 88.10 : Time 5.97s : 14635.18 words/s
    [2017-10-24 10:09:23] Ep. 1 : Up. 640 : Sen. 334180 : Cost 79.02 : Time 6.09s : 13384.21 words/s
    [2017-10-24 10:09:31] Ep. 1 : Up. 650 : Sen. 340010 : Cost 86.61 : Time 7.47s : 14330.95 words/s
    terminate called after throwing an instance of 'util::Exception'
      what():  marian-dev/src/kernels/cuda_helpers.h:13 in void gpuAssert(cudaError_t, const char*, int, bool) threw util::Exception.
    GPUassert: an illegal memory access was encountered marian-dev/src/tensors/tensor.cu 25
    Aborted (core dumped)

Maybe the GPU memory wasn't allocated correctly, or some exception is hidden somewhere.

emjotde commented 7 years ago

My problem is that I cannot reproduce this on any machine. I even tried running it on 8 GPUs and it works. What is your current setting for -w (--workspace) and the number of devices?

520jefferson commented 7 years ago

@emjotde Hi, this is my configuration:

    ../../build/marian \
        --type amun \
        --model models_amun/512-1024-ch_en.npz \
        --devices $@ \
        --seed 0 \
        --dim-emb 512 \
        --dim-rnn 1024 \
        --train-set train0825.src.ch.0926.filter.bpe_integrate train0825.tar.en.0926.filter.bpe_integrate \
        --vocabs train0825.src.ch.0926.filter.bpe_integrate.pkl.json train0825.tar.en.0926.filter.bpe_integrate.pkl.json \
        --dim-vocabs 48550 31800 \
        --disp-freq 10 \
        --save-freq 2000 \
        --learn-rate 0.0001 \
        --max-length 50 \
        --optimizer adam \
        --mini-batch 256 \
        --maxi-batch 20 \
        --dropout-rnn 0.2 \
        --dropout-src 0 \
        --dropout-trg 0 \
        --tempdir tmp \
        --mini-batch-fit \
        --workspace 15000 \
        --clip-norm 5. \
        --layer-normalization \
        --after-batches 20000000

I start it with: nohup ./amun_train.sh 0 1 & and the memory of each card is 22912 MiB.

emjotde commented 7 years ago

Can you post the output of nvidia-smi?

520jefferson commented 7 years ago
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:04:00.0 Off |                    0 |
| N/A   31C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   34C    P0    50W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:07:00.0 Off |                    0 |
| N/A   25C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:08:00.0 Off |                    0 |
| N/A   27C    P0    50W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P40           Off  | 00000000:0C:00.0 Off |                    0 |
| N/A   27C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P40           Off  | 00000000:0D:00.0 Off |                    0 |
| N/A   27C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P40           Off  | 00000000:0E:00.0 Off |                    0 |
| N/A   25C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P40           Off  | 00000000:0F:00.0 Off |                    0 |
| N/A   22C    P0    50W / 250W |      0MiB / 22912MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
emjotde commented 7 years ago

Is this with CUDA 9.0?

520jefferson commented 7 years ago

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2016 NVIDIA Corporation
    Built on Sun_Sep__4_22:14:01_CDT_2016
    Cuda compilation tools, release 8.0, V8.0.44

emjotde commented 7 years ago

OK, I will try the recent driver and report back.

emjotde commented 7 years ago

I have the new driver now. Training on a K80 node with 8 GPUs with 12 GB each, using all of them, I cannot reproduce this behaviour with -w 8500. Does it always happen that early in training, after just a couple of hundred iterations? Does it happen with smaller values like -w 4000?