Closed: 520jefferson closed this issue 7 years ago
Hi, interesting, can you post your configuration or command line?
I believe this might already be fixed in one of our experimental branches (we had some problems with an unstable softmax there). Since the command line options are changing a bit as we approach release version 1.0, I could provide you with an updated config or command line invocation.
Hi, sorry for the late reply. My configuration is as follows:
C2E configuration:
../../build/marian \
--type amun \
--model models_amun/512-1024-ch_en.npz \
--devices $@ \
--seed 0 \
--dim-emb 512 \
--dim-rnn 1024 \
--train-set train.src.ch.bpe train.tar.en.bpe \
--vocabs train.src.ch.bpe.pkl.json train.tar.en.bpe.pkl.json \
--dim-vocabs 48550 31800 \
--disp-freq 10 \
--save-freq 2000 \
--learn-rate 0.001 \
--max-length 50 \
--optimizer adam \
--mini-batch 256 \
--maxi-batch 20 \
--dropout-rnn 0.2 \
--dropout-src 0 \
--dropout-trg 0 \
--tempdir tmp \
--after-batches 20000000
E2C configuration:
../../build/marian \
--type s2s \
--model models_s2s/512-1024-en_ch.npz \
--devices $@ \
--seed 0 \
--dim-emb 512 \
--dim-rnn 1024 \
--train-set train.src.en.bpe train.tar.ch.bpe \
--vocabs train.src.en.bpe.pkl.json train.tar.ch.bpe.pkl.json \
--dim-vocabs 31800 48550 \
--dec-cell-base-depth 2 \
--dec-cell-high-depth 2 \
--disp-freq 10 \
--save-freq 2000 \
--learn-rate 0.001 \
--max-length 50 \
--optimizer adam \
--mini-batch 256 \
--maxi-batch 20 \
--dropout-rnn 0.2 \
--dropout-src 0 \
--dropout-trg 0 \
--tempdir tmp \
--after-batches 20000000
@emjotde I can try with your updated config or command line.
OK, can you try the master branch from http://github.com/marian-nmt/marian-dev with that config? It should not have the NaN problem any more.
However, I notice your learning rate is really high, which is probably a secondary reason for the NaN to appear. Shortly before that happens, the costs start fluctuating, which probably results in overflows in the softmax. With a lower learning rate (the default is 0.0001) this is a lot less likely to happen.
You may also want to try --mini-batch-fit (previously --dynamic-batching). With this option it tries to adapt the mini-batch size to the available workspace memory, which you can fix for instance with --workspace 5000 or a larger number.
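As a rough end-to-end sketch of the suggestion above (the build steps assume a standard CMake setup; the data paths, device IDs and workspace size are placeholders, not values from this thread):

# Sketch only: fetch and build the marian-dev master branch
git clone https://github.com/marian-nmt/marian-dev
cd marian-dev
mkdir -p build && cd build
cmake .. && make -j

# Sketch only: let Marian fit the mini-batch size to a fixed workspace
# (placeholder model/data paths; keep your usual training options)
./marian \
  --type amun \
  --model model/model.npz \
  --train-set corpus.src corpus.trg \
  --vocabs vocab.src.json vocab.trg.json \
  --devices 0 1 \
  --mini-batch-fit \
  --workspace 5000 \
  --learn-rate 0.0001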
Hi @emjotde, my configuration is as follows:
C2E:
../../build/marian \
--type amun \
--model models_amun/512-1024-ch_en.npz \
--devices $@ \
--seed 0 \
--dim-emb 512 \
--dim-rnn 1024 \
--train-set train.src.ch.bpe train.tar.en.bpe \
--vocabs train.src.ch.bpe.pkl.json train.tar.en.bpe.pkl.json \
--dim-vocabs 48550 31800 \
--disp-freq 10 \
--save-freq 2000 \
--learn-rate 0.0001 \
--max-length 50 \
--optimizer adam \
--mini-batch 256 \
--maxi-batch 20 \
--dropout-rnn 0.2 \
--dropout-src 0 \
--dropout-trg 0 \
--tempdir tmp \
--mini-batch-fit \
--workspace 10000 \
--after-batches 20000000
E2C:
../../build/marian \
--type s2s \
--model models_s2s/512-1024-en_ch.npz \
--devices $@ \
--seed 0 \
--dim-emb 512 \
--dim-rnn 1024 \
--train-set train.src.en.bpe train.tar.ch.bpe \
--vocabs train.src.en.bpe.pkl.json train.tar.ch.bpe.pkl.json \
--dim-vocabs 31800 48550 \
--dec-cell-base-depth 2 \
--dec-cell-high-depth 2 \
--disp-freq 10 \
--save-freq 2000 \
--learn-rate 0.0001 \
--max-length 50 \
--optimizer adam \
--mini-batch 256 \
--maxi-batch 20 \
--dropout-rnn 0.2 \
--dropout-src 0 \
--dropout-trg 0 \
--tempdir tmp \
--mini-batch-fit \
--workspace 10000 \
--after-batches 20000000
When I start the training process, I hit critical errors. nvidia-smi info (total memory is 22912MiB per GPU):
| 4 17442 C Unknown Error 11609MiB |
| 4 17570 C ../../build/marian 11077MiB |
| 5 17442 C Unknown Error 11609MiB |
| 5 17570 C ../../build/marian 11077MiB |
| 6 17442 C Unknown Error 11609MiB |
C2E training:
[2017-10-16 10:07:41] [memory] Reserving 232 MB, device 6
[2017-10-16 10:07:41] [memory] Reserving 232 MB, device 5
[2017-10-16 10:07:44] Ep. 1 : Up. 10 : Sen. 4121 : Cost 214.60 : Time 79.92s : 1038.37 words/s
[2017-10-16 10:07:48] Ep. 1 : Up. 20 : Sen. 10240 : Cost 172.57 : Time 4.32s : 23102.98 words/s
[2017-10-16 10:07:52] Ep. 1 : Up. 30 : Sen. 15360 : Cost 182.24 : Time 3.54s : 25759.34 words/s
[2017-10-16 10:07:55] Ep. 1 : Up. 40 : Sen. 20050 : Cost 156.31 : Time 3.85s : 21480.16 words/s
[2017-10-16 10:08:00] Ep. 1 : Up. 50 : Sen. 24880 : Cost 160.90 : Time 4.01s : 23223.76 words/s
[2017-10-16 10:08:03] Ep. 1 : Up. 60 : Sen. 29804 : Cost 112.70 : Time 3.62s : 20172.80 words/s
[2017-10-16 10:08:07] Ep. 1 : Up. 70 : Sen. 34260 : Cost 159.92 : Time 3.67s : 25135.64 words/s
[2017-10-16 10:08:10] Ep. 1 : Up. 80 : Sen. 38950 : Cost 107.33 : Time 3.19s : 21998.28 words/s
[2017-10-16 10:08:14] Ep. 1 : Up. 90 : Sen. 43960 : Cost 146.55 : Time 4.18s : 23579.54 words/s
[2017-10-16 10:08:18] Ep. 1 : Up. 100 : Sen. 49080 : Cost 134.35 : Time 3.58s : 26416.84 words/s
[2017-10-16 10:08:22] Ep. 1 : Up. 110 : Sen. 53817 : Cost 139.70 : Time 4.50s : 20401.84 words/s
[2017-10-16 10:08:27] Ep. 1 : Up. 120 : Sen. 59700 : Cost 121.67 : Time 4.37s : 22746.39 words/s
[2017-10-16 10:08:36] Ep. 1 : Up. 130 : Sen. 65410 : Cost 122.33 : Time 9.22s : 10515.02 words/s
[2017-10-16 10:08:45] Ep. 1 : Up. 140 : Sen. 70710 : Cost 131.00 : Time 9.23s : 10338.31 words/s
terminate called recursively
terminate called after throwing an instance of 'terminate called recursively
util::Exception'
Aborted (core dumped)
Then E2C training:
[2017-10-16 10:12:26] Ep. 1 : Up. 610 : Sen. 266800 : Cost 138.25 : Time 3.72s : 23543.23 words/s
[2017-10-16 10:12:30] Ep. 1 : Up. 620 : Sen. 271020 : Cost 115.07 : Time 3.27s : 21315.96 words/s
[2017-10-16 10:12:33] Ep. 1 : Up. 630 : Sen. 275300 : Cost 146.81 : Time 3.38s : 25377.85 words/s
[2017-10-16 10:12:37] Ep. 1 : Up. 640 : Sen. 279380 : Cost 133.66 : Time 3.87s : 19772.79 words/s
[2017-10-16 10:12:41] Ep. 1 : Up. 650 : Sen. 283600 : Cost 148.91 : Time 3.75s : 23125.39 words/s
[2017-10-16 10:12:44] Ep. 1 : Up. 660 : Sen. 288470 : Cost 110.06 : Time 3.66s : 20935.18 words/s
[2017-10-16 10:12:48] Ep. 1 : Up. 670 : Sen. 293440 : Cost 112.88 : Time 3.33s : 23985.58 words/s
[2017-10-16 10:12:52] Ep. 1 : Up. 680 : Sen. 297520 : Cost 171.12 : Time 4.08s : 23287.03 words/s
[2017-10-16 10:12:55] Ep. 1 : Up. 690 : Sen. 301740 : Cost 120.48 : Time 3.56s : 20345.47 words/s
[2017-10-16 10:12:59] Ep. 1 : Up. 700 : Sen. 306100 : Cost 118.58 : Time 3.47s : 21118.39 words/s
[2017-10-16 10:13:02] Ep. 1 : Up. 710 : Sen. 310660 : Cost 148.38 : Time 3.66s : 25289.84 words/s
[2017-10-16 10:13:06] Ep. 1 : Up. 720 : Sen. 314800 : Cost 138.74 : Time 3.95s : 20179.91 words/s
[2017-10-16 10:13:09] Ep. 1 : Up. 730 : Sen. 319720 : Cost 101.57 : Time 3.19s : 22914.85 words/s
[2017-10-16 10:13:13] Ep. 1 : Up. 740 : Sen. 323880 : Cost 156.47 : Time 3.71s : 24179.30 words/s
[2017-10-16 10:13:17] Ep. 1 : Up. 750 : Sen. 328020 : Cost 139.17 : Time 3.83s : 20829.00 words/s
[2017-10-16 10:13:20] Ep. 1 : Up. 760 : Sen. 332890 : Cost 114.52 : Time 3.46s : 23058.13 words/s
[2017-10-16 10:13:24] Ep. 1 : Up. 770 : Sen. 336820 : Cost 137.83 : Time 3.40s : 22112.68 words/s
[2017-10-16 10:13:27] Ep. 1 : Up. 780 : Sen. 340750 : Cost 139.18 : Time 3.59s : 21062.06 words/s
terminate called after throwing an instance of 'util::Exception'
what(): marian-dev/src/kernels/cuda_helpers.h:13 in void gpuAssert(cudaError_t, const char*, int, bool) threw util::Exception.
GPUassert: an illegal memory access was encountered marian-dev/src/tensors/tensor.cu 25
Aborted (core dumped)
When I change the workspace to 8000, the error is the same. When I drop --workspace and --mini-batch-fit, the same errors show. Why? Is it a GPU memory error?
Do I understand correctly that both processes die at the same time? I would still guess that might be an insufficient-memory issue caused by running both processes. Maybe one tried to reallocate something and then ran out of memory.
I do not think it is a good idea to run two processes on one GPU anyway, maybe use a larger batch instead and keep one process per GPU?
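For instance, keeping each direction on its own GPUs could look like this; the script names and device assignments are placeholders:

# Sketch only: one training process per set of GPUs instead of two processes sharing a card
nohup ./train_c2e.sh 0 1 > c2e.log 2>&1 &   # C2E training on GPUs 0 and 1
nohup ./train_e2c.sh 2 3 > e2c.log 2>&1 &   # E2C training on GPUs 2 and 3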
I can try what you say, but when one process dies, the other also dies because of memory. I think if it were caused by memory, the latter would train normally after the former died, right?
Second, I put the two training processes on the same GPUs using marian-nmt/marian and both ran normally. I set the learning rate to 0.0001 just as you say, and both use dynamic-batching and workspace. If one is running and the other's workspace is not enough, the other logs this info repeatedly and does not crash:
[2017-10-16 14:26:58] [memory] Reserving 348 MB, device 4
[2017-10-16 14:26:59] [memory] Reserving 348 MB, device 5
[2017-10-16 14:26:59] [memory] Reserving 348 MB, device 6
If the other's workspace is enough, both run normally.
1) Not necessarily; if they happen to both re-allocate at the same time they can choke each other. Although I would generally say training two models on one GPU is rather not recommended. It is certainly untested for Marian.
2) I still recommend trying the current Marian version from https://github.com/marian-nmt/marian-dev; it has the more stable softmax. The --dynamic-batching option has been renamed to --mini-batch-fit there.
1) I set --dec-cell-base-depth 2 and --dec-cell-high-depth 2 when the training type is s2s, so s2s decodes slower than amun. Is that because of the network structure?
2) When translating with amun -d 3, it actually takes GPUs 0 and 3. I am confused why GPU 0 is used.
3) When I set --mini-batch-fit --workspace 18000, the GPU shows:
+-------------------------------+----------------------+----------------------+
| 7 Tesla P40 Off | 0000:0F:00.0 Off | 0 |
| N/A 53C P0 190W / 250W | 19333MiB / 22912MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
Logs:
[2017-10-18 10:34:22] [memory] Extending reserved space to 18432 MB (device 7)
[2017-10-18 10:34:22] [memory] Reserving 348 MB, device 7
[2017-10-18 10:34:23] [memory] Reserving 348 MB, device 7
It then waits about 20 minutes doing nothing. Is that normal?
1) Multiple reasons: s2s is a bit slower than amun, about 10% with comparable settings. Amun is hand-coded for one architecture, s2s is general. Amun has batched decoding, so it can translate multiple sentences at once; we are working to add this to s2s (--mini-batch >1 activates this for Amun). And it will of course also be affected by the architecture.
2) Bug in amun. Difficult to fix for us as we don't see it on our machines, but has been reported multiple times.
3) I think it is collecting statistics for batch-fitting here. With that large amount of memory that may take a while as it is increasing the batch-size for different sentence lengths and checking whether they fit. What did you set --disp-freq to?
@emjotde 3) I set --disp-freq 10. Should I give it such a big workspace? Although the training process reserves that fixed space, it won't actually use that much, will it? (I set --max-length 50.)
It will fill up the space with a larger number of sentences. So it might use an average mini-batch size of 500 or more.
@hieuhoang Sorry, I didn't save the model; after a new round of training I will share it with you if I hit the problem again. There is a security policy at the company, so I can't provide access, sorry. Will increasing the learning rate, using more corpus, and running more iterations help?
Sorry, I did not get that last question.
You mean the mini-batch equals max-length * disp-freq? I always thought disp-freq was only useful for displaying the cost.
The last question is about the relationship between workspace and mini-batch: if the mini-batch (I set 256) is low, will the utilization ratio be low and the workspace not be used sufficiently?
I am also training with Nematus (https://github.com/EdinburghNLP/nematus), with the same corpus and settings. I get the following BLEU | METEOR values:
nematus c2e: 0.4058 | 0.610273
nematus e2c: 0.5152 | 0.60306
marian-master: 0.3845 | 0.560889
marian-master: 0.4549 | 0.537081
Marian trains very fast compared to Nematus, but why are the BLEU and METEOR scores lower? What can I do to improve them?
1) No, there is no relation between disp-freq and mini-batch size, as you said it is only for display.
2) Can you post your complete Nematus configs and marian configs for both experiments? Otherwise I cannot tell where the differences are. The models are mathematically basically the same, there should be no such differences for the same settings.
@emjotde As you ask, the configs are as follows, but we changed Nematus to support multiple GPUs. C2E Nematus:
train(train_len=100,
dim_word=512, # word vector dimensionality
dim=1024, # the number of LSTM units
factors=1, # input factors
dim_per_factor=None, # list of word vector dimensionalities (one per factor): [250,200,50] for total dimensionality of 500
encoder='gru',
decoder='gru_cond',
patience=10, # early stopping patience
max_epochs=8000,
finish_after=20000000, # finish after this many updates
dispFreq=10,
decay_c=0., # L2 regularization penalty
map_decay_c=0., # L2 regularization penalty towards original weights
clip_c=5., # gradient clipping threshold
lrate=0.001, # learning rate
n_words_src=48550, # source vocabulary size
n_words=31800, # target vocabulary size
maxlen=50, # maximum length of the description
optimizer='adam',
batch_size=128,
valid_batch_size=16,
saveto='models_1010/512-1024-ch_en',
validFreq=0,
saveFreq=2000, # save the parameters after every saveFreq updates
sampleFreq=2000, # generate some samples after every sampleFreq
datasets=[
home + 'train0825.src.ch.0926.filter.bpe_integrate.shuf',
home + 'train0825.tar.en.0926.filter.bpe_integrate.shuf'],
valid_datasets=['../data/dev/newstest2011.en.tok',
'../data/dev/newstest2011.fr.tok'],
dictionaries=[
home + 'train0825.src.ch.0926.filter.bpe_integrate.pkl',
home + 'train0825.tar.en.0926.filter.bpe_integrate.pkl'],
use_dropout=True,
dropout_embedding=0.2, # dropout for input embeddings (0: no dropout)
dropout_hidden=0.2, # dropout for hidden layers (0: no dropout)
dropout_source=0, # dropout source words (0: no dropout)
dropout_target=0, # dropout target words (0: no dropout)
reload_=True,
reload_training_progress=True, # reload training progress (only used if reload_ is True)
overwrite=False,
external_validation_script=None,
shuffle_each_epoch=True,
sort_by_length=True,
use_domain_interpolation=False, # interpolate between an out-domain training corpus and an in-domain training corpus
domain_interpolation_min=0.1, # minimum (initial) fraction of in-domain training data
domain_interpolation_max=1.0, # maximum fraction of in-domain training data
domain_interpolation_inc=0.1, # interpolation increment to be applied each time patience runs out, until maximum amount of interpolation is reached
domain_interpolation_indomain_datasets=['indomain.en', 'indomain.fr'], # in-domain parallel training corpus
maxibatch_size=20, #How many minibatches to load at one time
objective="CE", #CE: cross-entropy; MRT: minimum risk training (see https://www.aclweb.org/anthology/P/P16/P16-1159.pdf)
mrt_alpha=0.005,
mrt_samples=100,
mrt_samples_meanloss=10,
mrt_reference=True,
mrt_loss="SENTENCEBLEU n=4", # loss function for minimum risk training
mrt_ml_mix=0.5, # interpolate mrt loss with ML loss
model_version=0.1, #store version used for training for compatibility
prior_model=None, # Prior model file, used for MAP
tie_encoder_decoder_embeddings=False, # Tie the input embeddings of the encoder and the decoder (first factor only)
tie_decoder_embeddings=False, # Tie the input embeddings of the decoder with the softmax output embeddings
encoder_truncate_gradient=-1, # Truncate BPTT gradients in the encoder to this value. Use -1 for no truncation
decoder_truncate_gradient=-1, # Truncate BPTT gradients in the decoder to this value. Use -1 for no truncation
)
C2E Marian master:
../../build/marian \
--type amun \
--model models_amun/512-1024-ch_en.npz \
--devices $@ \
--seed 0 \
--dim-emb 512 \
--dim-rnn 1024 \
--train-set train0825.src.ch.0926.filter.bpe_integrate train0825.tar.en.0926.filter.bpe_integrate \
--vocabs train0825.src.ch.0926.filter.bpe_integrate.pkl.json train0825.tar.en.0926.filter.bpe_integrate.pkl.json \
--dim-vocabs 48550 31800 \
--disp-freq 10 \
--save-freq 2000 \
--learn-rate 0.0001 \
--max-length 50 \
--optimizer adam \
--mini-batch 256 \
--maxi-batch 20 \
--dropout-rnn 0.2 \
--dropout-src 0 \
--dropout-trg 0 \
--tempdir tmp \
--dynamic-batching \
--workspace 9000 \
--after-batches 20000000
Hm, the main differences I am seeing are:
clip_c=5., # gradient clipping threshold
lrate=0.001, # learning rate
So, you would get the same by setting --learn-rate 0.001 --clip-norm 5 for Marian. If the large learning rate worked for Nematus, it should also be fine for Marian with the fixed softmax in marian-dev.
How long did you train (how many epochs, updates)? In a multi-gpu setting convergence can be a little slower in the beginning than using one GPU.
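For reference, a sketch of the C2E Marian command with only those two settings changed; all paths and other options are taken from the config posted above:

# Sketch only: align Marian with the Nematus settings lrate=0.001 and clip_c=5
../../build/marian \
  --type amun \
  --model models_amun/512-1024-ch_en.npz \
  --devices $@ \
  --dim-emb 512 --dim-rnn 1024 \
  --train-set train0825.src.ch.0926.filter.bpe_integrate train0825.tar.en.0926.filter.bpe_integrate \
  --vocabs train0825.src.ch.0926.filter.bpe_integrate.pkl.json train0825.tar.en.0926.filter.bpe_integrate.pkl.json \
  --dim-vocabs 48550 31800 \
  --max-length 50 --optimizer adam \
  --mini-batch 256 --maxi-batch 20 \
  --dropout-rnn 0.2 \
  --learn-rate 0.001 \
  --clip-norm 5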
I trained with Nematus for about 8 days and with marian-master for 2 days and 4 hours. The BLEU of both is hard to push any higher.
Now the display info for Marian is:
[2017-10-18 18:14:31] Ep. 7 : Up. 306350 : Sen. 5630845 : Cost 34.46 : Time 6.91s : 13596.73 words/s
[2017-10-18 18:14:37] Ep. 7 : Up. 306360 : Sen. 5635375 : Cost 41.04 : Time 6.20s : 14970.67 words/s
[2017-10-18 18:14:43] Ep. 7 : Up. 306370 : Sen. 5640755 : Cost 30.26 : Time 5.84s : 14396.02 words/s
[2017-10-18 18:14:48] Ep. 7 : Up. 306380 : Sen. 5645115 : Cost 31.29 : Time 5.46s : 13058.30 words/s
[2017-10-18 18:14:54] Ep. 7 : Up. 306390 : Sen. 5650185 : Cost 34.51 : Time 6.16s : 14634.79 words/s
[2017-10-18 18:15:01] Ep. 7 : Up. 306400 : Sen. 5654715 : Cost 39.17 : Time 6.18s : 14526.57 words/s
[2017-10-18 18:15:07] Ep. 7 : Up. 306410 : Sen. 5659455 : Cost 31.46 : Time 6.43s : 11930.19 words/s
[2017-10-18 18:15:13] Ep. 7 : Up. 306420 : Sen. 5664195 : Cost 33.37 : Time 5.54s : 14815.46 words/s
That should be long enough. I cannot really infer anything from that. How do you call the translation processes for both, Nematus and Marian?
Yes, that's what I mean. Now I'm trying to train with marian-dev, changing to lrate=0.001 and clip_c=5 as you say.
What I meant: Could you also post the commands you use to translate for both?
sorry for missing that. nematus:
Decoder=${HOME}/work/nmt/tools/amunmt-master0330/build/bin/amun
cat c2e.yml > tmp.yml
echo " path: "${ModelPath}".iter"${iter}.npz >> tmp.yml
echo "input-file: "${INPUT} >> tmp.yml
${Decoder} -c tmp.yml > ${Testout}.at
c2e.yml:
# Paths are relative to config file location
relative-paths: yes
# performance settings
beam-size: 11
devices: [0]
normalize: yes
gpu-threads: 4
cpu-threads: 0
# scorer weights
weights:
  F0: 1.0
bpe: c2e_0926/train0825.src.ch.0926.filter.codes
debpe: yes
return-alignment: no
#wipo: true
# vocabularies
source-vocab: train0825.src.ch.0926.filter.bpe_integrate.pkl.json
target-vocab: train0825.tar.en.0926.filter.bpe_integrate.pkl.json
scorers:
  F0:
    type: Nematus
Marian c2e:
marian-master/build/amun -d 3 -m $model -s train0825.src.ch.0926.filter.bpe_integrate.pkl.json -t train0825.tar.en.0926.filter.bpe_integrate.pkl.json -i $INPUT > ${Testout}.at
e2c:
marian-master/build/s2s -m $model -d 2 -v train0825.src.en.0926.filter.bpe_integrate.pkl.json train0825.tar.ch.0926.filter.bpe_integrate.pkl.json -i $INPUT > ${Testout}.at
That looks fine. OK, so you are using amun both for models trained with Nematus and for models trained with Marian (for shallow models). So the difference is not coming from there, right?
Hm. Let's see what the new experiments do. I suppose your data is not public and you cannot share it for testing.
With my own experiments I usually get near-identical results to Nematus; for instance, we had no problems reproducing Edinburgh's WMT 2017 systems with similar or slightly higher results.
I do not completely understand "using amun for models trained with Nematus and trained with Marian". For Nematus, I use Nematus for training and amun for decoding. For Marian, I use the tools you provide to train and decode.
The data is not public and I cannot get it out for some reason.
I can try some new experiments with your help.
OK, what is your command for translating with Nematus?
Nematus:
Decoder=${HOME}/work/nmt/tools/amunmt-master0330/build/bin/amun
cat c2e.yml > tmp.yml
echo " path: "${ModelPath}".iter"${iter}.npz >> tmp.yml
echo "input-file: "${INPUT} >> tmp.yml
${Decoder} -c tmp.yml > ${Testout}.at
Is that not what you want?
I was confused, because amun is something we provide too :)
You can use amun for the Marian-trained models as well, as long as you use --type amun, but not for the deep models.
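In command form, that pairing looks roughly like the decode commands posted earlier; the model, vocabulary and input paths below are placeholders:

# Sketch only: amun decodes shallow models (Nematus-trained or Marian-trained with --type amun)
marian-master/build/amun -d 0 -m shallow_model.npz \
  -s source_vocab.json -t target_vocab.json -i input.bpe > output.txt

# Sketch only: deep models trained with --type s2s need the s2s decoder instead
marian-master/build/s2s -d 0 -m deep_model.npz \
  -v source_vocab.json target_vocab.json -i input.bpe > output.txt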
Are the worse results for a model that later resulted in NaN?
marian-master is so fast that we want to switch from Nematus training to marian-master, so I tried to use it. The cost equalled NaN the first time I trained with marian-master using a config like the Nematus one. When translating, I use marian-master's tools to decode, as provided in the project's README.
I use Nematus to train, but use amun (the predecessor of marian-master) to decode because its decoder is much faster than Nematus's.
If the problem with the worse quality persists, it would be good to try and confirm it with a public data set so that I can repeat the experiments.
And for both, Nematus and Marian, it is usually very beneficial to enable layer-normalization. Both have that option. It should result in better convergence and better translation quality.
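A minimal sketch on the Marian side (paths as in the earlier commands, other options left at their defaults for brevity); on the Nematus side the corresponding switch is an option of its train() call, named layer_normalisation in recent versions, though the exact name may depend on your Nematus copy:

# Sketch only: enable layer normalization for a Marian training run
../../build/marian \
  --type amun \
  --model models_amun/512-1024-ch_en.npz \
  --train-set train.src.ch.bpe train.tar.en.bpe \
  --vocabs train.src.ch.bpe.pkl.json train.tar.en.bpe.pkl.json \
  --layer-normalization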
I will try what you say about layer normalization. And I will run the same experiments to check the difference between Nematus and Marian when the learning rate and clip_c are the same.
Actually, I just checked a larger example training and I am experiencing trouble training with multiple GPUs, mentioned in https://github.com/marian-nmt/marian-dev/issues/119 . Maybe let me fix this first before you try new experiments.
OK, thanks. I will just keep training; if I find a problem I will share it with you.
PS: when the 4-GPU training begins, I hit the problem.
[2017-10-18 20:02:54] Ep. 1 : Up. 4010 : Sen. 7140 : Cost 120.74 : Time 83.27s : 1386.23 words/s
[2017-10-18 20:02:58] Ep. 1 : Up. 4020 : Sen. 13750 : Cost 156.24 : Time 4.25s : 29302.56 words/s
[2017-10-18 20:03:02] Ep. 1 : Up. 4030 : Sen. 20480 : Cost 155.16 : Time 3.62s : 35186.56 words/s
[2017-10-18 20:03:06] Ep. 1 : Up. 4040 : Sen. 27760 : Cost 130.64 : Time 3.99s : 29615.64 words/s
terminate called after throwing an instance of 'util::Exception'
what(): marian-dev/src/kernels/cuda_helpers.h:13 in void gpuAssert(cudaError_t, const char*, int, bool) threw util::Exception.
GPUassert: an illegal memory access was encountered marian-dev/src/tensors/tensor.cu 70
terminate called recursively
Aborted (core dumped)
I believe I fixed that problem a couple of hours ago in marian-dev, can you try again? It might also fix the problem you had above with running more than one process of marian-dev on a single GPU.
@emjotde When I updated the code to commit fce8a3b6ffbdc75e603a0a65f56c5f4fec7cbac9 and started the training process, I got this:
[2017-10-19 22:52:23] [memory] Reserving 174 MB, device 1
[2017-10-19 22:52:23] [memory] Reserving 174 MB, device 3
[2017-10-19 22:52:25] Ep. 1 : Up. 10 : Sen. 4460 : Cost 167.19 : Time 91.82s : 780.61 words/s
[2017-10-19 22:52:28] Ep. 1 : Up. 20 : Sen. 8155 : Cost 157.17 : Time 2.36s : 29833.05 words/s
[2017-10-19 22:52:30] Ep. 1 : Up. 30 : Sen. 12670 : Cost 127.62 : Time 2.76s : 28199.68 words/s
[2017-10-19 22:52:33] Ep. 1 : Up. 40 : Sen. 16560 : Cost 151.11 : Time 2.65s : 29615.65 words/s
[2017-10-19 22:52:36] Ep. 1 : Up. 50 : Sen. 20480 : Cost 115.27 : Time 2.62s : 24470.53 words/s
[2017-10-19 22:52:38] Ep. 1 : Up. 60 : Sen. 24940 : Cost 113.70 : Time 2.08s : 35117.11 words/s
[2017-10-19 22:52:41] Ep. 1 : Up. 70 : Sen. 28881 : Cost 160.09 : Time 3.26s : 26811.33 words/s
[2017-10-19 22:52:43] Ep. 1 : Up. 80 : Sen. 33160 : Cost 94.02 : Time 1.97s : 30898.42 words/s
[2017-10-19 22:52:46] Ep. 1 : Up. 90 : Sen. 37001 : Cost 134.16 : Time 2.56s : 29507.47 words/s
[2017-10-19 22:52:48] Ep. 1 : Up. 100 : Sen. 41500 : Cost 118.34 : Time 2.77s : 28487.18 words/s
[2017-10-19 22:52:51] Ep. 1 : Up. 110 : Sen. 45760 : Cost 95.78 : Time 2.68s : 23694.72 words/s
[2017-10-19 22:52:54] Ep. 1 : Up. 120 : Sen. 49590 : Cost 133.20 : Time 2.64s : 29071.47 words/s
[2017-10-19 22:52:56] Ep. 1 : Up. 130 : Sen. 53792 : Cost 108.59 : Time 2.20s : 32078.57 words/s
[2017-10-19 22:52:59] Ep. 1 : Up. 140 : Sen. 57557 : Cost 129.48 : Time 2.87s : 25614.82 words/s
[2017-10-19 22:53:02] Ep. 1 : Up. 150 : Sen. 61770 : Cost 120.47 : Time 2.94s : 26475.86 words/s
[2017-10-19 22:53:04] Ep. 1 : Up. 160 : Sen. 65855 : Cost 130.29 : Time 2.33s : 35191.87 words/s
[2017-10-19 22:53:06] Ep. 1 : Up. 170 : Sen. 70690 : Cost 86.19 : Time 2.40s : 28417.73 words/s
[2017-10-19 22:53:10] Ep. 1 : Up. 180 : Sen. 74830 : Cost 119.75 : Time 3.07s : 25329.12 words/s
terminate called after throwing an instance of 'util::Exception'
what(): marian-dev/src/kernels/cuda_helpers.h:13 in void gpuAssert(cudaError_t, const char*, int, bool) threw util::Exception.
GPUassert: unspecified launch failure marian-dev/src/tensors/tensor.cu 70
Aborted (core dumped)
My configuration:
../../build/marian \
--type amun \
--model models_amun/512-1024-ch_en.npz \
--devices $@ \
--seed 0 \
--dim-emb 512 \
--dim-rnn 1024 \
--train-set corpus/train0825_CE/c2e_0926/train0825.src.ch.0926.filter.bpe_integrate corpus/train0825_CE/c2e_0926/train0825.tar.en.0926.filter.bpe_integrate \
--vocabs corpus/train0825_CE/c2e_0926/train0825.src.ch.0926.filter.bpe_integrate.pkl.json corpus/train0825_CE/c2e_0926/train0825.tar.en.0926.filter.bpe_integrate.pkl.json \
--dim-vocabs 48550 31800 \
--disp-freq 10 \
--save-freq 2000 \
--learn-rate 0.001 \
--max-length 50 \
--optimizer adam \
--mini-batch 256 \
--maxi-batch 20 \
--dropout-rnn 0.2 \
--dropout-src 0 \
--dropout-trg 0 \
--tempdir tmp \
--mini-batch-fit \
--workspace 9500 \
--clip-norm 5. \
--layer-normalization \
--after-batches 20000000
@emjotde Does the problem still exist? As soon as I start the training process, it is interrupted by the errors mentioned above.
Give me a few more days. I am still trying to figure out what is actually going on.
OK, thanks. And are there some design documents to help others learn the code more quickly?
I've added a couple of new guards against NaNs and I think we have a bit better error handling/logging now. Might be worth it to try the newest master in marian-dev.
@emjotde Now I updated the marian-dev code and started the training program, but I hit the error again:
[2017-10-24 10:08:52] Ep. 1 : Up. 590 : Sen. 308701 : Cost 85.70 : Time 6.02s : 15248.24 words/s
[2017-10-24 10:08:59] Ep. 1 : Up. 600 : Sen. 313980 : Cost 92.22 : Time 6.55s : 15170.45 words/s
[2017-10-24 10:09:05] Ep. 1 : Up. 610 : Sen. 319450 : Cost 78.50 : Time 5.56s : 16362.45 words/s
[2017-10-24 10:09:11] Ep. 1 : Up. 620 : Sen. 324580 : Cost 88.74 : Time 6.43s : 14500.57 words/s
[2017-10-24 10:09:17] Ep. 1 : Up. 630 : Sen. 329316 : Cost 88.10 : Time 5.97s : 14635.18 words/s
[2017-10-24 10:09:23] Ep. 1 : Up. 640 : Sen. 334180 : Cost 79.02 : Time 6.09s : 13384.21 words/s
[2017-10-24 10:09:31] Ep. 1 : Up. 650 : Sen. 340010 : Cost 86.61 : Time 7.47s : 14330.95 words/s
terminate called after throwing an instance of 'util::Exception'
what(): marian-dev/src/kernels/cuda_helpers.h:13 in void gpuAssert(cudaError_t, const char*, int, bool) threw util::Exception.
GPUassert: an illegal memory access was encountered marian-dev/src/tensors/tensor.cu 25
Aborted (core dumped)
Maybe the GPU memory wasn't allocated correctly, or some exception is hidden somewhere.
My problem is I cannot reproduce this on any machine. I even tried running it on 8 GPUs and it works. What is your current setting for -w (--workspace) and the number of devices?
@emjotde Hi, this is my configuration:
../../build/marian \
--type amun \
--model models_amun/512-1024-ch_en.npz \
--devices $@ \
--seed 0 \
--dim-emb 512 \
--dim-rnn 1024 \
--train-set train0825.src.ch.0926.filter.bpe_integrate train0825.tar.en.0926.filter.bpe_integrate \
--vocabs train0825.src.ch.0926.filter.bpe_integrate.pkl.json train0825.tar.en.0926.filter.bpe_integrate.pkl.json \
--dim-vocabs 48550 31800 \
--disp-freq 10 \
--save-freq 2000 \
--learn-rate 0.0001 \
--max-length 50 \
--optimizer adam \
--mini-batch 256 \
--maxi-batch 20 \
--dropout-rnn 0.2 \
--dropout-src 0 \
--dropout-trg 0 \
--tempdir tmp \
--mini-batch-fit \
--workspace 15000 \
--clip-norm 5. \
--layer-normalization \
--after-batches 20000000
Run command: nohup ./amun_train.sh 0 1 & and the memory of each card is 22912MiB.
Can you post the output of nvidia-smi?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:04:00.0 Off | 0 |
| N/A 31C P0 49W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:06:00.0 Off | 0 |
| N/A 34C P0 50W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P40 Off | 00000000:07:00.0 Off | 0 |
| N/A 25C P0 49W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P40 Off | 00000000:08:00.0 Off | 0 |
| N/A 27C P0 50W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla P40 Off | 00000000:0C:00.0 Off | 0 |
| N/A 27C P0 49W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla P40 Off | 00000000:0D:00.0 Off | 0 |
| N/A 27C P0 49W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla P40 Off | 00000000:0E:00.0 Off | 0 |
| N/A 25C P0 49W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla P40 Off | 00000000:0F:00.0 Off | 0 |
| N/A 22C P0 50W / 250W | 0MiB / 22912MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Is this with CUDA 9.0?
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Sep__4_22:14:01_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44
OK, will then try the recent driver and report back.
I have the new driver. Training on a K80 node with 8 GPUs with 12GB each, using all the GPUs. I cannot reproduce this behaviour with -w 8500. Does it always happen that early into training, after just a couple of hundred iterations? Does it happen with smaller values like -w 4000?
[2017-10-14 02:03:19] Ep. 2 : Up. 116760 : Sen. 6962176 : Cost 111.24 : Time 3.57s : 11717.49 words/s
[2017-10-14 02:03:22] Ep. 2 : Up. 116770 : Sen. 6964736 : Cost 145.96 : Time 3.45s : 14980.89 words/s
[2017-10-14 02:03:26] Ep. 2 : Up. 116780 : Sen. 6967296 : Cost 139.78 : Time 4.01s : 12567.44 words/s
[2017-10-14 02:03:30] Ep. 2 : Up. 116790 : Sen. 6969856 : Cost 87.32 : Time 3.71s : 9522.98 words/s
[2017-10-14 02:03:33] Ep. 2 : Up. 116800 : Sen. 6972416 : Cost nan : Time 2.74s : 19413.05 words/s
[2017-10-14 02:03:36] Ep. 2 : Up. 116810 : Sen. 6974976 : Cost nan : Time 3.23s : 13540.76 words/s
[2017-10-14 02:03:40] Ep. 2 : Up. 116820 : Sen. 6977536 : Cost nan : Time 3.95s : 13095.72 words/s
[2017-10-14 02:03:43] Ep. 2 : Up. 116830 : Sen. 6980096 : Cost nan : Time 3.20s : 11438.62 words/s
When training updates reach 116800, the cost becomes nan and the model is invalid. Has anyone met the same problem? PS: the scale of the corpus is about 24M, trained multi-GPU. @emjotde