Closed. FutureShaper closed this issue 7 years ago.
Hi, are you on current master? We fixed a bug like that recently. And to understand better: does it work for beam sizes other than 6 and fail specifically for 6?
It seems to fail for beam-size < 7.
Ok, I will try again with new clone from master.
Hi, do you mean that it is working for beam-size > 7?
Yes, beam-size >=7 (>6) is fine.
I cloned again with -b master, but I'm still getting the error.
@tomekd Sounds again like an nth-element issue.
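For context, the nth-element step being blamed here is the beam-search selection: from a beam_size x vocab_size matrix of scores, pick the k best (hypothesis, word) pairs for the next beam. A minimal CPU sketch of that selection (the real kernel is CUDA inside amun; this Python version is only illustrative):

```python
import heapq

def topk_pairs(scores, k):
    """Pick the k highest-scoring (hypothesis, word) pairs from a
    beam_size x vocab_size score matrix -- the selection step that
    amun's GPU nth_element kernel performs at each decoding step."""
    flat = ((score, hyp, word)
            for hyp, row in enumerate(scores)
            for word, score in enumerate(row))
    best = heapq.nlargest(k, flat)  # sorted descending by score
    return [(hyp, word, score) for score, hyp, word in best]

# Toy example: a beam of 2 hypotheses over a vocabulary of 4 words.
scores = [[0.1, 0.9, 0.3, 0.2],
          [0.8, 0.4, 0.7, 0.6]]
print(topk_pairs(scores, 3))  # -> [(0, 1, 0.9), (1, 0, 0.8), (1, 2, 0.7)]
```

A GPU implementation typically partitions this selection across thread blocks, which is where a beam-size-dependent indexing bug of the kind reported here could hide.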
I'm having the same problem on a different setup (with ensembling of 8 models). There is only one sentence that fails (in ~500), and slight changes to it (e.g. removing a whitespace) fix the problem. If I use the same problematic sentence with only 1 model, keeping every other configuration, the error goes away.
I pushed a bug fix; it seems it was lost during merging. Check it out again, please.
I just tried with the latest push (excluding the problematic logging-related commits, as discussed in the other issue) and got the same error.
Is it possible to make your models and input file available for download so I can replicate it?
I get a similar failure. This problem might be related to batch size. The behavior can be reproduced as follows (check out the latest version of master to get the script amun.py):
cd /path/to/your/amunmt/build/dir
make -j python
export AMUN_PYLIB_DIR=$(pwd)/src
mkdir issue50
${AMUNMT_ROOT}/scripts/download_models.py -w issue50
cd issue50
echo "one day , Los Angeles Times colum@@ n@@ ist Steve Lopez was walking along the streets of downtown Los Angeles when he heard beautiful music . <ted>" > input.txt
${AMUNMT_ROOT}/scripts/amun.py -m model.npz -s vocab.en.json -t vocab.de.json < input.txt
This works, but you get a cudaFree error if you set batch size to 2:
${AMUNMT_ROOT}/scripts/amun.py -m model.npz -s vocab.en.json -t vocab.de.json --mini-batch 2 --maxi-batch 2 < input.txt
=>
terminate called after throwing an instance of 'thrust::system::system_error'
what(): cudaFree in free: an illegal memory access was encountered
Bizarrely, I get this error only on this or similar input. Having played around with variants, I suspect that the period in the middle of the sentence is also key to success or failure. If we remove the period, things work. If we remove \<ted>, things also work. If we replace \<ted> by another (known) word, things fail.
(For those with access to valhalla: this happened on baldur)
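Rather than editing the sentence by hand, the trigger can be narrowed down systematically: generate every leave-one-token-out variant of the failing input and feed each one to amun separately. A sketch (the amun invocation in the comment is copied from the repro above and may need adapting):

```python
# The sentence from the reproduction above that triggers the crash.
SENT = ("one day , Los Angeles Times colum@@ n@@ ist Steve Lopez was walking "
        "along the streets of downtown Los Angeles when he heard beautiful "
        "music . <ted>")

def leave_one_out(sentence):
    """Yield (removed_token, variant) pairs, each variant dropping
    exactly one token, to isolate which token triggers the crash."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        yield tok, " ".join(tokens[:i] + tokens[i + 1:])

# Each variant can then be piped into amun one at a time, e.g.:
#   echo "$variant" | ${AMUNMT_ROOT}/scripts/amun.py -m model.npz \
#       -s vocab.en.json -t vocab.de.json --mini-batch 2 --maxi-batch 2
for tok, variant in leave_one_out(SENT):
    print(f"without {tok!r}: {len(variant.split())} tokens")
```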
Can you please make the model available for download so I can test it myself?
The step
${AMUNMT_ROOT}/scripts/download_models.py -w issue50
will download the default models (en->de) into the directory issue50. Look at download_models.py to see what happens.
It's working OK for me on my machine:
${AMUNMT_ROOT}/scripts/amun.py -m model.npz -s vocab.en.json -t vocab.de.json --mini-batch 2 --maxi-batch 2 < input.txt
Best translation: eines Tages war Los Angeles Times Kolum@@ n@@ ist Steve Lopez auf den Straßen der Innenstadt von Los Angeles spazieren , als er schöne Musik hörte .
I get a compile error on baldur, prob 'cos I need some special path flags.
I've pushed the 'nothrust' branch I've been working on that may fix this. Can you see if it works for you
You probably need these set for compilation on valhalla:
export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LIBRARY_PATH
-- Ulrich Germann Senior Researcher School of Informatics University of Edinburgh
gotcha, still crashes with my branch
as in "amun crashes" or "compilation crashes"?
amun crashes. I've got an inkling of what it could be but it's gonna take a while to fix
I've pushed a branch which contains lots of bug fixes. It probably fixes this particular issue too. @smatthay @ugermann please let me know if it works for you. The branch is called 3d5.
Lots of stuff going on there. Just to make sure, can we keep this out of master until after the release?
sure
I experienced the same issue as @smatthay (and I still do with the master branch), but the issue is gone using branch 3d5.
Any chance that the fix will be merged into the master branch?
good idea. Done.
'Cos it's a large change, you might want to do a new git clone, cmake, make, just so there are no compile problems.
Please let me know if it runs so we can close this ticket
Hi Hieu, I have a freshly cloned and compiled AmuNMT, but the issue is still there unfortunately. To sum it up, I am trying beam size 2 and 8. The CPU version works with both, the GPU version only works with beam size 8. For 2, I get the following:
ERROR: an illegal memory access was encountered in /data/miriam/amunmt-test/amunmt/src/amun/gpu/mblas/nth_element.cu at line 334
ERROR: an illegal memory access was encountered in /data/miriam/amunmt-test/amunmt/src/amun/./gpu/mblas/matrix.h at line 158
terminate called after throwing an instance of 'thrust::system::system_error'
what(): cudaFree in free: an illegal memory access was encountered
Aborted (core dumped)
Can you make your models available for download so I can replicate the problem?
The weird thing is that decoding with small beam sizes works with all models except one. All these models have been trained in a similar fashion (with Nematus), so I really do not know why one should be different from the others...
And I am afraid I can't share this problematic model because it is trained on corporate data. :\
I understand, though it will be difficult to fix without replicating the problem.
You said that for one of your models the issue was fixed in the 3d5 branch, but not in the original master? Does that model run on the current master?
@hieuhoang if you removed thrust, how can there still be a thrust-related error?
@mkaeshammer Maybe ask someone higher up. It is impossible to reverse NMT models, so they might be OK to share?
I didn't remove all thrust, just thrust in the Matrix class
Are there instances of thrust::device_vector? They have to be replaced with matrices.
go for it.
From what I've seen, though, the issues are old-fashioned bugs, e.g. reading past the end of vectors. It requires debugging, not blindly replacing thrust with something else.
@emjotde @hieuhoang I "anonymized" the vocabulary files by replacing the entries with numbers, and now got permission to share :) You should receive an email soon with the details. (Let me know if anybody else needs access or if you experience problems.)
Here is a summary of the issue:
Let me know if you are missing any details.
@mkaeshammer thanks. Can you please also send the model to @tomekd
Done.
@mkaeshammer If you're still interested in a fix for your issue, I think I've solved it. For now, the code is in my repository https://github.com/hieuhoang/amunmt in the 3d6 branch.
Please let me know if it works for you
I think the segfaults have been fixed. Please git pull and try again. I'll close this issue if there are no further reports
I'm getting the same error with this model:
https://drive.google.com/file/d/0Bzz3wLacJ7WocGd5LUlCbkYyNmM/view?usp=sharing
while running amun on the validation data:
https://drive.google.com/file/d/0Bzz3wLacJ7WoNHA4TnNQU0VQdnM/view?usp=sharing
with:
$ cat $(pwd)/valid.src | $AMUN -c $(pwd)/toymodel/model.npz.amun.yml -m $(pwd)/toymodel/model.iter3000.npz -d 0 1 2 3 -b 12 -n --mini-batch 10 --maxi-batch 1000
[out]:
[Wed July 12 15:54:31 2017] (I) Options: allow-unk: false
beam-size: 12
cpu-threads: 0
devices: [0, 1, 2, 3]
gpu-threads: 1
log-info: true
log-progress: true
max-length: 500
maxi-batch: 1000
mini-batch: 10
n-best: false
no-debpe: false
normalize: true
relative-paths: false
return-alignment: false
return-soft-alignment: false
scorers:
F0:
path: /home/liling/toymodel/model.iter3000.npz
type: Nematus
show-weights: false
softmax-filter:
[]
source-vocab: /home/liling/toymodel//vocab.src.yml
target-vocab: /home/liling/toymodel//vocab.trg.yml
weights:
F0: 1
wipo: false
[Wed July 12 15:54:31 2017] (I) Loading scorers...
[Wed July 12 15:54:31 2017] (I) Loading model /home/liling/toymodel/model.iter3000.npz onto gpu 0
[Wed July 12 15:54:31 2017] (I) Loading model /home/liling//toymodel/model.iter3000.npz onto gpu 1
[Wed July 12 15:54:31 2017] (I) Loading model /home/liling/toymodel/model.iter3000.npz onto gpu 2
[Wed July 12 15:54:31 2017] (I) Loading model /home/liling//toymodel/model.iter3000.npz onto gpu 3
[Wed July 12 15:54:35 2017] (I) Reading from stdin
[Wed July 12 15:54:35 2017] (I) Setting CPU thread count to 0
[Wed July 12 15:54:35 2017] (I) Setting GPU thread count to 1
[Wed July 12 15:54:35 2017] (I) Total number of threads: 4
[Wed July 12 15:54:35 2017] (I) Reading input
terminate called after throwing an instance of 'thrust::system::system_error'
what(): cudaFree in free: an illegal memory access was encountered
Aborted (core dumped)
My amun build comes from 5454c11a471ec9e258a81b8e5ac4d5bb01816738
$ git log
commit 5454c11a471ec9e258a81b8e5ac4d5bb01816738
Author: Tomasz Dwojak <t.dwojak@amu.edu.pl>
Date: Tue May 30 10:14:36 2017 +0000
Remove redundant God in Scorer::Decode()
Git pull
Ps. Your toy model has no toys
Whoops, my bad. Pulling and re-making to try again.
See new comment.
Building from 54e92cf57a40f6e053388d491005dd264ee58e54
$ git log
commit 54e92cf57a40f6e053388d491005dd264ee58e54
Author: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Date: Tue Jul 11 18:55:23 2017 +0000
reverted to older spdlog
The model (new link): https://drive.google.com/file/d/0Bzz3wLacJ7WoR0JSS3ZOeXc3T3M/view?usp=sharing
and the validation data valid.src:
https://drive.google.com/file/d/0Bzz3wLacJ7WoNHA4TnNQU0VQdnM/view?usp=sharing
A similar error occurs on the command:
$ cat valid.src | \
> $AMUN -c toymodel/model.npz.amun.yml -m toymodel/model.iter1000.npz -b 12 -n --mini-batch 10 --maxi-batch 1000
[out]:
[Wed July 12 16:42:11 2017] (I) Options: allow-unk: false
beam-size: 12
cpu-threads: 0
devices: [0, 1, 2, 3]
gpu-threads: 1
log-info: true
log-progress: true
max-length: 500
maxi-batch: 1000
mini-batch: 10
n-best: false
no-debpe: false
normalize: true
relative-paths: false
return-alignment: false
scorers:
F0:
path: toymodel/model.iter1000.npz
type: Nematus
show-weights: false
softmax-filter:
[]
source-vocab: /home/liling/toymodel//vocab.src.yml
target-vocab: /home/liling/toymodel//vocab.trg.yml
weights:
F0: 1
wipo: false
[Wed July 12 16:42:11 2017] (I) Loading scorers...
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 0
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 1
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 2
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 3
[Wed July 12 16:42:15 2017] (I) Reading from stdin
[Wed July 12 16:42:15 2017] (I) Setting CPU thread count to
[Wed July 12 16:42:15 2017] (I) Setting GPU thread count to
[Wed July 12 16:42:15 2017] (I) Total number of threads:
[Wed July 12 16:42:15 2017] (I) Reading input
ERROR: an illegal memory access was encountered in /home/liling/marian/src/gpu/mblas/nth_element.cu at line 331
terminate called after throwing an instance of 'thrust::system::system_error'
what(): cudaFree in free: an illegal memory access was encountered
terminate called recursively
After re-running the amun command several times, it sometimes throws a different error:
[Wed July 12 16:42:11 2017] (I) Loading scorers...
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 0
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 1
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 2
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 3
[Wed July 12 16:42:15 2017] (I) Reading from stdin
[Wed July 12 16:42:15 2017] (I) Setting CPU thread count to
[Wed July 12 16:42:15 2017] (I) Setting GPU thread count to
[Wed July 12 16:42:15 2017] (I) Total number of threads:
[Wed July 12 16:42:15 2017] (I) Reading input
ERROR: an illegal memory access was encountered in /home/liling/marian/src/gpu/mblas/nth_element.cu at line 293
terminate called after throwing an instance of 'thrust::system::system_error'
what(): cudaFree in free: an illegal memory access was encountered
terminate called recursively
Sometimes there is no indication of the line number, just termination:
[Wed July 12 16:42:11 2017] (I) Loading scorers...
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 0
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 1
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 2
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 3
[Wed July 12 16:42:15 2017] (I) Reading from stdin
[Wed July 12 16:42:15 2017] (I) Setting CPU thread count to
[Wed July 12 16:42:15 2017] (I) Setting GPU thread count to
[Wed July 12 16:42:15 2017] (I) Total number of threads:
[Wed July 12 16:42:15 2017] (I) Reading input
terminate called after throwing an instance of 'thrust::system::system_error'
what(): cudaFree in free: an illegal memory access was encountered
terminate called recursively
It runs for me with the master branch, on 2 GPUs, and it gives the same error as you on the un-updated branch (3d6_premerge).
This is the command I run, similar to your command: cat valid.src | $MARIAN/build/amun -c toymodel/model.npz.amun.yml -m toymodel/model.iter1000.npz -b 12 -n --mini-batch 10 --maxi-batch 1000 --devices 0 2
Make absolutely sure you are using the up-to-date code:
cd build
rm -rf *
cmake ..
make -j
also try using 1 gpu, then 2, then 3, and let me know if it works
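To automate "try 1 GPU, then 2, then 3" across several beam sizes, a small sweep harness can record which configurations crash. This is a sketch; the binary path is an assumption and the model filenames mirror the commands above but may need adjusting:

```python
import itertools
import subprocess

AMUN = "build/amun"  # path to your amun binary (assumption)

def amun_cmd(devices, beam):
    """Assemble the amun command line for a device list and beam size,
    mirroring the command used in this thread."""
    return [AMUN, "-c", "toymodel/model.npz.amun.yml",
            "-m", "toymodel/model.iter1000.npz",
            "-b", str(beam), "-n",
            "--mini-batch", "10", "--maxi-batch", "1000",
            "--devices"] + [str(d) for d in devices]

def sweep(max_gpus=3, beams=(2, 6, 7, 12), src="valid.src"):
    """Run every (device count, beam size) combination and collect the
    ones that exit non-zero, i.e. crash."""
    failures = []
    for n, b in itertools.product(range(1, max_gpus + 1), beams):
        with open(src) as f:
            rc = subprocess.run(amun_cmd(list(range(n)), b), stdin=f,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL).returncode
        if rc != 0:
            failures.append({"gpus": n, "beam": b, "exit": rc})
    return failures
```

Calling sweep() on a machine with the toy model in place would then report exactly which device/beam combinations still abort.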
Thanks @hieuhoang !!
I didn't build properly previously, force removing the build and re-compiling + re-making worked =)
I've tried all parameters; it worked for all --devices settings.
Cheers. I'm gonna make a reg test with your model so amun doesn't fall out of bed again. Howl if you object.
No objections =)
Thank you again!!
Finally found the time to test as well. No error anymore. Great work - thank you very much!!
Hi there!
To reproduce this issue:
# performance settings
beam-size: 7
devices: [7] # array of gpu devices
normalize: yes
threads-per-device: 1
threads: 1
mode: CPU
gpu-threads: 1
cpu-threads: 0

# scorer configuration
scorers:
  F0:
    path:
    type: Nematus

# scorer weights
weights:
  F0: 1.0

# vocabularies
source-vocab:
target-vocab:
from websocket import create_connection
import time

start_time = time.time()
with open("") as f:
    ws = create_connection("ws://localhost:8080/translate")
    for line in f:
        print("Translating the following line:")
        print(line.rstrip())
        ws.send(line)
        print("Target translation:")
        result = ws.recv()
        print(result)
    ws.close()
... beam-size: 6 ...
terminate called after throwing an instance of 'thrust::system::system_error'
what(): cudaFree in free: an illegal memory access was encountered