marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

cuda: an illegal memory access was encountered #50

Closed FutureShaper closed 7 years ago

FutureShaper commented 7 years ago

Hi there!

To reproduce this issue:

  1. Create config.yml:
    
    # Paths are relative to config file location
    relative-paths: no

    # performance settings
    beam-size: 7
    devices: [7] # array of gpu devices
    normalize: yes
    threads-per-device: 1
    threads: 1
    mode: CPU
    gpu-threads: 1
    cpu-threads: 0

    # scorer configuration
    scorers:
      F0:
        path:
        type: Nematus

    # scorer weights
    weights:
      F0: 1.0

    # vocabularies
    source-vocab:
    target-vocab:


2. Create script "latency_client.py"

from websocket import create_connection
import time

start_time = time.time()

with open("") as f:
    ws = create_connection("ws://localhost:8080/translate")
    for line in f:
        print("Translating the following line:")
        print(line.rstrip())
        ws.send(line)
        print("Target translation:")
        result = ws.recv()
        print(result)
    ws.close()


3. Start server
`python scripts/amunmt_server.py -c config_gpu.yml -p 8080`

4. Run "latency_client.py"
`python latency_client.py`

5. See it work.

6. Change config.yml: "beam-size: 6"

... beam-size: 6 ...


7. Start server

8. Run "latency_client.py"

9. Server-side error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  cudaFree in free: an illegal memory access was encountered



(10. Client-side error)
`websocket._exceptions.WebSocketConnectionClosedException: Connection is already closed.`
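
For context, here is a minimal sketch of how the client from step 2 could time each request and report the dropped connection instead of dying with the exception above. It is only a sketch: it assumes the same ws://localhost:8080/translate endpoint, and "input.txt" is a hypothetical stand-in for the blank open("") in the original script.

import time
from websocket import create_connection
from websocket._exceptions import WebSocketConnectionClosedException

ws = create_connection("ws://localhost:8080/translate")
with open("input.txt") as f:  # hypothetical input file
    for line in f:
        start = time.time()
        try:
            ws.send(line)
            result = ws.recv()  # blocks until the server answers
        except WebSocketConnectionClosedException:
            print("server dropped the connection (it probably crashed) on: " + line.rstrip())
            break
        print("%.3fs  %s" % (time.time() - start, result.strip()))
try:
    ws.close()
except WebSocketConnectionClosedException:
    pass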

Thanks!
Simon
emjotde commented 7 years ago

Hi, Current master? We fixed a bug like that recently. And to understand better, it works for beam-sizes other than 6 and fails explicitly for 6?

FutureShaper commented 7 years ago

It seems to fail for beam-size < 7.

Ok, I will try again with new clone from master.

tomekd commented 7 years ago

hi, do you mean that it works for beam-size > 7?

FutureShaper commented 7 years ago

Yes, beam-size >=7 (>6) is fine.

I cloned again with -b master, but I'm still getting the error.

emjotde commented 7 years ago

@tomekd Sounds again like an nth-element issue.

josemonteiro commented 7 years ago

I'm having the same problem on a different setup (with ensembling of 8 models). There is only one sentence that fails (in ~500), and slight changes to it (e.g. removing a whitespace) fix the problem. If I use the same problematic sentence with only 1 model, keeping every other configuration, the error goes away.

tomekd commented 7 years ago

I pushed a bug fix. It seems that it was lost during merging. Check it out again, please.

josemonteiro commented 7 years ago

I just tried with the latest push (excluding the problematic logging-related commits, as discussed in the other issue) and got the same error.

hieuhoang commented 7 years ago

is it possible to make your models and input file available for download so I can replicate it

ugermann commented 7 years ago

I get a similar failure. This problem might be related to batch size. The behavior can be reproduced as follows (check out the latest version of master to get the script amun.py):

cd /path/to/your/amunmt/build/dir
make -j python
export AMUN_PYLIB_DIR=$(pwd)/src
mkdir issue50
${AMUNMT_ROOT}/scripts/download_models.py -w issue50
cd issue50
echo "one day , Los Angeles Times colum@@ n@@ ist Steve Lopez was walking along the streets of downtown Los Angeles when he heard beautiful music . <ted>" > input.txt
${AMUNMT_ROOT}/scripts/amun.py -m model.npz -s vocab.en.json -t vocab.de.json < input.txt

This works, but you get a cudaFree error if you set batch size to 2:

${AMUNMT_ROOT}/scripts/amun.py -m model.npz -s vocab.en.json -t vocab.de.json --mini-batch 2 --maxi-batch 2 < input.txt

=>

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  cudaFree in free: an illegal memory access was encountered

Bizarrely, I get this error only on this or similar input. Having played around with variants, I suspect that the period in the middle of the sentence is also key to success or failure. If we remove the period, things work. If we remove <ted>, things also work. If we replace <ted> by another (known) word, things fail.
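
To make that kind of experimentation repeatable, here is a rough sketch (not part of the original report) that pushes a few edited variants of the sentence through the failing amun.py call from the steps above and reports which ones crash. It assumes the same model, vocabularies, and batch flags as the repro, run from inside the issue50 directory.

import os
import subprocess

AMUNMT_ROOT = os.environ.get("AMUNMT_ROOT", ".")
CMD = [os.path.join(AMUNMT_ROOT, "scripts/amun.py"),
       "-m", "model.npz", "-s", "vocab.en.json", "-t", "vocab.de.json",
       "--mini-batch", "2", "--maxi-batch", "2"]

sentence = ("one day , Los Angeles Times colum@@ n@@ ist Steve Lopez was walking "
            "along the streets of downtown Los Angeles when he heard beautiful music . <ted>")

variants = {
    "original": sentence,
    "no period": sentence.replace(" .", ""),
    "no <ted>": sentence.replace(" <ted>", ""),
    "other word": sentence.replace("<ted>", "music"),
}

for name, text in variants.items():
    # One variant per run, fed on stdin exactly like the repro command above.
    proc = subprocess.run(CMD, input=text + "\n",
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    status = "ok" if proc.returncode == 0 else "CRASH (exit %d)" % proc.returncode
    print("%-12s %s" % (name, status))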

(For those with access to valhalla: this happened on baldur)

hieuhoang commented 7 years ago

can you please make the model available for download so I can test it myself

ugermann commented 7 years ago

The step

${AMUN_ROOT}/scripts/download_models.py -w issue50 

WILL download default models (en->de) into the directory issue50. Look at download_models.py to see what happens.

hieuhoang commented 7 years ago

it's working ok for me on my machine:

${AMUNMT_ROOT}/scripts/amun.py -m model.npz -s vocab.en.json -t vocab.de.json --mini-batch 2 --maxi-batch 2 < input.txt
....
Best translation: eines Tages war Los Angeles Times Kolum@@ n@@ ist Steve Lopez auf den Straßen der Innenstadt von Los Angeles spazieren , als er schöne Musik hörte .
['eines Tages war Los Angeles Times Kolum@@ n@@ ist Steve Lopez auf den Stra\xc3\x9fen der Innenstadt von Los Angeles spazieren , als er sch\xc3\xb6ne Musik h\xc3\xb6rte .\n']

I get a compile error on baldur, prob 'cos I need some special path flags.

I've pushed the 'nothrust' branch I've been working on that may fix this. Can you see if it works for you

ugermann commented 7 years ago

You probably need these set for compilation on valhalla:

export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LIBRARY_PATH


hieuhoang commented 7 years ago

gotcha, still crashes with my branch

ugermann commented 7 years ago

as in "amun crashes" or "compilation crashes"?


hieuhoang commented 7 years ago

amun crashes. I've got an inkling of what it could be but it's gonna take a while to fix

hieuhoang commented 7 years ago

I've pushed a branch which contains lots of bug fixes. It probably fixes this particular issue too. @smatthay @ugermann please let me know if it works for you. The branch is called 3d5

emjotde commented 7 years ago

Lots of stuff going on there. Just to make sure, can we keep this out of master until after the release?

hieuhoang commented 7 years ago

sure

mkaeshammer commented 7 years ago

I experienced the same issue as @smatthay (and I still do with the master branch), but the issue is gone using branch 3d5.

mkaeshammer commented 7 years ago

Any chance that the fix will be merged into the master branch?

hieuhoang commented 7 years ago

good idea. Done.

'Cos it's a large change, you might want to do a new git clone, cmake, make. Just so there are no compile problems.

Please let me know if it runs so we can close this ticket

mkaeshammer commented 7 years ago

Hi Hieu, I have a freshly cloned and compiled AmuNMT, but the issue is still there, unfortunately. To sum it up, I am trying beam sizes 2 and 8. The CPU version works with both; the GPU version only works with beam size 8. For 2, I get the following:

ERROR: an illegal memory access was encountered in /data/miriam/amunmt-test/amunmt/src/amun/gpu/mblas/nth_element.cu at line 334
ERROR: an illegal memory access was encountered in /data/miriam/amunmt-test/amunmt/src/amun/./gpu/mblas/matrix.h at line 158
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  cudaFree in free: an illegal memory access was encountered
Aborted (core dumped)

hieuhoang commented 7 years ago

Can you make your models available for download so I can replicate the problem

mkaeshammer commented 7 years ago

The weird thing is that decoding with small beam sizes works with all models except one. All these models have been trained in a similar fashion (with Nematus). So I really do not know why one should be different than the others...

And I am afraid I can't share this problematic model because it is trained on corporate data. :\

hieuhoang commented 7 years ago

I understand, though it will be difficult to fix without replicating the problem.

You said that for 1 of your models the issue was fixed in the 3d5 branch, but not in the original master? Does that model run in the current master?

emjotde commented 7 years ago

@hieuhoang if you removed thrust, how can there still be a thrust-related error?

@mkaeshammer Maybe ask someone higher up. It is impossible to reverse NMT models, so they might be OK to share?

hieuhoang commented 7 years ago

I didn't remove all thrust, just thrust in the Matrix class

tomekd commented 7 years ago

Are there instances of thrust::device_vector? They have to be replaced with matrices.

hieuhoang commented 7 years ago

go for it.

From what I've seen, though, the issues are old-fashioned bugs, e.g. reading past the end of vectors. It requires debugging, not blindly replacing thrust with something else.

mkaeshammer commented 7 years ago

@emjotde @hieuhoang I "anonymized" the vocabulary files by replacing the entries with numbers, and now got permission to share :) You should receive an email soon with the details. (Let me know if anybody else needs access or if you experience problems.)

Here is a summary of the issue:

Let me know if you are missing any details.
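
For anyone who needs to do the same, the anonymization mentioned above only takes a few lines. This is a sketch under the assumption of a Nematus-style JSON vocabulary that maps tokens to integer ids; the file names are hypothetical.

import json

# Load a Nematus-style vocabulary: {"token": id, ...} (hypothetical file name).
with open("vocab.src.json") as f:
    vocab = json.load(f)

# Replace every token with its id rendered as a string, keeping the ids intact,
# so the anonymized vocabulary still lines up with the trained model.
anon = {str(idx): idx for idx in vocab.values()}

with open("vocab.src.anon.json", "w") as f:
    json.dump(anon, f, indent=2)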

hieuhoang commented 7 years ago

@mkaeshammer thanks. Can you please also send the model to @tomekd

mkaeshammer commented 7 years ago

Done.

hieuhoang commented 7 years ago

@mkaeshammer If you're still interested in a fix for your issue, I think I've solved it. For now, the code is in my repository https://github.com/hieuhoang/amunmt in the 3d6 branch.

Please let me know if it works for you

hieuhoang commented 7 years ago

I think the segfaults have been fixed. Please git pull and try again. I'll close this issue if there are no further reports

alvations commented 7 years ago

I'm getting the same error with this model: https://drive.google.com/file/d/0Bzz3wLacJ7WocGd5LUlCbkYyNmM/view?usp=sharing

while running amun on the validation data: https://drive.google.com/file/d/0Bzz3wLacJ7WoNHA4TnNQU0VQdnM/view?usp=sharing

with:

$ cat $(pwd)/valid.src | $AMUN -c $(pwd)/toymodel/model.npz.amun.yml -m $(pwd)/toymodel/model.iter3000.npz -d 0 1 2 3 -b 12 -n --mini-batch 10 --maxi-batch 1000

[out]:

[Wed July 12 15:54:31 2017] (I) Options: allow-unk: false
beam-size: 12
cpu-threads: 0
devices: [0, 1, 2, 3]
gpu-threads: 1
log-info: true
log-progress: true
max-length: 500
maxi-batch: 1000
mini-batch: 10
n-best: false
no-debpe: false
normalize: true
relative-paths: false
return-alignment: false
return-soft-alignment: false
scorers:
  F0:
    path: /home/liling/toymodel/model.iter3000.npz
    type: Nematus
show-weights: false
softmax-filter:
  []
source-vocab: /home/liling/toymodel//vocab.src.yml
target-vocab: /home/liling/toymodel//vocab.trg.yml
weights:
  F0: 1
wipo: false

[Wed July 12 15:54:31 2017] (I) Loading scorers...
[Wed July 12 15:54:31 2017] (I) Loading model /home/liling/toymodel/model.iter3000.npz onto gpu 0
[Wed July 12 15:54:31 2017] (I) Loading model /home/liling//toymodel/model.iter3000.npz onto gpu 1
[Wed July 12 15:54:31 2017] (I) Loading model /home/liling/toymodel/model.iter3000.npz onto gpu 2
[Wed July 12 15:54:31 2017] (I) Loading model /home/liling//toymodel/model.iter3000.npz onto gpu 3
[Wed July 12 15:54:35 2017] (I) Reading from stdin
[Wed July 12 15:54:35 2017] (I) Setting CPU thread count to 0
[Wed July 12 15:54:35 2017] (I) Setting GPU thread count to 1
[Wed July 12 15:54:35 2017] (I) Total number of threads: 4
[Wed July 12 15:54:35 2017] (I) Reading input
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  cudaFree in free: an illegal memory access was encountered
Aborted (core dumped)

My amun build comes from 5454c11a471ec9e258a81b8e5ac4d5bb01816738

$ git log
commit 5454c11a471ec9e258a81b8e5ac4d5bb01816738
Author: Tomasz Dwojak <t.dwojak@amu.edu.pl>
Date:   Tue May 30 10:14:36 2017 +0000

    Remove redundant God in Scorer::Decode()
hieuhoang commented 7 years ago

Git pull

Ps. Your toy model has no toys


alvations commented 7 years ago

Whoops, my bad. Pulling and re-making to try again.

See new comment.

alvations commented 7 years ago

Building from 54e92cf57a40f6e053388d491005dd264ee58e54

$ git log
commit 54e92cf57a40f6e053388d491005dd264ee58e54
Author: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Date:   Tue Jul 11 18:55:23 2017 +0000

    reverted to older spdlog

The model (new link): https://drive.google.com/file/d/0Bzz3wLacJ7WoR0JSS3ZOeXc3T3M/view?usp=sharing

and the validation data valid.src: https://drive.google.com/file/d/0Bzz3wLacJ7WoNHA4TnNQU0VQdnM/view?usp=sharing

A similar error occurs on the command:

$ cat valid.src | \
> $AMUN -c toymodel/model.npz.amun.yml -m toymodel/model.iter1000.npz -b 12 -n --mini-batch 10 --maxi-batch 1000

[out]:

$ cat toydata/valid.src | \
> $AMUN -c toymodel/model.npz.amun.yml -m toymodel/model.iter1000.npz -b 12 -n --mini-batch 10 --maxi-batch 1000
[Wed July 12 16:42:11 2017] (I) Options: allow-unk: false
beam-size: 12
cpu-threads: 0
devices: [0, 1, 2, 3]
gpu-threads: 1
log-info: true
log-progress: true
max-length: 500
maxi-batch: 1000
mini-batch: 10
n-best: false
no-debpe: false
normalize: true
relative-paths: false
return-alignment: false
scorers:
  F0:
    path: toymodel/model.iter1000.npz
    type: Nematus
show-weights: false
softmax-filter:
  []
source-vocab: /home/liling/toymodel//vocab.src.yml
target-vocab: /home/liling/toymodel//vocab.trg.yml
weights:
  F0: 1
wipo: false

[Wed July 12 16:42:11 2017] (I) Loading scorers...
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 0
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 1
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 2
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 3
[Wed July 12 16:42:15 2017] (I) Reading from stdin
[Wed July 12 16:42:15 2017] (I) Setting CPU thread count to 
[Wed July 12 16:42:15 2017] (I) Setting GPU thread count to 
[Wed July 12 16:42:15 2017] (I) Total number of threads: 
[Wed July 12 16:42:15 2017] (I) Reading input
ERROR: an illegal memory access was encountered in /home/liling/marian/src/gpu/mblas/nth_element.cu at line 331
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  cudaFree in free: an illegal memory access was encounteredterminate called recursively

After re-running the amun command several times, it sometimes throws a different error:


[Wed July 12 16:42:11 2017] (I) Loading scorers...
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 0
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 1
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 2
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 3
[Wed July 12 16:42:15 2017] (I) Reading from stdin
[Wed July 12 16:42:15 2017] (I) Setting CPU thread count to 
[Wed July 12 16:42:15 2017] (I) Setting GPU thread count to 
[Wed July 12 16:42:15 2017] (I) Total number of threads: 
[Wed July 12 16:42:15 2017] (I) Reading input
ERROR: an illegal memory access was encountered in /home/liling/marian/src/gpu/mblas/nth_element.cu at line 293
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  cudaFree in free: an illegal memory access was encounteredterminate called recursively

Sometimes there is no indication of the line number, just termination:

[Wed July 12 16:42:11 2017] (I) Loading scorers...
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 0
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 1
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 2
[Wed July 12 16:42:11 2017] (I) Loading model toymodel/model.iter1000.npz onto gpu 3
[Wed July 12 16:42:15 2017] (I) Reading from stdin
[Wed July 12 16:42:15 2017] (I) Setting CPU thread count to 
[Wed July 12 16:42:15 2017] (I) Setting GPU thread count to 
[Wed July 12 16:42:15 2017] (I) Total number of threads: 
[Wed July 12 16:42:15 2017] (I) Reading input
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  cudaFree in free: an illegal memory access was encounteredterminate called recursively
hieuhoang commented 7 years ago

it runs for me with the master branch, on 2 gpus. And it gives the same error as you on the un-updated branch (3d6_premerge).

This is the command I run, similar to your command:

cat valid.src | $MARIAN/build/amun -c toymodel/model.npz.amun.yml -m toymodel/model.iter1000.npz -b 12 -n --mini-batch 10 --maxi-batch 1000 --devices 0 2

Make absolutely sure you are using up-to-date code:

cd build
rm -rf *
cmake ..
make -j

hieuhoang commented 7 years ago

also try using 1 gpu, then 2, then 3, and let me know if it works

alvations commented 7 years ago

Thanks @hieuhoang !!

I didn't build properly previously; force-removing the build directory and re-compiling + re-making worked =)

I've tried all parameters, it worked for all --devices settings.

hieuhoang commented 7 years ago

cheers. I'm gonna make a reg test with your model so amun doesn't fall out of bed again. Howl if you object

alvations commented 7 years ago

No objections =)

Thank you again!!


mkaeshammer commented 7 years ago

Finally found the time to test as well. No error anymore. Great work - thank you very much!!