marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

Error: Curand error 203 occurred after loading the vc list #292

Closed: wang357911 closed this issue 4 years ago

wang357911 commented 5 years ago

I am using a custom Boost 1.58 and CUDA 9.2 on Ubuntu 16.04, set up exactly as described in the documentation, but when I train a basic model on my own dataset, the following error occurs:

Error: Curand error 203 - /home/anylangtech/.userdata/bowang/code/marian/src/tensors/rand.cpp:75: curandCreateGenerator(&generator_, CURAND_RNG_PSEUDO_DEFAULT)
[2019-09-20 18:17:47] Error: Aborted from marian::CurandRandomGenerator::CurandRandomGenerator(size_t, marian::DeviceId) in /home/anylangtech/.userdata/bowang/code/marian/src/tensors/rand.cpp:75

[CALL STACK]
[0x954201]
[0x954cb8]
[0x9532b4]
[0x952a2c]
[0x5c68eb]
[0x4e2ec7]
[0x4e34fb]
[0x4f888c]
[0x42e214]
[0x40c0da]
[0x7fa036cb7830] __libc_start_main + 0xf0
[0x42b7f9]

Aborted
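
For anyone decoding the number in the log: in cuRAND's curandStatus_t enum, 203 is CURAND_STATUS_INITIALIZATION_FAILED, which fits the reports further down that the process could not use the GPU or driver. Below is a minimal sketch that pulls the code out of a log line like the one above and names it. The helper name decode_curand_error is hypothetical and not part of Marian; the table is copied from the enum values in curand.h.

```python
import re

# Status codes from cuRAND's curandStatus_t enum (curand.h).
CURAND_STATUS = {
    0: "CURAND_STATUS_SUCCESS",
    100: "CURAND_STATUS_VERSION_MISMATCH",
    101: "CURAND_STATUS_NOT_INITIALIZED",
    102: "CURAND_STATUS_ALLOCATION_FAILED",
    103: "CURAND_STATUS_TYPE_ERROR",
    104: "CURAND_STATUS_OUT_OF_RANGE",
    105: "CURAND_STATUS_LENGTH_NOT_MULTIPLE",
    106: "CURAND_STATUS_DOUBLE_PRECISION_REQUIRED",
    201: "CURAND_STATUS_LAUNCH_FAILURE",
    202: "CURAND_STATUS_PREEXISTING_FAILURE",
    203: "CURAND_STATUS_INITIALIZATION_FAILED",
    204: "CURAND_STATUS_ARCH_MISMATCH",
    999: "CURAND_STATUS_INTERNAL_ERROR",
}


def decode_curand_error(log_line):
    """Hypothetical helper: name the cuRAND status found in a Marian log line.

    Returns None when the line contains no "Curand error <N>" fragment.
    """
    match = re.search(r"Curand error (\d+)", log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return CURAND_STATUS.get(code, "unknown curand status %d" % code)
```

Calling decode_curand_error on the first line of the log above should point at an initialization failure rather than, say, an allocation or architecture problem.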

I found someone saying this is because I didn't initialize curandCreateGenerator, but that didn't work.

johnjcamilleri commented 4 years ago

Not sure if related, but I was getting this error because I was running Marian inside a container with the docker run command rather than nvidia-docker run. Putting here in case anyone else comes across this.

hieuhoang commented 4 years ago

I had the same problem and fixed it by updating the NVIDIA driver (430 to 440). The same thing happened a few years ago: https://github.com/marian-nmt/marian-dev/issues/360

wangxw1023 commented 4 years ago

Quoting johnjcamilleri: "Not sure if related, but I was getting this error because I was running Marian inside a container with the docker run command rather than nvidia-docker run. Putting here in case anyone else comes across this."

Hi, I have the same problem. May I ask which Docker version you are using? Before version 19.03 we needed nvidia-docker, but from that version on we can just use docker run. We still hit this problem.

srdecny commented 3 years ago

For what it's worth, I just encountered this issue with Docker 20.10.6 and Nvidia 460 drivers, when attempting to run Marian as a service with docker-compose, using the new deploy: resources: reservations: devices: driver: nvidia syntax in the compose file. I can't downgrade to 450 drivers, because I am running an RTX 3060 and the "bleeding-edge" 465 drivers from Nvidia's site result in the same error, too.

Changing the docker-compose definition back to the old style of runtime: nvidia seemed to work (with 465 drivers).

lefterav commented 3 years ago

Hi, I have the same problem on an RTX A6000 with 465 drivers. But I am not using docker-compose, just a Dockerfile, where I define FROM nvidia/cuda:11.3.0-devel-ubuntu20.04. @srdecny Could you elaborate on what you mean by the old style, and whether it applies to the Dockerfile?

srdecny commented 3 years ago

@lefterav Sorry, I was a bit unclear. I was talking about how to assign the GPU to a service in the docker-compose specification.

The legacy "old style" is this one (simply specifying runtime: nvidia): https://docs.docker.com/compose/gpu-support/#use-of-service-runtime-property-from-compose-v23-format-legacy

The new style (the deploy: resources: (...) which did not work for me): https://docs.docker.com/compose/gpu-support/#enabling-gpu-access-to-service-containers

If you're not using docker-compose, adding the --gpus all flag to docker run should probably do the trick.
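
One quick way to tell whether a container was actually given a GPU is to look for the NVIDIA device nodes from inside it. The following is a minimal sketch, assuming a Linux container in which the NVIDIA runtime exposes each GPU as /dev/nvidia0, /dev/nvidia1, and so on. The function name visible_gpu_nodes is hypothetical, and this has not been checked against enroot or other non-Docker runtimes.

```python
import glob
import os
import re


def visible_gpu_nodes(dev_dir="/dev"):
    """Hypothetical helper: list per-GPU device nodes (nvidia0, nvidia1, ...).

    Control nodes such as nvidiactl or nvidia-uvm are ignored, because they
    do not by themselves show that a GPU was attached.
    """
    candidates = glob.glob(os.path.join(dev_dir, "nvidia*"))
    return sorted(
        path
        for path in candidates
        if re.fullmatch(r"nvidia\d+", os.path.basename(path))
    )
```

If visible_gpu_nodes() returns an empty list inside the container, the container was most likely started without GPU access, and Marian's call to curandCreateGenerator can be expected to fail.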

lefterav commented 3 years ago

Hm, it is even more complicated: I am using a SLURM cluster with enroot sqfs containers, with no Docker involved. So it is not obvious how this workaround would apply.

hashemsellat commented 3 years ago

I'm facing this error when building the Docker image. I'm using NVIDIA driver version 455.32.00. I tried nvidia-docker, but it didn't work.

hashemsellat commented 3 years ago

It sounds like Docker doesn't detect the driver while building, even when using nvidia-docker.
I solved the issue by using the ENTRYPOINT Dockerfile instruction instead of RUN and then running the image with the command:
docker run -it --rm --gpus all myimage

tenebrius commented 1 year ago

I got this error when launching an instance and immediately running my Marian Docker container. It turns out I had to wait about 10 seconds for the drivers to load fully; then I could run the container with no problem.
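
Rather than sleeping for a fixed 10 seconds, a start-up script could poll until the driver responds. This is a minimal sketch, assuming nvidia-smi is on the PATH of the host and that a zero exit status means the driver is ready. The function names, the 60-second timeout, and the 2-second interval are all illustrative choices, not something taken from Marian or Docker.

```python
import subprocess
import time


def nvidia_smi_ok():
    """Return True if `nvidia-smi` can be started and exits with status 0."""
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # nvidia-smi is missing or cannot be executed.
        return False
    return result.returncode == 0


def wait_for_gpu(probe=nvidia_smi_ok, timeout=60.0, interval=2.0, sleep=time.sleep):
    """Poll `probe` until it succeeds or `timeout` seconds have passed.

    Returns True as soon as the probe succeeds, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A launcher would call wait_for_gpu() first and start the Marian container only if it returns True. The probe and sleep parameters exist so the loop can be exercised without a GPU.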

wang357911 commented 1 year ago

Your email has been received, thank you.

wyfdgg commented 4 months ago

Quoting johnjcamilleri: "Not sure if related, but I was getting this error because I was running Marian inside a container with the docker run command rather than nvidia-docker run. Putting here in case anyone else comes across this."

Quoting wangxw1023: "Hi, I meet the same problems with you, may I know what's your docker version? Since before version 19.03, we need to use nvidia-docker, but after this version, we can just use docker run. we still meet this problem."

I ran the same Marian Docker image with the 'docker run' command on two different servers, both with Docker version 19.03. One server works properly, but the other gets the error.

wang357911 commented 4 months ago

Your email has been received, thank you.