Closed: jimwu6 closed this issue 2 years ago
The error is raised here: https://github.com/NVIDIA/FasterTransformer/blob/dev/v5.0_beta/src/fastertransformer/utils/cuda_utils.h#L314. That is, it comes directly from a CUDA function call. I think the problem is, as you say, a driver mismatch. Although there is a driver inside the Docker image, you still need to install a matching driver on the host, outside Docker.
Closing this bug because it is inactive. Feel free to re-open this issue if you still have any problems.
I've run into a situation where I get this error.
To reproduce this, I first built the image as described in the README on the dev/v1.1_beta branch with Triton version 21.08. Then I run the container and go into it with
docker exec -it ft /bin/bash
Finally, I run the binary to start the server.
If I start the server very soon after the container is started (probably within about 5 s), I get this error. To work around this locally, I run the server binary again, and it works.
However, if I wait longer after the container starts (10 s seems to work reliably) before running the binary, the error does not appear.
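The retry-or-wait workaround described above could be scripted roughly as follows. This is only a sketch, not something from the FasterTransformer or Triton repositories: the function name run_with_retry, its parameters, and the placeholder binary path are all hypothetical, and it assumes the server exits with a non-zero code when it hits the startup error.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retry(
    cmd: Sequence[str],
    attempts: int = 3,
    delay_s: float = 10.0,
    runner: Callable[[Sequence[str]], int] = lambda c: subprocess.run(c).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits with 0 or the attempts run out.

    Returns the exit code of the last attempt. The runner and sleep
    arguments exist only so the logic can be exercised without
    launching a real process or actually waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code = 1
    for attempt in range(1, attempts + 1):
        code = runner(cmd)
        if code == 0:
            return 0
        # Wait before the next try; 10 s is the delay that seemed reliable above.
        if attempt < attempts:
            sleep(delay_s)
    return code


# Hypothetical usage (replace the path with the actual server binary and arguments):
# exit_code = run_with_retry(["/path/to/server_binary"], attempts=3, delay_s=10.0)
```

Note that a server which starts correctly keeps running, so the call blocks until the server exits. A later non-zero exit that has nothing to do with the startup error would also trigger a retry, so this only papers over the timing issue rather than explaining it.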
From searching online, this looks like some kind of NVIDIA driver mismatch (e.g. CUDA), but I find that odd because the drivers should be contained in the Docker image anyway.
I can reproduce this on A100s and T4s as well.