Open LauritzBrandt19116 opened 7 months ago
Same problem here: non-factored models work, but factored models (both source and target factors) fail with the same error. Our configuration is the newest marian-dev and
NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4
I have the same issue. Given you have been waiting for 3 weeks with no response from developers, I think it is fair to assume that Marian is not being supported anymore.
@kpu @snukky were you able to look into this already?
I don't have commit access. If @mjpost wants to claim Marian is still maintained (https://x.com/mjpost/status/1799130562344656901), he should address this issue.
@hieuhoang is still fixing bugs in Moses!
Bug description
Marian 1.12 (65bf82ffce52f4854295d8b98482534f176d494e) runs into this error for target factored data:
How to reproduce
Run Marian 1.12 compiled against CUDA 11+ with target factors.
I am trying to train Marian models from scratch using factored data. Training succeeds with source factors, but training with target factors, or with both source and target factors, fails the CUBLAS check.
I compile 65bf82ffce52f4854295d8b98482534f176d494e in a Docker container and have tried this with a set of CUDA, NVIDIA and Marian versions on Ubuntu 22.04 and 18.04. Variants that were tried:
Context
Marian output
marian version (in the docker environment)
nvidia-smi output
host system 1
host system 2
failing marian 1.12 cuda 12.3 docker container on host 1
working marian 1.11 cuda 10.2 docker container on host 1
failing marian 1.12 cuda 12.3 docker container on host 2
working marian 1.11 cuda 10.2 docker container on host 2
I notice that the CUDA version nvidia-smi reports seems to be whichever is higher, the host system's or the Docker container's, but all containers have been built to run the CUDA version packaged with them.
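The `CUDA Version` field in the `nvidia-smi` header typically reports the highest CUDA version the driver supports, not the toolkit a binary was compiled against, so it can differ from what `nvcc --version` prints inside the container. To compare the two side by side, a minimal Python sketch follows. It assumes the usual `nvidia-smi` header layout (as in the line quoted in the comment above) and the `release X.Y` string in the `nvcc --version` output. The function names are hypothetical and not part of Marian or any NVIDIA tool.

```python
import re


def parse_nvidia_smi_header(text):
    """Return driver and CUDA versions found in nvidia-smi header text.

    Hypothetical helper; assumes the 'Driver Version: X  CUDA Version: Y' layout.
    Returns None when the pattern is not found.
    """
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    if match is None:
        return None
    return {"driver": match.group(1), "cuda": match.group(2)}


def parse_nvcc_release(text):
    """Return the toolkit release (e.g. '12.3') from `nvcc --version` output.

    Hypothetical helper; assumes the output contains 'release X.Y'.
    Returns None when the pattern is not found.
    """
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None
```

Feeding the output of `nvidia-smi` and `nvcc --version`, both run inside the same container, into these two functions gives the driver-side and toolkit-side versions for each of the variants listed above.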