ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License

CUDA error with batched solvers #1576

Closed: tpadioleau closed this issue 6 months ago

tpadioleau commented 8 months ago

When using the Ginkgo batched solvers, I sometimes get the following error:

terminate called after throwing an instance of 'gko::CudaError'
  what():  /tmp/soft/spack-stage/spack-stage-ginkgo-1.7.0-a6wuqlgaxa2t5vkyzdapmgicl73ujdjd/spack-src/cuda/base/executor.cpp:203: synchronize: cudaErrorMisalignedAddress: misaligned address
Abandon

This error appears with the Ginkgo 1.7.0 release on a V100 GPU with CUDA 12.2.1.

You can find a minimal reproducer in this repository: https://github.com/Geoflow/gko_minreproducer/tree/minimal-reproducer. In my case, I got the error with a batch size of 1. Am I doing something wrong?

cc @Geoflow

pratikvn commented 8 months ago

@tpadioleau, thank you for the report. This was indeed an issue with the way we were logging the iteration counts. It should now be fixed in #1578. Let me know if that branch does not work for you.
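
For readers unfamiliar with `cudaErrorMisalignedAddress`: CUDA devices require each load and store to be naturally aligned for the accessed type, so reading an 8-byte `double` from an address that is only 4-byte aligned raises this error. The sketch below is a generic, host-only illustration of how such an address can arise when values of different sizes share one raw buffer, and of how padding the offset avoids it. It is not the Ginkgo logging code, and the thread does not describe the actual layout involved; the helper names `is_aligned_for`, `align_up`, `packed_offset`, and `padded_offset` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `ptr` is suitably aligned for an object of type T.
template <typename T>
bool is_aligned_for(const void* ptr)
{
    return reinterpret_cast<std::uintptr_t>(ptr) % alignof(T) == 0;
}

// Hypothetical helper: round `offset` up to the next multiple of `alignment`.
constexpr std::size_t align_up(std::size_t offset, std::size_t alignment)
{
    return (offset + alignment - 1) / alignment * alignment;
}

// Byte offset of a double array placed directly after `num_ints` ints in a
// shared buffer, with no padding. With 4-byte ints and an odd `num_ints`,
// this offset is not a multiple of 8.
constexpr std::size_t packed_offset(std::size_t num_ints)
{
    return num_ints * sizeof(int);
}

// Same layout, but the offset is padded so the doubles start on a boundary
// that satisfies alignof(double).
constexpr std::size_t padded_offset(std::size_t num_ints)
{
    return align_up(packed_offset(num_ints), alignof(double));
}
```

With a buffer that is itself aligned for `double`, `buffer + padded_offset(n)` always passes `is_aligned_for<double>`, whereas `buffer + packed_offset(n)` may not, for example when a single 4-byte integer precedes the doubles.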

tpadioleau commented 8 months ago

thank you very much for the quick fix!

tpadioleau commented 8 months ago

I have 2 more questions:

  1. In the current 1.7.0 version, is there a workaround, such as changing the precision or the integer type?
  2. Do you plan to release a patch for version 1.7, or will the fix only ship with 1.8?

pratikvn commented 8 months ago

This will definitely be fixed in 1.8.0. One question from my side: would it be possible for you to depend on develop rather than on a specific version? That would give you access to the latest features and bug fixes.

In any case, we plan to release 1.8.0 soon, probably in the next few weeks, so I don't think we will patch 1.7 unless it is a strong requirement from your side.

tpadioleau commented 8 months ago

I think it should be OK for us to wait for 1.8.0. I would prefer to avoid the develop branch, in case users notice different behaviour in their simulation results.