ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

CUDA 11.8 + `bifrost.fir` = error #201

Closed jaycedowell closed 1 year ago

jaycedowell commented 1 year ago

I was testing under CUDA 11.8 and kept running into the following error:

fir.cu:248 cudaGetErrorString(cuda_ret) = too many resources requested for launch
fir.cu:248 Condition failed: cuda_ret == cudaSuccess
fir.cu:248 error 99: BF_STATUS_INTERNAL_ERROR
fir.cu:413 error 99: BF_STATUS_INTERNAL_ERROR

This error disappears under CUDA 11.6 and everything acts the way it should.

telegraphic commented 1 year ago

Hey Jayce,

Could be related to this tale from Matt: "Long story short, this was failing in the hashpipe rawspec thread and the solution was we had to downgrade from cuda 11.8 to 11.7 - thanks to Luigi (who happened to also be traveling here at GB) for pointing out that 11.8 has bugs that would cause CUFFT failures. Once Dave downgraded, we were recording data again."

telegraphic commented 1 year ago

Not sure what the "bugs" actually are, but CUDA 11.8 is forsaken...

jaycedowell commented 1 year ago

Interesting.

I was reading some of the CUDA documentation and there is mention of using `__launch_bounds__()` to steer the compiler away from over-allocating registers per thread. I don't know if that will help in this case, but it might be worth looking into at some point.
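For reference, a minimal sketch of what that qualifier looks like. This is not the Bifrost FIR kernel: `scale_kernel`, `BLOCK_SIZE`, and the value 256 are hypothetical stand-ins, since the block size `fir.cu` launches with is not shown in this thread. The idea is that telling the compiler the largest block size the kernel will ever see lets it cap per-thread register use so a block of that size can still be scheduled.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical block size, chosen only for illustration.
constexpr int BLOCK_SIZE = 256;

// __launch_bounds__(N) promises that this kernel is never launched with more
// than N threads per block, so the compiler limits registers per thread to
// keep a block of N threads launchable.
__global__ void __launch_bounds__(BLOCK_SIZE)
scale_kernel(const float* in, float* out, float gain, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = gain * in[i];
    }
}

int main() {
    const int n = 1 << 20;
    float* in = nullptr;
    float* out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    // Report what the compiler actually settled on for this kernel.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, scale_kernel);
    printf("registers per thread: %d, max threads per block: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    scale_kernel<<<blocks, BLOCK_SIZE>>>(in, out, 2.0f, n);

    // A launch that exceeds the available resources is reported here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
    }
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Comparing `attr.numRegs` for the same kernel built under 11.6 and 11.8 would be one way to see whether register pressure is what changed. Note that launching with more threads per block than the declared bound makes the launch fail, so the bound has to match the real launch configuration.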

jaycedowell commented 1 year ago

Since we are working on CUDA 12 support, trying to fight with 11.8 isn't worth it. If this pops up again then maybe we add a check to configure that throws an error if 11.8 is found.
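If it comes to that, a compile-time guard is one possible shape for the check. The sketch below is an assumption, not something that exists in Bifrost, and it runs at build time rather than in configure itself. It relies only on `CUDART_VERSION`, which the toolkit headers define as major * 1000 + minor * 10, so 11.8 is 11080.

```cuda
#include <cuda_runtime_api.h>  // defines CUDART_VERSION

// Hypothetical guard: refuse to build against CUDA 11.8, where the FIR
// kernel launch fails with "too many resources requested for launch".
#if CUDART_VERSION == 11080
#error "CUDA 11.8 is not supported: bifrost.fir fails to launch; the error does not occur under 11.6"
#endif
```

A configure-time test could compile a snippet like this one and abort early, before any of the real sources are built.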