Open cdoersch opened 4 months ago
Any news on this? I don't know if we have visibility into the upstream ticket. Can you share more about the performance problem? Is it about accuracy or speed?
Same speed, catastrophic collapse in accuracy. If you look at the results, the failure will be obvious.
We suspect that CUDA is reading/writing memory that doesn't belong to the tensor it's supposed to be reading/writing, leading to garbage in the network.
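For anyone who wants to check whether they're hitting the same thing, here is a rough sketch of a sanity check: compute a small convolution with a slow reference implementation and compare it element by element against what the GPU returns. This is not how we traced the bug. `conv2d_valid` and `compare_outputs` are hypothetical helper names, the tolerance is an arbitrary assumption, and the "candidate" below is a simulated stand-in for a real cuDNN-backed result.

```python
import math
import random


def conv2d_valid(image, kernel):
    """Naive 2D cross-correlation ('valid' padding) on nested lists of floats."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def compare_outputs(reference, candidate, atol=1e-4):
    """Return (ok, max_abs_err, num_nonfinite) for two same-shaped 2D outputs."""
    max_err = 0.0
    nonfinite = 0
    for ref_row, cand_row in zip(reference, candidate):
        for r, c in zip(ref_row, cand_row):
            if not math.isfinite(c):
                nonfinite += 1  # NaN/Inf values are counted separately
            else:
                max_err = max(max_err, abs(r - c))
    return (nonfinite == 0 and max_err <= atol), max_err, nonfinite


random.seed(0)
image = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(8)]
kernel = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
reference = conv2d_valid(image, kernel)

# Hypothetical stand-in for the GPU result: in practice this would be the
# cuDNN-backed output copied back to the host, e.g. via `.tolist()`.
candidate = [row[:] for row in reference]
candidate[2][3] = 1e6  # simulate a single garbage value

ok, max_err, nonfinite = compare_outputs(reference, candidate)
print(ok, nonfinite)
```

If the kernel really is touching memory outside the tensor, the mismatch should show up as a few wildly wrong or non-finite entries rather than small rounding differences, which is why the check reports the maximum error and the non-finite count separately.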
Is it related to a specific cudnn9 version?
Internally, running with cudnn9 results in poor TAPIR performance. It's unclear whether anyone external has encountered the same issue. Our teams have traced the issue to a broken cudnn9 convolution kernel. This is being tracked in the following bug at NVIDIA:
https://partners.nvidia.com/bug/viewbug/4705291