google-deepmind / tapnet

Tracking Any Point (TAP)
https://deepmind-tapir.github.io/blogpost.html
Apache License 2.0

TAPIR performance degradation with cudnn9 #99

Open cdoersch opened 4 months ago

cdoersch commented 4 months ago

Internally, running TAPIR with cudnn9 results in poor performance. It's unclear whether anyone external has encountered the same issue. Our teams have traced the problem to a broken cudnn9 convolution kernel, which is being tracked in the following NVIDIA bug:

https://partners.nvidia.com/bug/viewbug/4705291

bhack commented 4 months ago

Any news on this? I don't know whether we have visibility into this upstream ticket. Can you share more about the performance problem? Is it about accuracy or speed?

cdoersch commented 4 months ago

Same speed, catastrophic collapse in accuracy. If you look at the results, the failure will be obvious.

We suspect that CUDA is reading or writing memory that doesn't belong to the tensor it's supposed to be operating on, which puts garbage into the network.
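
For anyone who wants a quick sanity check on their own setup, one rough approach is to run the same convolution on the CPU and on the default JAX backend and compare the outputs. The sketch below is only an illustration: the function name `conv_mismatch`, the tensor shapes, and the seed are made up for this example, and there is no guarantee that these shapes hit the affected kernel, so a small difference here does not prove a setup is unaffected.

```python
import jax
import numpy as np


def conv_mismatch(seed=0):
    """Max abs difference between a CPU conv and a default-backend conv.

    Hypothetical helper: the shapes are arbitrary and are not taken from TAPIR.
    """
    key_x, key_w = jax.random.split(jax.random.PRNGKey(seed))
    x = jax.random.normal(key_x, (1, 8, 32, 32))  # input, NCHW layout
    w = jax.random.normal(key_w, (16, 8, 3, 3))   # kernel, OIHW layout

    def conv(lhs, rhs):
        # Highest precision keeps TF32 rounding from dominating the comparison.
        return jax.lax.conv(lhs, rhs, (1, 1), "SAME",
                            precision=jax.lax.Precision.HIGHEST)

    # Reference result: inputs committed to the CPU, so the conv runs there.
    cpu = jax.devices("cpu")[0]
    ref = conv(jax.device_put(x, cpu), jax.device_put(w, cpu))

    # Result on the default backend (the GPU if one is available).
    out = conv(x, w)

    return float(np.max(np.abs(np.asarray(ref) - np.asarray(out))))


if __name__ == "__main__":
    print("default backend:", jax.default_backend())
    print("max abs difference vs CPU:", conv_mismatch())
```

On a healthy setup the difference should be at the level of floating-point noise. A large difference would point to a problem in the GPU convolution path, though not necessarily to this specific bug.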

bhack commented 4 months ago

Is it related to a specific cudnn9 version?
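
To help narrow this down, it may be useful for reporters to include the exact cuDNN version they have installed. The sketch below is one way to read it, assuming cuDNN was installed through pip as one of NVIDIA's `nvidia-cudnn-cu12` or `nvidia-cudnn-cu11` wheels. A system-wide cuDNN install will not show up here, and the helper names are hypothetical.

```python
from importlib import metadata

# Assumption: cuDNN comes from one of NVIDIA's pip wheels.
CUDNN_WHEELS = ("nvidia-cudnn-cu12", "nvidia-cudnn-cu11")


def installed_cudnn_wheels(names=CUDNN_WHEELS):
    """Return {wheel name: version string} for the cuDNN wheels that are installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # this wheel is not installed in the current environment
    return found


def major_version(version):
    """Extract the major version number from a dotted version string."""
    return int(version.split(".")[0])


if __name__ == "__main__":
    wheels = installed_cudnn_wheels()
    if not wheels:
        print("No pip-installed cuDNN wheel found.")
    for name, version in wheels.items():
        print(f"{name}: {version} (major version {major_version(version)})")
```

Posting the full version string, rather than just "cudnn9", would make it easier to tell whether the problem is limited to particular releases.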