FastSortFusedNew Function Hangs on Compute Capability 7.5 (Q6000) but Runs Fine on Compute Capability 8.6 (3090/A4500)

I have converted a C++ codebase into a PyTorch extension, and it runs correctly on GPUs with Compute Capability 8.6 (RTX 3090 and A4500). However, on a Quadro RTX 6000 (Compute Capability 7.5), the FastSortFusedNew function hangs: it stalls as soon as it is first entered.

Details:

- PyTorch Version: 2.1.0
- CUDA Version: 11.8
- Operating System (working): Ubuntu 20.04
- Operating System (failing): Ubuntu 18.04 or 22.04

I suspect the issue is not related to the OS version, since I encountered the same problem on both Ubuntu 18.04 and 22.04. The same codebase runs without issues on GPUs with Compute Capability 8.6.
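One thing that may be worth ruling out when a build behaves differently on 7.5 and 8.6 is whether the extension was compiled for the older architecture at all. The sketch below is only a diagnostic aid, not a diagnosis of the hang: it checks whether a `TORCH_CUDA_ARCH_LIST`-style string covers a given device capability. The helper names `parse_arch_list` and `covers` are hypothetical. It also assumes that the value seen at run time matches the one used when the extension was built, which may not be the case.

```python
import os


def parse_arch_list(arch_list):
    """Parse a TORCH_CUDA_ARCH_LIST-style string such as "7.5;8.6+PTX".

    Returns a list of ((major, minor), has_ptx) tuples. Named entries
    (for example "Turing") are skipped in this sketch.
    """
    entries = []
    for token in arch_list.replace(";", " ").split():
        has_ptx = token.upper().endswith("+PTX")
        version = token[:-4] if has_ptx else token
        parts = version.split(".")
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue  # not a numeric "major.minor" entry
        entries.append(((int(parts[0]), int(parts[1])), has_ptx))
    return entries


def covers(arch_list, capability):
    """Return True if some entry can run on a device with this capability.

    Binary code runs on devices with the same major version and an equal
    or higher minor version; embedded PTX can be JIT-compiled for any
    equal or newer capability.
    """
    for arch, has_ptx in parse_arch_list(arch_list):
        binary_ok = arch[0] == capability[0] and arch[1] <= capability[1]
        ptx_ok = has_ptx and arch <= capability
        if binary_ok or ptx_ok:
            return True
    return False


if __name__ == "__main__":
    arch_list = os.environ.get("TORCH_CUDA_ARCH_LIST", "")
    print("TORCH_CUDA_ARCH_LIST:", arch_list or "(unset)")
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        print("Device:", torch.cuda.get_device_name(0), "capability:", cap)
        print("CUDA version used by PyTorch:", torch.version.cuda)
        if arch_list:
            print("Arch list covers this device:", covers(arch_list, cap))
```

For example, `covers("8.6", (7, 5))` is false, while `covers("7.5;8.6", (7, 5))` is true. If the architecture is covered and the hang persists, running with the environment variable `CUDA_LAUNCH_BLOCKING=1` may help narrow down which kernel launch stalls.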

Has anyone experienced a similar issue, or does anyone have insights into why this might be happening?

lfranke / TRIPS, issue #55