Fixes to work with CUDA 12 toolkit

I plan to commit this soon, because these bugs have caused users a lot of problems since the CUDA 12 toolkit came out. Issues:

They seem to have changed the implementation of cusparse_csr2csc so that it requires the array of column indexes to be aligned on a considerably larger boundary than the 4 bytes (the size of int) that the documentation claims.
They seem to have introduced some optimizations so that, even on older (pre-Volta) hardware, threads within a warp are no longer implicitly synchronized, so we need to call __syncwarp() explicitly.
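
To illustrate the alignment issue in the first item, here is a minimal sketch of how one might check a pointer's alignment and round an offset up to a stricter boundary. IsAligned and AlignUp are hypothetical helpers, not part of Kaldi or cuSPARSE, and the boundary cusparse_csr2csc actually needs is not stated here, so the alignment is left as a parameter.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a Kaldi or cuSPARSE function): returns true if
// `ptr` lies on an `alignment`-byte boundary.
// Assumption: `alignment` is a power of two.
inline bool IsAligned(const void *ptr, std::size_t alignment) {
  return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Hypothetical helper: rounds a byte offset up to the next multiple of
// `alignment`, e.g. when carving an index array out of a larger allocation.
// Assumption: `alignment` is a power of two.
inline std::size_t AlignUp(std::size_t offset, std::size_t alignment) {
  return (offset + alignment - 1) & ~(alignment - 1);
}
```

A pointer that is only int-aligned (4 bytes) will generally fail IsAligned for a larger boundary. Rounding the array's starting offset with AlignUp before taking the pointer is one way to avoid that, assuming the base allocation itself is sufficiently aligned.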
