ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License
414 stars 88 forks source link

Half factorization #1712

Open yhmtsai opened 1 month ago

yhmtsai commented 1 month ago

this pr adds the factorization with half support.

Hip does not support atomic on the 16bits type currently

NVHPC 23.3 seems to handle assignment index with optimization wrongly on a custom class when IndexType is long. We set the index explicitly with volatile to solve it. NVHPC24.1 seem to fixed this issue. https://godbolt.org/z/srYhGndKn

TODO:

upsj commented 1 week ago

For HIP 16 bit atomics, as long as you only use load and store, you could implement them as

yhmtsai commented 1 week ago

using 32 bit memory operation for 16 bit, it will have illegal memory access in the tail or head if we do not handle it in a upper level.

upsj commented 1 week ago

Theoretically that would be an easy fix: Make sure all allocations are at least 32 bits and rounded up to multiples of 4. But I believe most allocators already silently fulfill that assumption, and GPUs are unlikely to have 16 bit allocation boundaries for alignment purposes.

yhmtsai commented 1 week ago

I do not like slight guarantee unless we have a way to ensure or at least check. However, I would suggest we do not consider it for this pr and release such that we have enough time ensure that it works correctly on hip.

upsj commented 1 week ago

I can give you a somewhat technical justification for this: cudaMalloc returns correctly aligned memory for thrust::complex<double>, despite not knowing anything about the type. So that means that the allocator is not using any space between those 16 byte-aligned allocations. Whether this is special-cased for allocations divisible by 16 or not I'm not sure (I would assume not, since people also allocate memory pools themselves), but again, we have an easy fix, which I would honestly consider useful in any case: round up the sizes raw_allow uses to at least be divisible by 4.

yhmtsai commented 1 week ago

I know the idea, sometimes it is necessary for optimized half precision by packing them (so, we will have kind of natively 32 bit by enforcing packing structure requirement) I will still say it is not easy and confident to say it will be correct in this short period. For example, user allocates some memory with 16 bits type but only pass the odd number to array_view. should we accept or throw the error? Of course, these memory operation will not change the value out of the actual array, but it is still illegal memory operation.