Half factorization - Githubissues

ginkgo-project / ginkgo

Numerical linear algebra software package

https://ginkgo-project.github.io/

BSD 3-Clause "New" or "Revised" License

414 stars 88 forks source link

Half factorization #1712

Open yhmtsai opened 1 month ago

yhmtsai commented 1 month ago

this pr adds the factorization with half support.

Hip does not support atomic on the 16bits type currently

NVHPC 23.3 seems to handle assignment index with optimization wrongly on a custom class when IndexType is long. We set the index explicitly with volatile to solve it. NVHPC24.1 seem to fixed this issue. https://godbolt.org/z/srYhGndKn

TODO:

[x] add the fix of tri solve with half

upsj commented 1 week ago

For HIP 16 bit atomics, as long as you only use load and store, you could implement them as

a 32 bit load_* plus a memcpy and,
a 32 bit load_* plus an atomic CAS, similar to how we did atomicAdd in the past. For safety, we need to execute this twice, assuming that every memory location only ever gets written once (which is true for all algorithms that use atomics), because your write can either fail because the upper half changed, or the lower half changed, one of which belongs to you and can't change without your knowledge.

yhmtsai commented 1 week ago

using 32 bit memory operation for 16 bit, it will have illegal memory access in the tail or head if we do not handle it in a upper level.

upsj commented 1 week ago

Theoretically that would be an easy fix: Make sure all allocations are ~~at least 32 bits and~~ rounded up to multiples of 4. But I believe most allocators already silently fulfill that assumption, and GPUs are unlikely to have 16 bit allocation boundaries for alignment purposes.

yhmtsai commented 1 week ago

I do not like slight guarantee unless we have a way to ensure or at least check. However, I would suggest we do not consider it for this pr and release such that we have enough time ensure that it works correctly on hip.

upsj commented 1 week ago

I can give you a somewhat technical justification for this: cudaMalloc returns correctly aligned memory for thrust::complex<double>, despite not knowing anything about the type. So that means that the allocator is not using any space between those 16 byte-aligned allocations. Whether this is special-cased for allocations divisible by 16 or not I'm not sure (I would assume not, since people also allocate memory pools themselves), but again, we have an easy fix, which I would honestly consider useful in any case: round up the sizes raw_allow uses to at least be divisible by 4.

yhmtsai commented 1 week ago

I know the idea, sometimes it is necessary for optimized half precision by packing them (so, we will have kind of natively 32 bit by enforcing packing structure requirement) I will still say it is not easy and confident to say it will be correct in this short period. For example, user allocates some memory with 16 bits type but only pass the odd number to array_view. should we accept or throw the error? Of course, these memory operation will not change the value out of the actual array, but it is still illegal memory operation.