Closed · whuLames closed this issue 1 month ago
Sorry for the late reply.

I think `segment_coo` can indeed be faster for larger group sizes, since it has a better parallelization scheme. At a high level, `segment_csr` parallelizes across groups, which may result in bad utilization in case the group sizes are imbalanced or large. On the other hand, `segment_coo` parallelizes across the number of input elements. It then performs a parallel reduction inside a single warp (32 threads) to accumulate the result, so that it only writes once via atomic ops.
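To make the two schemes concrete, here is a minimal pure-Python sketch of what each reduction computes (sum case) and how a CSR `indptr` maps to the sorted COO `index`. This only illustrates the semantics described above; the actual torch_scatter kernels are CUDA/C++ and parallelize as explained, not as these sequential loops.

```python
def segment_csr_sum(src, indptr):
    # One "worker" per group: group g covers src[indptr[g]:indptr[g+1]].
    # A per-group scheme serializes within each group, which is why
    # imbalanced or very large group sizes hurt utilization.
    return [sum(src[indptr[g]:indptr[g + 1]]) for g in range(len(indptr) - 1)]

def segment_coo_sum(src, index, num_groups):
    # One "worker" per input element: element i accumulates into
    # out[index[i]]. On the GPU, a warp first reduces its own elements
    # and then writes once via an atomic add.
    out = [0] * num_groups
    for i, v in zip(index, src):
        out[i] += v
    return out

def indptr_to_index(indptr):
    # CSR -> COO conversion: repeat each group id by its group size.
    return [g for g in range(len(indptr) - 1)
            for _ in range(indptr[g + 1] - indptr[g])]

src = [1, 2, 3, 4, 5, 6]
indptr = [0, 2, 5, 6]            # groups: [1, 2], [3, 4, 5], [6]
index = indptr_to_index(indptr)  # [0, 0, 1, 1, 1, 2]
assert segment_csr_sum(src, indptr) == segment_coo_sum(src, index, 3) == [3, 12, 6]
```

Both paths produce the same result; the difference discussed in this thread is purely about how the work is distributed across GPU threads.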
This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?
My recent application uses the `segment_csr` and `segment_coo` methods. I can convert data in CSR format to the input format required by `segment_coo` without a hitch, and in most cases `segment_coo` is faster than `segment_csr`.

At the macro level, I think the fact that `segment_coo` requires its input index to be ordered may be the reason why its performance is better than that of `segment_csr`. But could you go further and explain the performance difference between the two functions from a low-level implementation perspective (such as multi-threaded access conflicts, etc.)? And for a given `src` and `index`, is it possible to split them into multiple parts and call `segment_coo` on each part separately, to reduce thread collisions and possibly achieve an overall performance improvement?
Last question: there are different implementations of torch_scatter for different PyTorch versions. Do the newer versions have significant performance improvements?

Thank you, looking forward to your reply!