Closed · whuLames closed this issue 1 month ago
Sorry for the late reply.

I think `segment_coo` can indeed be faster for larger group sizes, since it has a better parallelization scheme. At a high level, `segment_csr` parallelizes across groups, which may result in bad utilization in case the group sizes are imbalanced or large. On the other hand, `segment_coo` parallelizes across the number of input elements. It then performs a parallel reduction inside a single warp (32 threads) to accumulate the result, so that it only writes once via atomic ops.
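To make the two schemes concrete, here is a minimal pure-Python sketch of what each reduction computes (sum case) and how a CSR `indptr` maps to the sorted COO `index`. This only illustrates the semantics described above; the actual torch_scatter kernels are CUDA/C++ and parallelize as explained, not as these sequential loops.

```python
def segment_csr_sum(src, indptr):
    # One "worker" per group: group g covers src[indptr[g]:indptr[g+1]].
    # A per-group scheme serializes within each group, which is why
    # imbalanced or very large group sizes hurt utilization.
    return [sum(src[indptr[g]:indptr[g + 1]]) for g in range(len(indptr) - 1)]

def segment_coo_sum(src, index, num_groups):
    # One "worker" per input element: element i accumulates into
    # out[index[i]]. On the GPU, a warp first reduces its own elements
    # and then writes once via an atomic add.
    out = [0] * num_groups
    for i, v in zip(index, src):
        out[i] += v
    return out

def indptr_to_index(indptr):
    # CSR -> COO conversion: repeat each group id by its group size.
    return [g for g in range(len(indptr) - 1)
            for _ in range(indptr[g + 1] - indptr[g])]

src = [1, 2, 3, 4, 5, 6]
indptr = [0, 2, 5, 6]            # groups: [1, 2], [3, 4, 5], [6]
index = indptr_to_index(indptr)  # [0, 0, 1, 1, 1, 2]
assert segment_csr_sum(src, indptr) == segment_coo_sum(src, index, 3) == [3, 12, 6]
```

Both paths produce the same result; the difference discussed in this thread is purely about how the work is distributed across GPU threads.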
This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?
My recent application uses the `segment_csr` and `segment_coo` methods. I can convert data in CSR format to the input format required by `segment_coo` without a hitch, and in most cases `segment_coo` is faster than `segment_csr`.

At the macro level, I think the fact that `segment_coo` requires its input index to be ordered may be the reason why its performance is better than that of `segment_csr`. But could you go further and explain the performance difference between the two functions from a low-level implementation perspective (such as multi-threaded access conflicts, etc.)? And for a given `src` and `index`, is it possible to split them into multiple parts and call `segment_coo` on each part separately, to reduce thread collisions and possibly achieve an overall performance improvement?
Last question: there are different implementations of torch_scatter for different PyTorch versions. Do the newer versions have significant performance improvements?

Thank you, looking forward to your reply!