openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

`ucc_context_create` seems to take a constant time #881

Status: open (opened by nirandaperera 7 months ago)

nirandaperera commented 7 months ago

Hi, I am trying to measure the overhead of creating a UCC communication context on a 4-node cluster with 64 slots per node.

[Figure: average time of create ctx and average time of destroy ctx]

The graph shows the average time spent in ucc_context_create and ucc_context_destroy over 50 iterations. I was surprised to see that context creation takes roughly constant time across the 4 nodes. Is this the expected behavior?

I also measured the OOB allgather operation (the time between when the request was created and when it completed), and it doesn't come close to this 3 s mark. Can anyone shed some light on this?
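A minimal sketch of the measurement described above, assuming a simple wall-clock harness around one create/destroy pair per iteration. `bench`, `op_fn`, `dummy_create`, and `dummy_destroy` are hypothetical names; in an actual benchmark the two callbacks would be wrappers the reader writes around `ucc_context_create` and `ucc_context_destroy`. The dummy operations only exist so the sketch compiles without UCC installed.

```c
/* clock_gettime/CLOCK_MONOTONIC are POSIX; request them before any header. */
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Callback type for the operation being timed (hypothetical; a real
 * benchmark would wrap ucc_context_create / ucc_context_destroy here). */
typedef int (*op_fn)(void *state);

typedef struct {
    double avg_create;   /* mean seconds per create call */
    double avg_destroy;  /* mean seconds per destroy call */
} bench_result;

/* Monotonic wall-clock time in seconds, unaffected by system clock changes. */
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs create+destroy `iters` times and stores the average duration of each.
 * Returns 0 on success, -1 if any callback reports failure. */
static int bench(op_fn create, op_fn destroy, void *state, int iters,
                 bench_result *out) {
    assert(iters > 0);
    double total_create = 0.0, total_destroy = 0.0;
    for (int i = 0; i < iters; ++i) {
        double t0 = now_sec();
        if (create(state) != 0) return -1;
        double t1 = now_sec();
        if (destroy(state) != 0) return -1;
        double t2 = now_sec();
        total_create += t1 - t0;
        total_destroy += t2 - t1;
    }
    out->avg_create = total_create / iters;
    out->avg_destroy = total_destroy / iters;
    return 0;
}

/* Dummy stand-ins: track how many "contexts" are alive. */
static int dummy_create(void *state) { (*(int *)state)++; return 0; }
static int dummy_destroy(void *state) { (*(int *)state)--; return 0; }
```

Usage would look like `bench(my_create, my_destroy, &my_state, 50, &result)`, with `my_create`/`my_destroy` standing in for the reader's own UCC wrappers. Timing create and destroy separately, rather than the pair, keeps the two averages independent, matching the two series in the graph.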

Sergei-Lebedev commented 7 months ago

> The graph shows the average time spent in ucc_context_create and ucc_context_destroy over 50 iterations. I was surprised to see that context creation takes roughly constant time across the 4 nodes. Is this the expected behavior?

`ucc_context_create` time depends on multiple factors, e.g., which UCC TLs you are using and whether the context is global. Roughly constant creation time is expected if you run with TL UCP only, since endpoints are not connected in advance; instead, a connection between two peers is established only when it is actually needed.
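To illustrate why on-demand connection keeps context creation cheap, here is a minimal model of the idea. It is not UCC's actual code: `lazy_ctx`, `endpoint`, `lazy_get_ep`, and the other names are hypothetical, and "connecting" is modeled as a plain allocation where a real transport would perform its handshake.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-peer endpoint; a real one would hold transport state. */
typedef struct {
    int peer;  /* rank this endpoint talks to */
} endpoint;

/* Context with one endpoint slot per peer, filled only on first use. */
typedef struct {
    int n_peers;
    endpoint **eps;    /* eps[i] stays NULL until peer i is first contacted */
    int n_connected;   /* number of endpoints created so far */
} lazy_ctx;

/* Context init creates no endpoints: no per-peer connection work happens here. */
static int lazy_ctx_init(lazy_ctx *ctx, int n_peers) {
    ctx->n_peers = n_peers;
    ctx->n_connected = 0;
    ctx->eps = malloc((size_t)n_peers * sizeof *ctx->eps);
    if (ctx->eps == NULL) return -1;
    for (int i = 0; i < n_peers; ++i) ctx->eps[i] = NULL;
    return 0;
}

/* Returns the endpoint for `peer`, creating ("connecting") it on first use. */
static endpoint *lazy_get_ep(lazy_ctx *ctx, int peer) {
    assert(peer >= 0 && peer < ctx->n_peers);
    if (ctx->eps[peer] == NULL) {
        endpoint *ep = malloc(sizeof *ep);
        if (ep == NULL) return NULL;
        ep->peer = peer;  /* a real transport would connect here */
        ctx->eps[peer] = ep;
        ctx->n_connected++;
    }
    return ctx->eps[peer];
}

/* Frees every endpoint that was actually created, then the slot array. */
static void lazy_ctx_destroy(lazy_ctx *ctx) {
    for (int i = 0; i < ctx->n_peers; ++i) free(ctx->eps[i]);
    free(ctx->eps);
    ctx->eps = NULL;
    ctx->n_connected = 0;
}
```

With 4 nodes × 64 slots = 256 ranks, `lazy_ctx_init(&ctx, 256)` creates zero endpoints; a rank that later talks to only a handful of peers pays the connection cost for just those peers, at first use rather than at context creation.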

> I also measured the OOB allgather operation (the time between when the request was created and when it completed), and it doesn't come close to this 3 s mark. Can anyone shed some light on this?

Again, it depends on which TLs are being used. For TL UCP, context creation also includes initializing the UCP context and the UCP worker, so it covers more than the OOB allgather alone.