openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCP Initialization Fail on Cray #3084

Open hjelmn opened 5 years ago

hjelmn commented 5 years ago

Right now UCP is trying to send a message of 145 bytes using the datagram protocol when the limit is 128 bytes. This leads to the following error:

[nid00036:38423:0:38423] ugni_udt_ep.c:178  UCX Assertion `msg_length <= 128' failed: msg_length=145

Just documenting the bug. I will not try to fix it.
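
To make the numbers concrete, here is a minimal sketch of the length check that fires. It is not the UCX source: `UDT_MAX_MSG` and `check_wireup_len` are hypothetical names, and the only values taken from the report are the 128-byte limit and the 145-byte message.

```python
# Hypothetical model of the length check behind the assertion in ugni_udt_ep.c.
# Only the two numbers (128 and 145) come from the error message above.
UDT_MAX_MSG = 128  # datagram payload limit quoted in the assertion


def check_wireup_len(msg_length: int, limit: int = UDT_MAX_MSG) -> int:
    """Return by how many bytes msg_length exceeds limit (0 if it fits)."""
    return max(0, msg_length - limit)


overshoot = check_wireup_len(145)
print(f"wireup message exceeds the datagram limit by {overshoot} bytes")
```

Under these assumptions the wireup message would need to shrink by at least that many bytes, or be split, to pass the check.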

hjelmn commented 5 years ago

@yosefe Who can fix this? I plan to just document this in Open MPI and point users at the documentation if they try to use UCX on a Cray.

shamisp commented 5 years ago

@yosefe @MattBBaker didn't we fix this in the past? Can we have a hardcoded assert that fires when people bloat it? @hjelmn FYI, LANL will be adding full UCX regression testing for Cray. One way or another we have to fix this.

MattBBaker commented 5 years ago

Unfortunately the assert only helps if you're trying to hit it with ugni. The UCP wireup protocol will have to be fixed to do something smart when its aux transport is too small. I see some code in there sketching out something for a multipart wireup?
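
As a rough illustration of what a multipart wireup could look like, the sketch below splits an oversized message into datagrams that each fit the 128-byte limit and reassembles them on the other side. Everything here is an assumption made for illustration: the 4-byte fragment header, the names `split_wireup` and `reassemble`, and the framing itself are hypothetical and are not taken from the UCP wireup code.

```python
import struct

# Hypothetical fragment header: message id (2 bytes), fragment index (1 byte),
# fragment count (1 byte). This is NOT the UCP wireup wire format.
HEADER = struct.Struct("!HBB")


def split_wireup(msg: bytes, msg_id: int, max_dgram: int = 128) -> list[bytes]:
    """Split msg into datagrams of at most max_dgram bytes, header included."""
    chunk = max_dgram - HEADER.size
    parts = [msg[i:i + chunk] for i in range(0, len(msg), chunk)] or [b""]
    return [HEADER.pack(msg_id, idx, len(parts)) + part
            for idx, part in enumerate(parts)]


def reassemble(fragments: list[bytes]) -> bytes:
    """Rebuild the original message from fragments received in any order."""
    pieces = {}
    total = None
    for frag in fragments:
        _msg_id, idx, count = HEADER.unpack_from(frag)
        pieces[idx] = frag[HEADER.size:]
        total = count
    if total is None or len(pieces) != total:
        raise ValueError("missing fragments")
    return b"".join(pieces[i] for i in range(total))


if __name__ == "__main__":
    wireup = bytes(range(145))           # same size as the failing message
    dgrams = split_wireup(wireup, msg_id=1)
    print([len(d) for d in dgrams])      # every datagram is <= 128 bytes
    assert reassemble(dgrams) == wireup
```

The trade-off of this kind of scheme is extra round trips and reassembly state on the aux transport, which is why shrinking the wireup payload may be the simpler route if it can be made to fit.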

shamisp commented 5 years ago

It was discussed during the f2f. I think we have a path forward to fix this.

hppritcha commented 5 years ago

@shamisp we'd be interested in helping to test any fix there may be for this. Should #3231 help?

shamisp commented 5 years ago

@hppritcha it was discussed on the call and the answer is no. The UCP or SMSG exchange protocols and data structures have to be updated.

hppritcha commented 5 years ago

@snyjm-18 and I discussed this some and since

we decided to not pursue a solution on the Cray systems at this time.

shamisp commented 5 years ago

@hppritcha - can you please provide more information? Since homogeneous wire-up was introduced in UCX, we can easily drop most of the info from the packet header without any redesign. Probably between 50-100 COL?

hppritcha commented 5 years ago

@shamisp when was this added? Do we need to do anything special to enable homogeneous wire-up?

shamisp commented 5 years ago

It was added a while ago to UCP. It reduces the UCP header, but it is not enough. I think we can pick this up at the ugni UCT and strip off all the unnecessary info. With UGNI we know it is a single transport, with the same bandwidth and latency everywhere, etc.

snyjm-18 commented 5 years ago

@hjelmn Can you send me your configure line for ompi?

snyjm-18 commented 5 years ago

I have been looking into this. What is unified mode? It looks like it can allow for a smaller wireup message size on homogeneous systems, which would save us 45+ bytes. I assume a Cray would qualify as a homogeneous system. I couldn't find where this is set, though. @yosefe @shamisp

snyjm-18 commented 5 years ago

Used the environment variable to set unified mode, but now it just hangs.
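
For anyone trying to reproduce this, a small wrapper along the following lines may help separate a hang from a slow start. It is only a sketch: it assumes the variable in question is `UCX_UNIFIED_MODE` (the comment above does not name it), the helper name `run_with_unified_mode` is made up, and the command shown is a stand-in that just echoes the variable, not a real MPI launch.

```python
import os
import subprocess
import sys


def run_with_unified_mode(cmd, timeout_s=30.0):
    """Run cmd with UCX_UNIFIED_MODE=y set; return None if it times out."""
    # Assumption: UCX_UNIFIED_MODE is the variable referred to in the thread.
    env = dict(os.environ, UCX_UNIFIED_MODE="y")
    try:
        return subprocess.run(cmd, env=env, capture_output=True,
                              text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before the exception is raised.
        return None


if __name__ == "__main__":
    # Stand-in command; replace with the actual launcher and test binary.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['UCX_UNIFIED_MODE'])"]
    result = run_with_unified_mode(probe)
    print("timed out (possible hang)" if result is None
          else result.stdout.strip())
```

A timeout only shows that the process did not finish in the allotted time; a stack trace from the stuck process would still be needed to see where the wireup stalls.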