Open hjelmn opened 5 years ago
@yosefe Who can fix this? I plan to just document this in Open MPI and point users at the documentation if they try to use UCX on a Cray.
@yosefe @MattBBaker didn't we fix this in the past? Can we add a hardcoded assert for when people bloat it? @hjelmn FYI, LANL will be adding full UCX regression testing for Cray. One way or another, we have to fix this.
Unfortunately, the assert only helps if you're trying to hit it with ugni. The UCP wireup protocol will have to be fixed to do something smart when its aux transport is too small. I see some code in there sketching out something for a multipart wireup?
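To make the multipart idea concrete, here is a minimal sketch of splitting one oversized message into datagram-sized fragments. It is not the code referred to above and not a UCX or uGNI API: `frag_hdr_t`, `frag_count`, and `frag_pack` are hypothetical names, and the two-byte header layout is an assumption chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fragment header; NOT a UCX or uGNI structure. */
typedef struct {
    unsigned char index; /* position of this fragment in the message */
    unsigned char total; /* number of fragments in the whole message */
} frag_hdr_t;

/* Fragments needed to carry msg_len bytes when each datagram holds at most
 * max_dgram bytes including the header. Returns 0 if no payload can fit. */
static size_t frag_count(size_t msg_len, size_t max_dgram)
{
    if (max_dgram <= sizeof(frag_hdr_t)) {
        return 0;
    }
    size_t payload = max_dgram - sizeof(frag_hdr_t);
    return (msg_len + payload - 1) / payload; /* ceiling division */
}

/* Write fragment idx (header + payload slice) into out, which must have room
 * for max_dgram bytes. Returns bytes written, or 0 on invalid input. */
static size_t frag_pack(const void *msg, size_t msg_len, size_t max_dgram,
                        size_t idx, void *out)
{
    assert(msg != NULL && out != NULL);

    size_t total = frag_count(msg_len, max_dgram);
    if (total == 0 || total > 255 || idx >= total) {
        return 0; /* empty message, too many fragments, or bad index */
    }

    size_t payload = max_dgram - sizeof(frag_hdr_t);
    size_t offset  = idx * payload;
    size_t len     = msg_len - offset;
    if (len > payload) {
        len = payload;
    }

    frag_hdr_t hdr = { (unsigned char)idx, (unsigned char)total };
    memcpy(out, &hdr, sizeof(hdr));
    memcpy((char *)out + sizeof(hdr), (const char *)msg + offset, len);
    return sizeof(hdr) + len;
}
```

With the sizes reported in this issue (a 145-byte message and a 128-byte limit), this scheme would need two fragments. A real fix would also need reassembly and ordering on the receiving side, which this sketch leaves out.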
It was discussed during the f2f. I think we have a path forward to fix this.
@shamisp we'd be interested in helping to test any fix there may be for this. Should #3231 help?
@hppritcha it was discussed on the call and the answer is no. UCP or SMSG exchange protocols and data structures have to be updated.
@snyjm-18 and I discussed this some and since
we decided to not pursue a solution on the Cray systems at this time.
@hppritcha - can you please provide more information? Since homogenous wireup was introduced in UCX, we can easily drop most of the info from the packet header without any redesign. Probably between 50-100 COL?
@shamisp when was this added? Do we need to do anything special to enable homogenous wire-up?
It was added a while ago to UCP. It reduces the UCP header, but it is not enough. I think we can pick this up at the ugni UCT and strip off all the unnecessary info. With UGNI we know it is a single transport, with the same BW and latency everywhere, etc.
@hjelmn Can you send me your configure line for ompi?
I have been looking into this. What is unified mode? It looks like it can allow for a smaller wireup message size on homogenous systems, which would save us 45+ bytes. I assume a Cray would qualify as a homogenous system. I couldn't find where this is set, though. @yosefe @shamisp
I used the environment variable to set unified mode, but now it just hangs.
Right now UCP is trying to send a message of 145 bytes using the datagram protocol when the limit is 128 bytes. This leads to the following error:
Just documenting the bug. I will not try to fix it.
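For reference, the size arithmetic discussed in this thread can be sketched as below. The three constants come from the comments (145-byte message, 128-byte limit, "45+" bytes saved by unified mode); the helper names `fits_in_dgram` and `overshoot_bytes` are hypothetical and not part of UCX.

```c
#include <stddef.h>

enum {
    WIREUP_MSG_BYTES     = 145, /* message size reported in this issue */
    DGRAM_LIMIT_BYTES    = 128, /* datagram limit reported in this issue */
    UNIFIED_SAVING_BYTES = 45   /* lower bound of the "45+" bytes quoted above */
};

/* Nonzero if a message of msg_len bytes fits under the datagram limit. */
static int fits_in_dgram(size_t msg_len, size_t limit)
{
    return msg_len <= limit;
}

/* Bytes by which the message exceeds the limit, or 0 if it fits. */
static size_t overshoot_bytes(size_t msg_len, size_t limit)
{
    return msg_len > limit ? msg_len - limit : 0;
}
```

With these numbers the message is 17 bytes over the limit, and a 45-byte saving would bring it to 100 bytes, under the limit. That only covers the size check; as noted above, enabling unified mode led to a hang, so fitting under the limit may not be sufficient by itself.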