ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/util, efa: discard outstanding operations in util_srx_close #10139

Closed shijin-aws closed 3 months ago

shijin-aws commented 3 months ago

Currently, util_srx_close calls util_cancel_recv for the outstanding untagged and tagged recv posts. This does not comply with the man page for endpoint close, which states that outstanding operations posted to the endpoint when fi_close is called will be discarded, and that discarded operations will silently be dropped, with no completions reported.

shijin-aws commented 3 months ago

After reading the man page, I think even this fix is not correct:

https://ofiwg.github.io/libfabric/v1.21.0/man/fi_endpoint.3.html

Outstanding operations posted to the endpoint when fi_close is called will be discarded. Discarded operations will silently be dropped, with no completions reported. Additionally, a provider may discard previously completed operations from the associated completion queue(s). The behavior to discard completed operations is provider specific.

Shouldn't we just discard the operations here instead of canceling them (which generates a CQ error entry)?

aingerson commented 3 months ago

@shijin-aws I think you're right, we should just be able to discard it. We probably don't even need to iterate over the queues - as long as we free the memory in the bufpool itself, we can just ignore the dlist cleanup. However, we might have a slight problem with the wording of the man page and the util_srx/peer API implementation:

Canceling an operation causes the fabric provider to search for the operation and, if it is still pending, complete it as having been canceled. An error queue entry will be available in the associated error queue with error code FI_ECANCELED. On the other hand, if the operation completed before the call to fi_cancel, then the completion status of that operation will be available in the associated completion queue.

The man page is a little unclear about what exactly "pending" means and suggests that a receive has only two states - "pending" and "completed" - which gets a little fuzzy with the asynchronous peer provider flow. Since the peer provider srx doesn't have access to the rx entry while the peer is processing the incoming command, there is a gray area here: the srx can't find a match, so we assume the operation can't be canceled, but it hasn't exactly "completed" yet and might not be available in the CQ as the man page suggests.

We need to either add some language here specifying that an entry might be unable to be canceled without having completed, OR add tracking in the peer srx implementation that keeps a list of entries that are in peer space, so that list can be searched as well. I lean more toward the second option. It shouldn't be too hard - an extra slist, a search, and then a call to the srx->discard() function should do it. But we may need to add a specification that the owner can update the flags during the discard to specify that it should get written to the CQ with that flag. Could get a little spicy.

@j-xiong Do you have thoughts on this? Do you think we can treat an entry that has moved into peer space for processing as not being able to be canceled yet? Or do you think we should fix the util srx implementation to be able to update an entry that has moved to peer space?

j-xiong commented 3 months ago

When the rx entry has been moved to the peer for processing, we could consider the rx entry being consumed and not being able to be cancelled any more.

shijin-aws commented 3 months ago

I think you're right, we should just be able to discard it. We probably don't even need to iterate over the queues - as long as we free the memory in the bufpool itself, we can just ignore the dlist cleanup.

But if the rx pool is not created with OFI_BUFPOOL_NO_TRACK, you will hit an assertion failure on use_cnt during bufpool destroy. So an explicit ofi_buf_free is still needed here.

aingerson commented 3 months ago

@shijin-aws Yeah you'll still need the buf_free

shijin-aws commented 3 months ago

Ideally this should be covered by a general fabtest, but for the time being I just added a unit test for the EFA provider. @aingerson could you review the first commit? Thx!

j-xiong commented 3 months ago

how about util_cleanup_queues called a few lines above (L928-L929)?

shijin-aws commented 3 months ago

how about util_cleanup_queues called a few lines above (L928-L929)?

Good catch, will update

shijin-aws commented 3 months ago

Is the Intel CI failure related?

j-xiong commented 3 months ago

The SHMEM test mt_lock_test failed:

16:18:13  Running sos mt_lock_test
16:18:13  bash -c "export SHMEM_OFI_USE_PROVIDER=tcp; export SHMEM_SYMMETRIC_SIZE=4G; export SHMEM_DEBUG=1;  oshrun -n 2 -ppn 1 mt_lock_test"
16:18:13  None
16:18:13  Sandia OpenSHMEM 1.5.2
16:18:13  Build information:
16:18:13    Configure Args        '--disable-fortran' '--enable-pmi-simple'
16:18:13                          [0001] DEBUG: init.c:355: shmem_internal_init
16:18:13  [0001]        Thread level=MULTIPLE, Num. PEs=2
16:18:13  [0001]        Sym. heap=0x80000000 len=4296015872 -- data=0x603118 len=104
16:18:13  '--[0001] DEBUG: init.c:389: shmem_internal_init
16:18:13  [0001]        Affinity to 88 processor cores: { 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16:18:13                16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
16:18:13                40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
16:18:13                64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
16:18:13                }
16:18:13  enable-hard-polling' '--enable-manual-progress'
16:18:13    Build Date            Tue Jul  2 16:12:43 PDT 2024
16:18:13    Build CC              gcc
16:18:13    Build CFLAGS          -std=gnu11 -g -O2 -Wall -rdynamic -fvisibility=hidden
16:18:13  
16:18:13  [0000] DEBUG: transport_ofi.c:1308: query_for_fabric
16:18:13  [0000]        OFI provider: tcp;ofi_rxm, fabric: 192.168.20.0/24, domain: eth2, mr_mode: 0x0
16:18:13  [0000]        max_inject: 16, max_msg: 18446744073709551615, stx: no, stx_max: 0
16:18:13  [0001] DEBUG: transport_ofi.c:1308: query_for_fabric
16:18:13  [0001]        OFI provider: tcp;ofi_rxm, fabric: 192.168.20.0/24, domain: eth2, mr_mode: 0x0
16:18:13  [0001]        max_inject: 16, max_msg: 18446744073709551615, stx: no, stx_max: 0
16:18:13  [0000] DEBUG: transport_ofi.c:1124: allocate_fabric_resources
16:18:13  [0000]        OFI version: built 1.21, cur. 1.21; provider version: 121.0
16:18:13  [0001] DEBUG: transport_ofi.c:1124: allocate_fabric_resources
16:18:13  [0001]        OFI version: built 1.21, cur. 1.21; provider version: 121.0
16:18:13  [0000] DEBUG: init.c:419: shmem_internal_init
16:18:13  [0000]        Local rank=0, Num. local=1, Shr. rank=-1, Num. shr=0
16:18:13  [0001] DEBUG: init.c:419: shmem_internal_init
16:18:13  [0001]        Local rank=0, Num. local=1, Shr. rank=-1, Num. shr=0
16:18:13  [0000] DEBUG: shmem_team.c:123: shmem_internal_team_init
16:18:13  [0000]        SHMEM_TEAM_SHARED: start=0, stride=1, size=1
16:18:13  [0001] DEBUG: shmem_team.c:123: shmem_internal_team_init
16:18:13  [0001]        SHMEM_TEAM_SHARED: start=1, stride=1, size=1
16:18:13  Starting MT locking test on 2 PEs, 4 threads/PE
16:18:13  [0000] DEBUG: transport_ofi.c:1719: shmem_transport_ctx_destroy
16:18:13  [0000]        id = -1, options = 0, stx_idx = -1
16:18:13  [0000]        pending_put_cntr =        76, completed_put_cntr =        76
16:18:13  [0000]        pending_get_cntr =       111, completed_get_cntr =       111
16:18:13  [0000]        pending_bb_cntr  =         0, completed_bb_cntr  =         0
16:18:13  Succes[0001] DEBUG: transport_ofi.c:1719: shmem_transport_ctx_destroy
16:18:13  [0001]        id = -1, options = 0, stx_idx = -1
16:18:13  [0001]        pending_put_cntr =        76, completed_put_cntr =        76
16:18:13  [0001]        pending_get_cntr =       108, completed_get_cntr =       108
16:18:13  [0001]        pending_bb_cntr  =         0, completed_bb_cntr  =         0
16:18:13  s
16:18:13  1: Error expected 32, got 31
16:18:13  Return code is 1
16:18:13  exiting with 1
shijin-aws commented 3 months ago

@j-xiong do you think the error can be related? It seems it uses the tcp provider, and tcp doesn't use util srx.

j-xiong commented 3 months ago

Probably unrelated. The last push was EFA test changes plus a rebase, and CI did pass for the push before that. I restarted the CI run anyway.