ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

Open IB BTL error spam (retry_exceeded_error) #16

Closed by abouteiller 6 years ago

abouteiller commented 6 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


[[43897,1],10][../../../../../trunk/ompi/mca/btl/openib/btl_openib_component.c:3504:handle_wc] from dancer01 to: dancer03 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 13dec80 opcode 128 vendor error 129 qp_idx 0
[[43897,1],10][../../../../../trunk/ompi/mca/btl/openib/btl_openib_component.c:3504:handle_wc] from dancer01 to: dancer03 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 1663300 opcode 128 vendor error 244 qp_idx 0

--------------------------------------------------------------------------
Process 23584 Unable to locate the parameter file "enable-ft-mpi" in the following search path:
/home/bouteill/ompi/FTMPI3/RTS-ULFM/run/share/openmpi/amca-param-sets:/home/bouteill/ompi/FTMPI3/ul
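
For triage, the handle_wc lines above can be picked apart mechanically. The following is a minimal sketch, not part of Open MPI: the function name parse_wc_error and the regular expression are hypothetical and assume the message layout shown in this report. The numeric codes are taken from libibverbs' enum ibv_wc_status, where 12 is the transport retry counter being exceeded (the remote side stopped acknowledging) and 5 is a work request flushed because the queue pair entered the error state.

```python
import re

# Subset of libibverbs' enum ibv_wc_status, limited to values relevant here.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Hypothetical pattern, written against the two log lines in this report only.
WC_LINE = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+) error polling (?P<cq>LP|HP) CQ "
    r"with status (?P<text>.+?) status number (?P<status>\d+) "
    r"for wr_id (?P<wr_id>[0-9a-f]+) opcode (?P<opcode>\d+) "
    r"vendor error (?P<vendor>\d+) qp_idx (?P<qp>\d+)"
)


def parse_wc_error(line):
    """Return the fields of one handle_wc error line, or None if no match."""
    m = WC_LINE.search(line)
    if m is None:
        return None
    status = int(m.group("status"))
    return {
        "src": m.group("src"),
        "dst": m.group("dst"),
        "cq": m.group("cq"),
        "status": status,
        "status_name": IBV_WC_STATUS.get(status, "unknown"),
        "wr_id": m.group("wr_id"),
        "vendor_error": int(m.group("vendor")),
    }
```

Counting parsed records per (src, dst) pair is one way to see which peer is producing the spam.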
abouteiller commented 6 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


This is not a harmless error message. While this "retry" activity is ongoing, progress stalls.

abouteiller commented 6 years ago

too many retries sending message to 0x0003:0x00d00412, giving up
abouteiller commented 6 years ago

Progress stalls completely after such reports arrive. Needs to be fixed.

abouteiller commented 6 years ago

Principal issue resolved in 04f61d22 and 79aca0bb: credits held by communication with failed procs were never returned.
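
To illustrate why unreturned credits can stall progress, here is a toy model of credit-based flow control. It is a sketch under stated assumptions, not the openib BTL code or the content of the commits above: the CreditPool class, its methods, the single shared pool, and the peer names are all hypothetical simplifications.

```python
class CreditPool:
    """Toy flow-control pool: each in-flight send holds one credit."""

    def __init__(self, credits):
        self.available = credits
        self.in_flight = {}  # peer name -> credits held by pending sends

    def try_send(self, peer):
        """Take one credit for a send; return False if none is left."""
        if self.available == 0:
            return False  # the caller has to wait, so progress stalls
        self.available -= 1
        self.in_flight[peer] = self.in_flight.get(peer, 0) + 1
        return True

    def complete(self, peer):
        """Normal completion: the send finished and its credit comes back."""
        if self.in_flight.get(peer, 0) == 0:
            raise ValueError("no send in flight for " + peer)
        self.in_flight[peer] -= 1
        self.available += 1

    def peer_failed(self, peer, return_credits=True):
        """Drop pending sends to a failed peer.

        With return_credits=False the credits leak, which models the
        behavior described in this issue (hypothetical simplification).
        """
        held = self.in_flight.pop(peer, 0)
        if return_credits:
            self.available += held
        return held


if __name__ == "__main__":
    for fixed in (False, True):
        pool = CreditPool(2)
        pool.try_send("failed_peer")
        pool.try_send("failed_peer")
        pool.peer_failed("failed_peer", return_credits=fixed)
        print("credits returned:", fixed,
              "-> send to live peer ok:", pool.try_send("live_peer"))
```

In this model, sends to a peer that never completes them exhaust the pool, and traffic to healthy peers then waits forever unless the failure path hands the credits back.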

Issue remains for UDCM failures (and presumably for rdmacm as well).

abouteiller commented 6 years ago

UDCM issue resolved in 2fb5440a. RDMACM does not appear to have a similar code path, and the issue could not be reproduced on our EDR hardware.