Hello,
I encountered an error in the get_pathrecord_info function of ompi/mca/btl/openib/connect/btl_openib_connect_sl.c. For larger numbers (in my case >4) of MPI processes I see the following errors:
[[34818,1],0][connect/btl_openib_connect_sl.c:238:get_pathrecord_info] error posting receive on QP [0x3a0050] errno says: Success [0]
The return value of ibv_post_recv is '12' (ENOMEM on Linux). ibverbs returns the error code instead of setting errno, therefore the message reports 'Success [0]'.
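To illustrate the reporting issue, here is a minimal sketch. Note that mock_post_recv and describe_post_error are hypothetical names made up for this example; they are not part of libibverbs or Open MPI. The mock only mimics the convention described above (the error code comes back as the return value and errno is left alone), and the message is built from that return value:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a verbs-style call (NOT a real libibverbs
 * function): on failure it returns an errno value directly and leaves
 * the global errno untouched. */
static int mock_post_recv(int queue_is_full)
{
    return queue_is_full ? ENOMEM : 0;
}

/* Build the diagnostic from the return code rather than from errno,
 * so the text matches the failure that actually happened. */
static void describe_post_error(int rc, char *buf, size_t len)
{
    snprintf(buf, len, "error posting receive: %s [%d]", strerror(rc), rc);
}
```

With this pattern the message would name the real cause instead of 'Success', since errno is never consulted.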
ibv_post_recv is called multiple times for the same QP (sa_qp_cache), which was set up in init_device() in the same file. So, for every SL query, the get_pathrecord_info function adds one WR to the SA_QP until its queue is full.
Moving the first ibv_post_recv call:
struct ibv_recv_wr *brwr;
from the get_pathrecord_info function to the end of init_device() solved the problem for me. (But I'm not sure whether this is the appropriate position for the call.)
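The difference between the two placements can be sketched with a toy model. Everything here is an assumption made for illustration: mock_recv_queue, mock_queue_post, run_post_per_query and run_post_once are hypothetical names, the capacity is arbitrary, and the model does not try to reproduce how the real QP consumes or recycles work requests. It only shows why one post per query eventually overflows a fixed-size queue while a single post at init time does not:

```c
#include <errno.h>

/* Hypothetical model of a receive queue with a fixed number of WR slots.
 * This is NOT the libibverbs data structure. */
struct mock_recv_queue {
    int capacity; /* number of WR slots (arbitrary in this sketch) */
    int posted;   /* WRs currently posted */
};

/* Post one WR; return ENOMEM as the return value when the queue is full. */
static int mock_queue_post(struct mock_recv_queue *q)
{
    if (q->posted >= q->capacity)
        return ENOMEM;
    q->posted++;
    return 0;
}

/* Pattern from the report: every query posts another WR.
 * Returns the index of the first query whose post fails, or -1. */
static int run_post_per_query(struct mock_recv_queue *q, int queries)
{
    for (int i = 0; i < queries; i++) {
        if (mock_queue_post(q) != 0)
            return i;
    }
    return -1;
}

/* Workaround from the report: post once during initialization.
 * The queries themselves post nothing in this model. */
static int run_post_once(struct mock_recv_queue *q, int queries)
{
    (void)queries;
    return mock_queue_post(q);
}
```

In this model, a queue with 4 slots fails on the fifth query when posting per query, whereas posting once leaves a single WR in the queue no matter how many queries follow.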
I saw this problem while working with OMPI 1.6.3, but the trunk has the same bug.
Regards, Jens