open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Too many calls to ibv_post_recv in get_pathrecord_info results in 'out of memory' errors #165

Closed ompiteam closed 3 years ago

ompiteam commented 10 years ago

Hello,

I encountered an error in the get_pathrecord_info function of ompi/mca/btl/openib/connect/btl_openib_connect_sl.c. With larger numbers of MPI processes (in my case, more than 4) I see the following error:

[[34818,1],0][connect/btl_openib_connect_sl.c:238:get_pathrecord_info] error posting receive on QP [0x3a0050] errno says: Success [0]

The return value of ibv_post_recv is 12, i.e. ENOMEM. ibverbs returns the error code instead of setting errno, which is why the message reports 'Success [0]'.
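The mismatch between the return value and errno can be reproduced without any InfiniBand hardware. The following is only a minimal sketch: stub_post_recv is a hypothetical stand-in for ibv_post_recv (it is not a libibverbs function), and describe_post_error is a hypothetical helper that formats both values side by side.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for ibv_post_recv(): it reports failure through
 * its return value and leaves errno untouched. */
static int stub_post_recv(void)
{
    return ENOMEM;
}

/* Hypothetical helper: formats the failure from the return code and,
 * for comparison, from errno. Returns the code from stub_post_recv(). */
static int describe_post_error(char *buf, size_t len)
{
    int rc;
    int saved_errno;

    assert(buf != NULL && len > 0);

    errno = 0;
    rc = stub_post_recv();
    saved_errno = errno;   /* capture before any other library call */

    if (0 != rc) {
        snprintf(buf, len, "rc says: %s [%d], errno says: %s [%d]",
                 strerror(rc), rc, strerror(saved_errno), saved_errno);
    } else {
        buf[0] = '\0';
    }
    return rc;
}
```

Under these assumptions, only the part of the message built from rc names the real cause; the part built from errno still shows 0.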

ibv_post_recv is called multiple times for the same QP (sa_qp_cache), which was set up in init_device() in the same file. As a result, every SL query in get_pathrecord_info adds one WR to the SA QP until its receive queue is full.

Moving the first ibv_post_recv call:

struct ibv_recv_wr *brwr;

rc = ibv_post_recv(cache->qp, &(cache->rwr), &brwr);
if (0 != rc) {
    BTL_ERROR(("error posing receive on QP[%x] errno says: %s [%d]",
               cache->qp->qp_num, strerror(errno), errno));
    return OMPI_ERROR;
}

from the get_pathrecord_info function to the end of init_device() solved the problem for me. (But I'm not sure if this is the appropriate position for the call.)

I saw this problem while working with OMPI 1.6.3, but the trunk has the same bug.

Regards, Jens

ompiteam commented 10 years ago

Imported from trac issue 3420. Created by domke on 2012-12-09T20:38:27, last modified: 2012-12-14T07:29:35

awlauria commented 3 years ago

The openib BTL has been removed. Closing.