ofi-cray / libfabric-cray

Open Fabric Interfaces
http://ofiwg.github.io/libfabric/
Other
16 stars 9 forks source link

FI_SOURCE_ERR and fi_cq_readerr invalid err_data_size #1409

Closed soumagne closed 7 years ago

soumagne commented 7 years ago

When FI_SOURCE_ERR is set and using fi_cq_readerr(), the gni provider is still unable to retrieve/provide the correct source address to the user when doing something like:

struct fi_cq_err_entry cq_err;

memset(&cq_err, 0, sizeof(cq_err));
fi_cq_readerr(cq_hdl, &cq_err, 0 /* flags */);
if (cq_err.err == FI_EADDRNOTAVAIL) {
    fi_addr_t addr;

    fi_av_insert(av_hdl, cq_err.err_data, 1, &addr, 0 /* flags */, NULL /* context */);
}

When setting a buffer pointer to cq_err.err_data and setting cq_err.err_data_size to 256 bytes for example, I noticed that the size that I was getting back for cq_err.err_data_size was always 1.

Looking at the code line 575 of gnix_cq.c, you've got:

err_data_cpylen = sizeof(*cq_priv->err_data);

but looking at the struct definition, you've got:

struct gnix_fid_cq {
[...]
        char err_data[GNIX_CQ_MAX_ERR_DATA_SIZE];
};

which means that the size copied is always going to be one char. Also GNIX_CQ_MAX_ERR_DATA_SIZE is set to 64, which means it is probably still going to be too small to be able to copy a source address (if I remember correctly the size of addresses was larger than that).

hppritcha commented 7 years ago

Hmmm... this path is suppose to be used only if the application has requested API VERSION less than 1.5. It may be this isn't set correctly somehow within the GNI provider.

soumagne commented 7 years ago

I don't think that matters that much as it looks like both code paths are only copying one byte max, which I assume is not enough to get a source address... neither is the GNIX_CQ_MAX_ERR_DATA_SIZE 64 bytes limit. So line 575 of gni_cq.c:

err_data_cpylen = sizeof(*cq_priv->err_data);

should be instead

err_data_cpylen = sizeof(cq_priv->err_data);

and line 585, instead of:

err_data_cpylen = MIN(buf->err_data_size, sizeof(*cq_priv->err_data));

it should be:

err_data_cpylen = MIN(buf->err_data_size, sizeof(cq_priv->err_data));
hppritcha commented 7 years ago

uff, you're right. thanks! I'll fix it.

hppritcha commented 7 years ago

looks like we've not been testing this path well, probably the reason this wasn't caught. I don't think the MIN is actually correct.

soumagne commented 7 years ago

yeah... it would probably not return something correct if the user passed err_data_size is too small