pmodels / pilgrim

Logger for MPI communication

Assertion error when running on different machines #40

Open khatharsis42 opened 9 months ago

khatharsis42 commented 9 months ago

Issue description

Whenever I use Pilgrim to trace an MPI program running on my local machine, I have no issue. However, as soon as I run it across another machine, I get the following error: src/pilgrim_mpi_objects.c:172: create_request_id: Assertion 'entry == NULL' failed.

Steps to reproduce

I'm using MPICH 4.0.2 and the latest version of Pilgrim. I have two nodes available, localnode and remotenode. Running

mpirun -np N --host localnode,remotenode -LD_PRELOAD <path to libpilgrim.so> <my executable>

yields the aforementioned error as soon as N is greater than 1. If I remove the remote node, I can make N as large as I want.

Possible fix

The line mentioned in the error is the following:

int create_request_id(MPI_Request *req, bool from_universal_pool, int func_id, int src_or_dst, int tag, int comm) {
    if(req==NULL || *req == MPI_REQUEST_NULL)
        return invalid_request_id;

    RequestHash *entry = request_hash_entry(req);
    assert(entry == NULL); // <- this one

I've removed this assertion, and so far I've seen nothing weird happening. I have no idea as to whether that assertion is important.
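For anyone who wants a middle ground between keeping the assertion and deleting it, here is a minimal, untested sketch of the idea: report the unexpected entry and replace it instead of aborting. Everything in it is a hypothetical stand-in (ToyRequestEntry, toy_lookup, toy_create_request_id, the plain long used in place of an MPI_Request handle). It does not reflect Pilgrim's real RequestHash or its lookup code, and it says nothing about why the entry is already present on multi-node runs.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins: Pilgrim's real RequestHash and lookup differ. */
#define TABLE_SIZE 64
#define INVALID_REQUEST_ID (-1)

typedef struct {
    int  in_use;      /* 1 if this slot holds a live entry */
    long handle;      /* stand-in for the MPI_Request handle value */
    int  request_id;  /* symbolic id assigned to the handle */
} ToyRequestEntry;

static ToyRequestEntry table[TABLE_SIZE];
static int next_request_id = 0;
static int stale_entries_seen = 0;  /* how often the old assertion would have fired */

/* Linear scan; returns NULL when the handle is not tracked. */
static ToyRequestEntry *toy_lookup(long handle) {
    for (int i = 0; i < TABLE_SIZE; i++)
        if (table[i].in_use && table[i].handle == handle)
            return &table[i];
    return NULL;
}

/* Returns the first unused slot, or NULL when the table is full. */
static ToyRequestEntry *toy_free_slot(void) {
    for (int i = 0; i < TABLE_SIZE; i++)
        if (!table[i].in_use)
            return &table[i];
    return NULL;
}

/* Assigns a fresh id to a handle. If the handle is already tracked,
 * warn and overwrite the old entry instead of aborting. */
int toy_create_request_id(long handle) {
    ToyRequestEntry *entry = toy_lookup(handle);
    if (entry != NULL) {
        /* This is the situation the original assertion rejects. */
        stale_entries_seen++;
        fprintf(stderr,
                "warning: handle %ld already tracked as id %d; replacing\n",
                handle, entry->request_id);
    } else {
        entry = toy_free_slot();
        if (entry == NULL)
            return INVALID_REQUEST_ID;
        assert(!entry->in_use);  /* internal invariant of toy_free_slot */
        entry->in_use = 1;
        entry->handle = handle;
    }
    entry->request_id = next_request_id++;
    return entry->request_id;
}

/* Stops tracking a handle, as a completed request would. */
void toy_free_request_id(long handle) {
    ToyRequestEntry *entry = toy_lookup(handle);
    if (entry != NULL)
        entry->in_use = 0;
}
```

Creating, freeing, and re-creating the same handle goes through the normal path. Creating it twice without freeing it in between takes the warning path and increments stale_entries_seen. A counter like this would at least show how often the condition occurs during a run, which might help narrow down the cause.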

wangvsa commented 9 months ago

Hi @khatharsis42, can you try tracing different applications to see if you get the same error? And is it possible to share your code so I can debug?

khatharsis42 commented 8 months ago

I'm using Pilgrim to trace a few mini-apps, and I've seen that particular bug when tracing AMG and Lulesh, but only once I use enough MPI processes: there is no problem with 8, but the bug appears with 27. Interestingly, I've had no issue with Kripke.

wangvsa commented 8 months ago

Thanks. I'll test AMG and get back to you.