ofi-cray / libfabric-cray

Open Fabric Interfaces
http://ofiwg.github.io/libfabric/

Running libfabric/GNI and MPICH/GNI in parallel #1406

Open bertwesarg opened 6 years ago

bertwesarg commented 6 years ago

Dear all,

I'm not sure if this is the right forum, but anyway:

We would like to use libfabric with the GNI provider from inside an MPI application that uses MPICH/GNI on a Cray XC40 platform, but the two do not seem to play well together with respect to threads. We start our own thread that makes only libfabric calls and no MPI calls; the main thread does the reverse. Nevertheless, we get abort()s from inside the libugni library when the main thread makes MPI calls.

Here are two examples:

#0  0x00002aaaaccee875 in raise () from /lib64/libc.so.6            
#1  0x00002aaaaccefe51 in abort () from /lib64/libc.so.6            
#2  0x00002aaaaf9854ea in GNI_PostDataProbeById () from /opt/cray/ugni/default/lib64/libugni.so.0
#3  0x00002aaaabd27050 in MPID_nem_gni_datagram_poll () from /opt/cray/lib64/libmpich_gnu_51.so.3
#4  0x00002aaaabd23728 in MPID_nem_gni_poll () from /opt/cray/lib64/libmpich_gnu_51.so.3
#5  0x00002aaaabd01286 in MPIDI_CH3I_Progress () from /opt/cray/lib64/libmpich_gnu_51.so.3
#6  0x00002aaaabc0f67d in MPIR_Waitall_impl () from /opt/cray/lib64/libmpich_gnu_51.so.3
#7  0x00002aaaabc0fe96 in PMPI_Waitall () from /opt/cray/lib64/libmpich_gnu_51.so.3
#0  0x00002aaaaccee875 in raise () from /lib64/libc.so.6                                                                                 
#1  0x00002aaaaccefe51 in abort () from /lib64/libc.so.6                                                                                                                                                                                                                          
#2  0x00002aaaaf984cba in GNI_EpPostDataWId () from /opt/cray/ugni/default/lib64/libugni.so.0                                            
#3  0x00002aaaabd26574 in MPID_nem_gni_datagram_directed_post () from /opt/cray/lib64/libmpich_gnu_51.so.3                                                                                                                                                                        
#4  0x00002aaaabd31a64 in MPID_nem_gni_smsg_cm_progress_req () from /opt/cray/lib64/libmpich_gnu_51.so.3
#5  0x00002aaaabd33a59 in MPID_nem_gni_smsg_cm_send_conn_req () from /opt/cray/lib64/libmpich_gnu_51.so.3                                
#6  0x00002aaaabd1ccdc in MPID_nem_gni_iSendContig_start () from /opt/cray/lib64/libmpich_gnu_51.so.3                                    
#7  0x00002aaaabd1d5f3 in MPID_nem_gni_iStartContigMsg () from /opt/cray/lib64/libmpich_gnu_51.so.3                                      
#8  0x00002aaaabcfad2b in MPIDI_CH3_iStartMsgv () from /opt/cray/lib64/libmpich_gnu_51.so.3
#9  0x00002aaaabd2c6ce in MPID_nem_gni_lmt_initiate_lmt () from /opt/cray/lib64/libmpich_gnu_51.so.3                                     
#10 0x00002aaaabd0e842 in MPID_nem_lmt_RndvSend () from /opt/cray/lib64/libmpich_gnu_51.so.3                                             
#11 0x00002aaaabcedd1e in MPID_Isend () from /opt/cray/lib64/libmpich_gnu_51.so.3                                                        
#12 0x00002aaaabc02db4 in PMPI_Isend () from /opt/cray/lib64/libmpich_gnu_51.so.3

Is libugni prepared for this kind of usage at all?

Thanks.

jswaro commented 6 years ago

You have the correct forum. libugni's restrictions and requirements are the responsibility of the provider and the application. If the application and libfabric agree on a threading model, it is up to the provider to ensure the application can use that model as it is defined. I'd be interested to know what threading model was specified in the fi_getinfo call.
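For reference, this is the field I mean (a minimal sketch, not your code):

    /* minimal sketch: the threading model the application promises to obey
       is requested through the domain attributes in the fi_getinfo hints */
    struct fi_info* hints = fi_allocinfo();
    hints->domain_attr->threading = FI_THREAD_SAFE; /* or FI_THREAD_COMPLETION, ... */

    struct fi_info* info;
    int ret = fi_getinfo( FI_VERSION( 1, 0 ), NULL, NULL, 0ULL, hints, &info );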

jswaro commented 6 years ago

Based on your output though, I'd have to say that there are some aspects of these stack traces that don't seem valid to me. The main thread should not be calling libugni directly.

Edit: I misunderstood your original description. libugni is not written to be thread safe. You cannot use libfabric with the GNI provider side-by-side with libugni from a different context. There is a caveat to this, but it doesn't apply to your use case.

jswaro commented 6 years ago

Does the explanation above make sense?

bertwesarg commented 6 years ago

James, thanks for the clarification. Though it is rather unexpected.

jswaro commented 6 years ago

> James, thanks for the clarification. Though it is rather unexpected.

I'm going to follow up with a colleague of mine. I know there are instances where libraries or applications use libugni in the manner that you are suggesting, but I suspect they use a different approach. What you are trying to accomplish might be possible.

jswaro commented 6 years ago

So, I'll retract my statement. There are perfectly valid cases for using libugni from multiple contexts, and multiple threads, but it is predicated on some assumptions. Libugni is thread-safe to the communication domain (a libugni construct).

Would you mind explaining to me how it is that you are using libugni in the main thread, and how you are using libfabric? Specifically, how are you initializing each of the different communication contexts (libugni vs libfabric)?

bertwesarg commented 6 years ago

> So, I'll retract my statement. There are perfectly valid cases for using libugni from multiple contexts, and multiple threads, but it is predicated on some assumptions. Libugni is thread-safe to the communication domain (a libugni construct).

That is very encouraging. Thanks for looking deeper into the issue.

> Would you mind explaining to me how it is that you are using libugni in the main thread, and how you are using libfabric? Specifically, how are you initializing each of the different communication contexts (libugni vs libfabric)?

I can't for the main thread; it's MPI_Init from cray-mpich/7.6.0, which uses ugni/6.0-1.0502.10863.8.29.ari. For our libfabric thread it's:

    struct fi_info* hints = fi_allocinfo();
    hints->mode               = FI_CONTEXT;
    hints->caps               = FI_TAGGED;
    hints->ep_attr->type      = FI_EP_RDM;
    hints->tx_attr->msg_order = FI_ORDER_SAS;
    hints->rx_attr->msg_order = FI_ORDER_SAS;

    hints->domain_attr->threading        = FI_THREAD_SAFE;
    hints->domain_attr->control_progress = FI_PROGRESS_AUTO;
    hints->domain_attr->data_progress    = FI_PROGRESS_AUTO;
    hints->domain_attr->resource_mgmt    = FI_RM_ENABLED;

    /* Get information about the fabric service to use */
    struct fi_info*    info;
    ret = fi_getinfo( FI_VERSION( 1, 0 ), NULL, NULL, 0ULL, hints, &info );
    fi_freeinfo( hints );

    /* open fabric provider */
    struct fid_fabric* fabric;
    ret = fi_fabric( info->fabric_attr, &fabric, NULL );

    /* open fabric access domain */
    struct fid_domain* domain;
    ret = fi_domain( fabric, info, &domain, NULL );

[ Error checking removed for clarity. ]

After that we create a completion queue, an address vector, and an endpoint. Do you need to see this too?
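In outline it is roughly this (a sketch only; I can post the exact code with all attributes and error checking if needed):

    /* completion queue for both transmit and receive completions */
    struct fi_cq_attr cq_attr = { 0 };
    cq_attr.format = FI_CQ_FORMAT_TAGGED;
    struct fid_cq* cq;
    ret = fi_cq_open( domain, &cq_attr, &cq, NULL );

    /* address vector holding the addresses of the other tool processes */
    struct fi_av_attr av_attr = { 0 };
    av_attr.type = FI_AV_MAP;
    struct fid_av* av;
    ret = fi_av_open( domain, &av_attr, &av, NULL );

    /* RDM endpoint, bound to the CQ and AV, then enabled */
    struct fid_ep* ep;
    ret = fi_endpoint( domain, info, &ep, NULL );
    ret = fi_ep_bind( ep, &cq->fid, FI_TRANSMIT | FI_RECV );
    ret = fi_ep_bind( ep, &av->fid, 0 );
    ret = fi_enable( ep );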

Thanks.

jswaro commented 6 years ago

> After that we create a completion queue, an address vector, and an endpoint. Do you need to see this too?

Not really. I just wanted to sanity check some things.

How does this application function? You said it runs some operations with libfabric and some with MPI. What is libfabric being used for? Could you attach a core dump?

JoZie commented 6 years ago

Here is the full back-trace of the error:

#0  0x00002aaaaccee875 in raise () from /lib64/libc.so.6
#1  0x00002aaaaccefe51 in abort () from /lib64/libc.so.6
#2  0x00002aaaaf9844ea in GNI_PostDataProbeById () from /opt/cray/ugni/default/lib64/libugni.so.0
#3  0x00002aaaabd27050 in MPID_nem_gni_datagram_poll () from /opt/cray/lib64/libmpich_gnu_51.so.3
#4  0x00002aaaabd23728 in MPID_nem_gni_poll () from /opt/cray/lib64/libmpich_gnu_51.so.3
#5  0x00002aaaabd01286 in MPIDI_CH3I_Progress () from /opt/cray/lib64/libmpich_gnu_51.so.3
#6  0x00002aaaabc0f67d in MPIR_Waitall_impl () from /opt/cray/lib64/libmpich_gnu_51.so.3
#7  0x00002aaaabc0fe96 in PMPI_Waitall () from /opt/cray/lib64/libmpich_gnu_51.so.3
#8  0x00002aaaaaf095cd in MPI_Waitall (count=26, requests=0x2f9b2b0, array_of_statuses=0x3050f20) at ../../build-mpi/../src/adapters/mpi/SCOREP_Mpi_P2p.c:1447
#9  0x00002aaaaaf2e26d in mpi_waitall_ (count=<optimized out>, array_of_requests=<optimized out>, array_of_statuses=<optimized out>, ierr=0x24c6e80 <__data_sbm_fd4_MOD_fd4err>) at ../../build-mpi/../src/adapters/mpi/SCOREP_Fmpi_P2p.c:1562
#10 0x000000000098dacb in fd4_couple_mod::fd4_couple_get (cpl=..., err=0, opt_novnull=<optimized out>) at ./framework/fd4_couple.F90:2927
#11 0x00000000007022e0 in src_runge_kutta::org_runge_kutta () at /zhome/academic/HLRS/kud/kudjzieg/vlive/testing/cosmo-specs+fd4_hrsk2/lmpar/src_cosmo/src_runge_kutta.f90:1456
#12 0x00000000008afd6e in organize_dynamics (yaction=<optimized out>, ierror=<optimized out>, yerrmsg=<optimized out>, dt_alter=<optimized out>, linit=<optimized out>, _yaction=<optimized out>, _yerrmsg=80)
    at /zhome/academic/HLRS/kud/kudjzieg/vlive/testing/cosmo-specs+fd4_hrsk2/lmpar/src_cosmo/organize_dynamics.f90:372 
#13 0x00000000008e8c87 in lmorg () at /zhome/academic/HLRS/kud/kudjzieg/vlive/testing/cosmo-specs+fd4_hrsk2/lmpar/src_cosmo/lmorg.f90:862
#14 0x000000000040415d in main (argc=<optimized out>, argv=<optimized out>) at /zhome/academic/HLRS/kud/kudjzieg/vlive/testing/cosmo-specs+fd4_hrsk2/lmpar/src_cosmo/lmorg.f90:164
#15 0x00002aaaaccdac36 in __libc_start_main () from /lib64/libc.so.6
#16 0x000000000040418d in _start () at ../sysdeps/x86_64/elf/start.S:113

We're developing a tool (frames 8, 9) that wraps MPI to record the function calls of an MPI application (frames 10-14). Libfabric provides the communication infrastructure for transferring the recorded data to other tool processes, where the processing is done.

I hope this gives you a vague impression of our application.

jswaro commented 6 years ago

It is certainly interesting. I have a vague idea where your application is crashing. Would you mind adding this to your aprun/srun?

UGNI_USE_LOGFILE=output.$ALPS_APP_PE UGNI_DEBUG=9

bertwesarg commented 6 years ago

> UGNI_USE_LOGFILE=output.$ALPS_APP_PE UGNI_DEBUG=9

does the $ALPS_APP_PE need to be quoted?

jswaro commented 6 years ago

It doesn't seem like it needs it, no.

bertwesarg commented 6 years ago

> Would you mind adding this to your aprun?

@JoZie will try it tonight. We are also trying to work around it with a mutex, i.e., we use manual libfabric progress, and in our MPI wrapper we take the lock around the PMPI calls, while the libfabric thread takes it around its calls into libfabric. :crossed_fingers:
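Roughly like the following sketch (placeholder names, not our actual wrapper code):

    /* one global lock serializing every path that ends up in libugni */
    static pthread_mutex_t gni_lock = PTHREAD_MUTEX_INITIALIZER;

    /* MPI wrapper (main thread): lock around the PMPI call */
    int MPI_Waitall( int count, MPI_Request reqs[], MPI_Status stats[] )
    {
        pthread_mutex_lock( &gni_lock );
        int ret = PMPI_Waitall( count, reqs, stats );
        pthread_mutex_unlock( &gni_lock );
        return ret;
    }

    /* libfabric thread: lock around manual progress / communication calls */
    static void progress_once( struct fid_cq* cq )
    {
        struct fi_cq_tagged_entry entry;
        pthread_mutex_lock( &gni_lock );
        ssize_t n = fi_cq_read( cq, &entry, 1 ); /* -FI_EAGAIN when empty */
        pthread_mutex_unlock( &gni_lock );
        ( void )n;
    }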

JoZie commented 6 years ago

I finally got a run with the uGNI logs active, but only with UGNI_DEBUG=4; I hope this is still helpful. The problem with higher debug levels is the amount of data that is generated (up to 25 GB per process), and I had to use many processes, since the bug becomes more likely with an increasing number of processes and longer run times.

Here are the output files of the erroneous process: oe00174.txt output.ugni_logfile.174.txt

hppritcha commented 6 years ago

I suspect you're getting a cdm id collision problem in uGNI's datagram path, although I thought the algorithm we're using in the GNI provider was different enough from that used in craypich that this problem would not happen.

jswaro commented 6 years ago

Howard: Is there a simple way for us to change the CDM ID generation for us to test this idea?

hppritcha commented 6 years ago

Hmm... actually, I'm not sure about the cdm id collision thing. That would have resulted in an error return from GNI_CdmAttach. I don't think any level of UGNI debug will help here, as the datagram code in the ugni library doesn't have many debug statements, if I recall correctly.

This approach should work, however. I suggest 2 things:

bertwesarg commented 6 years ago

> try setting MPICH_DYNAMIC_VCS to see if by getting craypich to set up all its VCs ahead of time, it stops invoking ugni datagram stuff in the progress loop.

We may try this. Though I have a question: do you think the problem arises because we are interfacing with UGNI from different threads, or because we are interfacing with UGNI through two different higher-level interfaces (i.e., MPI and libfabric)? CDM id collisions sound to me like they fall into the latter case. We are currently using a mutex to multiplex the interaction with UGNI between our two threads, and this seems to avoid the problem. But I can't imagine that a CDM id collision can be avoided just by using a mutex.

hppritcha commented 6 years ago

It may possibly be the former (accessing UGNI through different threads), although, as @jswaro pointed out, since you're using separate uGNI objects (cdm, ep's, cq's) for craypich and the OFI GNI provider, you should be okay. If you were hitting a problem owing to using uGNI through two different high-level interfaces, using the same GNI RDMA credentials, and using a similar scheme for generating CDM ids, you'd hit the id collision problem. But as I said above, if you were hitting that, you'd be getting a different error very near initialization.

I think we need someone with access to the relevant uGNI source code to look and see where abort is being called in the uGNI calls showing up in the traceback.

Hmmm... actually, since the datagram path is almost entirely in the kernel, you may also get better info by using strace, and by running dmesg on the nodes where the job was run. If we're lucky, the kGNI device driver may have logged something there.

jswaro commented 6 years ago

I just came back from conference. Sorry about the lack of response.

The abort is coming from GNI_PostDataProbeById, specifically from the ioctl where it attempts to post the dataprobe to the device through the kgni ioctl system. Given the error code reported by the fatal, it seems like it can't find the device based on what was provided. The device comes from the data embedded in the nic_handle, so perhaps the NIC handle is bad?

hppritcha commented 6 years ago

Interesting. You're probably right, @jswaro: the nic handle craypich is using has somehow gotten corrupted. Was craypich initialized with MPI_THREAD_MULTIPLE support?

hppritcha commented 6 years ago

Out of curiosity, how is the helper thread created? Is it done via a call to pthread_create?

bertwesarg commented 6 years ago

> Was craypich initialized with MPI_THREAD_MULTIPLE support?

The problem seems to be present with and without requesting MPI_THREAD_MULTIPLE support (including setting MPICH_MAX_THREAD_SAFETY=multiple). Our mutex between MPI and our libfabric thread also did not solve it completely (i.e., we obey MPI_THREAD_FUNNELED on the GNI level), and the serialization has other drawbacks too.

> Out of curiosity, how is the helper thread created? Is it done via a call to pthread_create?

Yes.

hppritcha commented 6 years ago

I'll take this and try to reproduce the problem with a simple test case, with and without using pthreads.

hppritcha commented 6 years ago

Quick question though: at the point you see the abort in the uGNI library, has the app already done some communication (send/recv, one-sided, etc.) using the libfabric API?

bertwesarg commented 6 years ago

> Quick question though: at the point you see the abort in the uGNI library, has the app already done some communication (send/recv, one-sided, etc.) using the libfabric API?

Definitely yes. We start the libfabric thread inside the MPI_Init/MPI_Init_thread wrapper after the PMPI call, and set up the communication with the processes outside of MPI_COMM_WORLD.
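Schematically it is something like this (a sketch; tool_fabric_main is a placeholder for our setup and communication routine):

    static pthread_t fabric_thread;

    /* placeholder: runs the libfabric setup from the earlier snippet and then
       the tagged send/recv traffic to the other tool processes */
    extern void* tool_fabric_main( void* arg );

    /* tool wrapper around MPI_Init_thread; the helper thread only ever calls
       libfabric, the application thread continues with MPI */
    int MPI_Init_thread( int* argc, char*** argv, int required, int* provided )
    {
        int ret = PMPI_Init_thread( argc, argv, required, provided );
        pthread_create( &fabric_thread, NULL, tool_fabric_main, NULL );
        return ret;
    }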

hppritcha commented 6 years ago

@bertwesarg which FI version are you feeding to fi_getinfo? Are you asking for 1.5 or an older version of the libfabric API?

bertwesarg commented 6 years ago

> @bertwesarg which FI version are you feeding to fi_getinfo? Are you asking for 1.5 or an older version of the libfabric API?

FI_VERSION( 1, 0 )

hppritcha commented 6 years ago

@bertwesarg if you're working off of the ofi-cray/libfabric-cray source, could you rerun your test? We're thinking #1411 may be relevant to what you're observing.

hppritcha commented 6 years ago

@bertwesarg first a heads up. You'll need to make sure you configure libfabric with

--with-kdreg=no

otherwise libfabric will fail in the call to fi_domain, especially if you're using a CLE 6 system.

That being said, I tested mixing Cray MPI with an app that uses the libfabric API and the GNI provider directly, and I could not reproduce your problem. I'd suggest retesting with the head of master for libfabric and seeing whether #1411 helps with the problem you're seeing.

JoZie commented 6 years ago

@hppritcha unfortunately we weren't able to test #1411 because, for the moment, we reverted back to an MPI-only solution. We thought this workaround worked; however, after some time the bug re-appeared even in the MPI-only runs. So it's unlikely that the effect is the result of an interaction between MPICH and libfabric on the uGNI level.

We suspect that libunwind somehow interferes at the uGNI level, but we are still working on a "minimal" example to reproduce this error.

jswaro commented 6 years ago

@JoZie What version of the gcc module are you using? If you are using a 7.x version, I've seen problems with it. You could try using gcc/6.1 or gcc/6.3.

Specifically, the problems that I have observed have been with libunwind.

JoZie commented 6 years ago

@jswaro Thanks for the advice! Last week I could continue exploring the bug. I came to the conclusion that our problems with libfabric and libunwind are separate bugs.

However, I did some digging in the open issues to see if there are related bugs and came across #1312. I also tried disabling the memory registration cache, which seems to fix the bug, but with this setting all applications are awfully slow. Setting it to udreg doesn't help either.

Now my goal is to use FI_LOCAL_MR and do the memory management myself, but my implementation doesn't work yet (some GNI_PostRdma failed: GNI_RC_INVALID_PARAM error). Is there a reference implementation for this anywhere besides the fi_pingpong one?
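For reference, what I'm attempting looks roughly like this (a sketch with placeholder buffers and addresses, error checking omitted), so maybe I'm misusing the descriptor somewhere:

    /* with FI_LOCAL_MR / FI_MR_LOCAL the buffer must be registered
       explicitly and its descriptor passed with every transfer */
    char      buf[4096];
    size_t    len       = sizeof( buf );
    fi_addr_t dest_addr = /* from fi_av_insert */ 0;
    uint64_t  tag       = 42;
    struct fi_context context;

    struct fid_mr* mr;
    ret = fi_mr_reg( domain, buf, len, FI_SEND | FI_RECV,
                     0 /* offset */, 0 /* requested_key */, 0 /* flags */,
                     &mr, NULL );

    ssize_t sret = fi_tsend( ep, buf, len, fi_mr_desc( mr ),
                             dest_addr, tag, &context );

    /* ... wait for the completion on the CQ before reusing the buffer ... */
    ret = fi_close( &mr->fid );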

jswaro commented 6 years ago

Do you have the capability to compile libfabric with kdreg support? If not, then the internal memory registration cache could fault and cause all sorts of issues.

Keep in mind that FI_LOCAL_MR is a deprecated flag as of libfabric version 1.5, whereas FI_MR_LOCAL is not. If you use FI_MR_LOCAL or FI_LOCAL_MR (pre-1.5), be sure to turn off the GNI provider caching code with the fabric ops. That should eliminate any code paths the provider might take to optimize registration. Given that you got an invalid param, I suspect the same code that was tripping you up without FI_LOCAL_MR is still in play until you disable the caching mechanisms.

JoZie commented 6 years ago

Thanks for the info! I cannot build with kdreg support since the header is missing, so I contacted my administrators.

It took me a while to get 1.5 with FI_MR_LOCAL to run. It doesn't work with MR_BASIC; you have to use all the values of the BASIC_MAP. But in the end the error is the same, although the mr_cache is disabled. And the UGNI_DEBUG=9 flag doesn't provide any useful output either.