Open bertwesarg opened 6 years ago
You have the correct forum. libugni's restrictions and requirements are the responsibility of the provider and application. If the application and libfabric agree on a threading model, it is up to the provider to ensure the application can use said threading model based on the agreements defined in the threading model definition. I'd be interested to know what threading model was specified in the fi_getinfo call.
Based on your output though, I'd have to say that there are some aspects of these stack traces that don't seem valid to me. The main thread should not be calling libugni directly.
Edit: I misunderstood your original description. libugni is not written to be thread safe. You cannot use libfabric with the GNI provider side-by-side with libugni from a different context. There is a caveat to this, but it doesn't apply to your use case.
Does the explanation above make sense?
James, thanks for the clarification. Though it is rather unexpected.
> James, thanks for the clarification. Though it is rather unexpected.
I'm going to follow up with a colleague of mine. I know there are instances where libraries or applications use libugni in the manner that you are suggesting, but I suspect they use a different approach. What you are trying to accomplish might be possible.
So, I'll retract my statement. There are perfectly valid cases for using libugni from multiple contexts, and multiple threads, but it is predicated on some assumptions. Libugni is thread-safe to the communication domain (a libugni construct).
Would you mind explaining to me how it is that you are using libugni in the main thread, and how you are using libfabric? Specifically, how are you initializing each of the different communication contexts (libugni vs libfabric)?
> So, I'll retract my statement. There are perfectly valid cases for using libugni from multiple contexts, and multiple threads, but it is predicated on some assumptions. Libugni is thread-safe to the communication domain (a libugni construct).
That is very encouraging. Thanks for looking deeper into the issue.
> Would you mind explaining to me how it is that you are using libugni in the main thread, and how you are using libfabric? Specifically, how are you initializing each of the different communication contexts (libugni vs libfabric)?
I can't say for the main thread, it's `MPI_Init` from cray-mpich/7.6.0, which uses ugni/6.0-1.0502.10863.8.29.ari. For our libfabric thread it's:
```c
struct fi_info* hints = fi_allocinfo();
hints->mode = FI_CONTEXT;
hints->caps = FI_TAGGED;
hints->ep_attr->type = FI_EP_RDM;
hints->tx_attr->msg_order = FI_ORDER_SAS;
hints->rx_attr->msg_order = FI_ORDER_SAS;
hints->domain_attr->threading = FI_THREAD_SAFE;
hints->domain_attr->control_progress = FI_PROGRESS_AUTO;
hints->domain_attr->data_progress = FI_PROGRESS_AUTO;
hints->domain_attr->resource_mgmt = FI_RM_ENABLED;

/* Get information about the available fabric services */
struct fi_info* info;
ret = fi_getinfo( FI_VERSION( 1, 0 ), NULL, NULL, 0ULL, hints, &info );
fi_freeinfo( hints );

/* Open the fabric provider */
struct fid_fabric* fabric;
ret = fi_fabric( info->fabric_attr, &fabric, NULL );

/* Open the fabric access domain */
struct fid_domain* domain;
ret = fi_domain( fabric, info, &domain, NULL );
```

[ Error checking removed for clarity. ]
After that we create a completion queue, an address vector, and an endpoint. Do you need to see this too?
Thanks.
> After that we create a completion queue, an address vector, and an endpoint. Do you need to see this too?
Not really. I just wanted to sanity check some things.
How does this application function? You said it runs some operations with libfabric and some with MPI. What is libfabric being used for? Could you attach a core dump?
Here is the full back-trace of the error:
```
#0 0x00002aaaaccee875 in raise () from /lib64/libc.so.6
#1 0x00002aaaaccefe51 in abort () from /lib64/libc.so.6
#2 0x00002aaaaf9844ea in GNI_PostDataProbeById () from /opt/cray/ugni/default/lib64/libugni.so.0
#3 0x00002aaaabd27050 in MPID_nem_gni_datagram_poll () from /opt/cray/lib64/libmpich_gnu_51.so.3
#4 0x00002aaaabd23728 in MPID_nem_gni_poll () from /opt/cray/lib64/libmpich_gnu_51.so.3
#5 0x00002aaaabd01286 in MPIDI_CH3I_Progress () from /opt/cray/lib64/libmpich_gnu_51.so.3
#6 0x00002aaaabc0f67d in MPIR_Waitall_impl () from /opt/cray/lib64/libmpich_gnu_51.so.3
#7 0x00002aaaabc0fe96 in PMPI_Waitall () from /opt/cray/lib64/libmpich_gnu_51.so.3
#8 0x00002aaaaaf095cd in MPI_Waitall (count=26, requests=0x2f9b2b0, array_of_statuses=0x3050f20) at ../../build-mpi/../src/adapters/mpi/SCOREP_Mpi_P2p.c:1447
#9 0x00002aaaaaf2e26d in mpi_waitall_ (count=<optimized out>, array_of_requests=<optimized out>, array_of_statuses=<optimized out>, ierr=0x24c6e80 <__data_sbm_fd4_MOD_fd4err>) at ../../build-mpi/../src/adapters/mpi/SCOREP_Fmpi_P2p.c:1562
#10 0x000000000098dacb in fd4_couple_mod::fd4_couple_get (cpl=..., err=0, opt_novnull=<optimized out>) at ./framework/fd4_couple.F90:2927
#11 0x00000000007022e0 in src_runge_kutta::org_runge_kutta () at /zhome/academic/HLRS/kud/kudjzieg/vlive/testing/cosmo-specs+fd4_hrsk2/lmpar/src_cosmo/src_runge_kutta.f90:1456
#12 0x00000000008afd6e in organize_dynamics (yaction=<optimized out>, ierror=<optimized out>, yerrmsg=<optimized out>, dt_alter=<optimized out>, linit=<optimized out>, _yaction=<optimized out>, _yerrmsg=80)
    at /zhome/academic/HLRS/kud/kudjzieg/vlive/testing/cosmo-specs+fd4_hrsk2/lmpar/src_cosmo/organize_dynamics.f90:372
#13 0x00000000008e8c87 in lmorg () at /zhome/academic/HLRS/kud/kudjzieg/vlive/testing/cosmo-specs+fd4_hrsk2/lmpar/src_cosmo/lmorg.f90:862
#14 0x000000000040415d in main (argc=<optimized out>, argv=<optimized out>) at /zhome/academic/HLRS/kud/kudjzieg/vlive/testing/cosmo-specs+fd4_hrsk2/lmpar/src_cosmo/lmorg.f90:164
#15 0x00002aaaaccdac36 in __libc_start_main () from /lib64/libc.so.6
#16 0x000000000040418d in _start () at ../sysdeps/x86_64/elf/start.S:113
```
We're developing a tool (frames 8, 9) that wraps MPI to record function calls of an MPI application (frames 10–14). Libfabric provides the communication infrastructure for transferring the recorded data to other tool processes, where the processing is done.
I hope this gives you a vague impression of our application.
It is certainly interesting. I have a vague idea where your application is crashing. Would you mind adding this to your aprun/srun?
`UGNI_USE_LOGFILE=output.$ALPS_APP_PE UGNI_DEBUG=9`
> `UGNI_USE_LOGFILE=output.$ALPS_APP_PE UGNI_DEBUG=9`
Does the `$ALPS_APP_PE` need to be quoted?
It doesn't seem like it needs it, no.
Would you mind adding this to your aprun?
@JoZie will try it tonight. We are also trying to work around it with a mutex, i.e., manual libfabric progress, where we lock around the PMPI calls in our MPI wrapper and around the calls into libfabric in our libfabric thread. :crossed_fingers:
I finally got a run with the uGNI logs active, but only with UGNI_DEBUG=4. I hope this is still helpful. The problem with higher debug levels is the amount of data that is generated (up to 25GB per process). And I had to use many processes, since the occurrence of the bug becomes more likely with an increasing number of processes and longer run-times.
Here are the output files of the erroneous process: oe00174.txt output.ugni_logfile.174.txt
I suspect you're getting a cdm id collision problem in uGNI's datagram path, although I thought the algorithm we're using in the GNI provider was different enough from that used in craypich that this problem would not happen.
Howard: Is there a simple way for us to change the CDM ID generation for us to test this idea?
Hmm... actually, I'm not sure about the cdm id collision thing. That would have resulted in an error return from GNI_CdmAttach. I don't think any level of UGNI debug will help here, as the datagram code in the ugni library doesn't have many debug statements, if I recall correctly.
This approach should work, however. I suggest trying to set `MPICH_DYNAMIC_VCS` to see if, by getting craypich to set up all its VCs ahead of time, it stops invoking ugni datagram stuff in the progress loop.
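For instance, something like the following in the job script (the value `disabled` is an assumption on my part about which setting forces static VC setup; the intro_mpi man page on the system documents the exact values):

```shell
# Assumed job-script fragment: ask Cray MPICH to set up all VCs at startup
# instead of on demand ("disabled" is a guess at the accepted value;
# check the intro_mpi man page on your system).
export MPICH_DYNAMIC_VCS=disabled
aprun -n 64 ./app
```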
We may try this. Though I have a question: do you think the problem arises because we are interfacing with uGNI through different threads, or because we are interfacing with uGNI through two different higher-level interfaces (i.e., MPI and libfabric)? CDM id collisions sound to me like they fall into the latter case. Though we are currently using a mutex to multiplex the interaction with uGNI between our two threads, and this seems to avoid the problem. But I can't imagine that a CDM id collision can be avoided just by using a mutex.
It may possibly be the former - accessing UGNI through different threads, although as @jswaro pointed out, since you're using separate uGNI objects (cdm, ep's, cq's) for craypich and the OFI GNI provider, you should be okay. If you were hitting a problem owing to using uGNI through two different high level interfaces, and using the same GNI RDMA credentials, and using a similar scheme for generating CDM Ids, you'd hit the id collision problem. But as I said above, if you were hitting that, you'd be getting a different error very near initialization.
I think we need someone with access to the relevant uGNI source code to look and see where abort is being called in the uGNI calls showing up in the traceback.
Hmmm... actually, since the datagram path is almost entirely in the kernel, you may also get better info by using strace, and by running dmesg on the nodes where the job was run. If we're lucky, the kGNI device driver may have logged something there.
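For example, something along these lines (the launch wrapping and the node access are assumptions about your setup):

```shell
# Trace the ioctl()s that the datagram path issues to the kgni driver;
# -ff writes one output file per traced process (trace.<pid>).
aprun -n 64 strace -ff -e trace=ioctl -o trace ./app
# Afterwards, on the compute node(s) where the failing rank ran:
dmesg | tail -n 100
```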
I just came back from conference. Sorry about the lack of response.
The abort is coming from GNI_PostDataProbeById, specifically from the ioctl where it attempts to post the dataprobe to the device through the kgni ioctl system. Given the error code reported by the fatal, it seems like it can't find the device based on what was provided. The device comes from the data embedded in the nic_handle, so perhaps the NIC handle is bad?
interesting. You're probably right @jswaro the nic handle craypich is using has somehow gotten corrupted. Was craypich initialized with MPI_THREAD_MULTIPLE support?
Out of curiosity, how is the helper thread created? Is it done via a call to pthread_create?
> Was craypich initialized with MPI_THREAD_MULTIPLE support?
The problem seems to be present with and without requesting MPI_THREAD_MULTIPLE support (including setting `MPICH_MAX_THREAD_SAFETY=multiple`). Our mutex between MPI and our libfabric thread also did not solve it completely (i.e., we obey MPI_THREAD_FUNNELED at the GNI level), and the serialization has other drawbacks too.
> Out of curiosity, how is the helper thread created? Is it done via a call to pthread_create?
Yes.
I'll take this and try to reproduce the problem with a simple test case, with and without using pthreads.
Quick question though: at the point you see the abort in the uGNI library, has the app already done some communication (send/recv, one-sided, etc.) using the libfabric API?
> Quick question though: at the point you see the abort in the uGNI library, has the app already done some communication (send/recv, one-sided, etc.) using the libfabric API?
Definitely yes. We start the libfabric thread inside the `MPI_Init`/`MPI_Init_thread` wrapper after the PMPI call, and set up the communication with the processes outside of `MPI_COMM_WORLD`.
@bertwesarg which FI version are you feeding to fi_getinfo? Are you asking for 1.5 or an older version of the libfabric API?
> @bertwesarg which FI version are you feeding to fi_getinfo? Are you asking for 1.5 or an older version of the libfabric API?
`FI_VERSION( 1, 0 )`
@bertwesarg if you're working off of the ofi-cray/libfabric-cray source, could you rerun your test? We're thinking #1411 may be relevant to what you're observing.
@bertwesarg first a heads up: you'll need to make sure to configure libfabric with `--with-kdreg=no`, otherwise libfabric will fail in the call to fi_domain, esp. if you're using a CLE 6 system.
That being said, I tested mixing Cray MPI with an app which uses the libfabric api and GNI provider directly and could not reproduce your problem. I'd suggest retesting using head of master for libfabric and see if #1411 has helped with the problem you're seeing.
@hppritcha unfortunately we weren't able to test #1411 because, for the moment, we reverted back to an MPI-only solution. We thought this workaround worked; however, after some time the bug re-appeared. So it's unlikely that the effect is the result of an interaction between MPICH and libfabric at the uGNI level.
We suspect that libunwind somehow interferes at the uGNI level. But we are still working on a "minimal" example to reproduce this error.
@JoZie What version of the gcc module are you using? If you are using a 7.x version, I've seen issues with it. You could try using gcc/6.1 or gcc/6.3.
Specifically, the problems that I have observed have been with libunwind.
@jswaro Thanks for the advice! Last week I could continue exploring the bug. I came to the conclusion that our problems with libfabric and libunwind are separate bugs.
However, I did some digging in the open issues to see if there are related bugs and came across #1312. I also tried disabling the memory registration cache, which seems to fix the bug. But with this setting all applications are awfully slow. Setting it to `udreg` doesn't help either.
Now my goal is to use FI_LOCAL_MR and do the memory management myself. But my implementation doesn't work yet (some `GNI_PostRdma failed: GNI_RC_INVALID_PARAM` error). Is there a reference implementation for this somewhere, other than the fi_pingpong one?
Do you have the capability to compile libfabric with kdreg support? If not, then the internal memory registration cache could fault and cause all sorts of issues.
Keep in mind that FI_LOCAL_MR is a deprecated flag as of libfabric 1.5; FI_MR_LOCAL, however, is not. Whether you use FI_MR_LOCAL or FI_LOCAL_MR (pre-1.5), turn off the GNI provider caching code via the fabric ops. That should eliminate any code paths the provider might take to optimize registration. Given you have an invalid param, I suspect the same code that was tripping you up without FI_LOCAL_MR is still present until you disable the caching mechanisms.
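As a rough illustration of the explicit-registration flow under FI_MR_LOCAL (this is a fragment, not a reference implementation: `domain`, `ep`, `dest_addr`, and `tag` are assumed to come from your existing setup code, and all error checking is elided):

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_tagged.h>

/* Fragment only: domain/ep/dest_addr/tag come from the setup code shown
 * earlier in this thread; error checking is elided. */
static void send_registered(struct fid_domain *domain, struct fid_ep *ep,
                            fi_addr_t dest_addr, uint64_t tag)
{
    static char buf[4096];
    struct fid_mr *mr;

    /* With FI_MR_LOCAL, register every local buffer before use;
     * requested_key 0 lets the provider pick a key. */
    fi_mr_reg(domain, buf, sizeof buf, FI_SEND | FI_RECV,
              0 /* offset */, 0 /* requested_key */, 0 /* flags */,
              &mr, NULL);

    /* Pass the registration descriptor along with the transfer call. */
    fi_tsend(ep, buf, sizeof buf, fi_mr_desc(mr), dest_addr, tag, NULL);

    /* ... reap the send completion from the CQ before deregistering ... */
    fi_close(&mr->fid);
}
```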
Thanks for the info! I cannot build with kdreg support since the header is missing, so I contacted my administrators.
It took me a while to get 1.5 with FI_MR_LOCAL to run. It doesn't work with MR_BASIC; you have to use all the values of the BASIC_MAP. But in the end the error is the same, although the mr_cache is disabled. And the UGNI_DEBUG=9 flag doesn't provide any useful output either.
Dear all,
I'm not sure if this is the right forum, but anyway:

We would like to use libfabric with the GNI provider from inside an MPI application which uses MPICH/GNI on a Cray XC40 platform. But we have the impression that this does not play well together regarding threads. We start our own thread for doing just libfabric calls but no MPI calls; the reverse holds for the main thread. But we get `abort()`s from inside the libugni library when the main thread does MPI calls. Here are two examples:

Is libugni prepared for this kind of usage at all?

Thanks.