anarazh opened this issue 4 years ago
Hi Anara,
thanks for reporting again on GitHub. As already stated: please rebuild Score-P after all the modules you need are loaded, to avoid conflicting MPI versions.
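For illustration, a minimal sketch of the intended build order (the module names and install prefix are placeholders for whatever your site provides):

```bash
# Load the exact modules the application will use at run time
# *before* configuring Score-P, so configure picks up the matching MPI.
module load mvapich2 cuda python    # placeholder module names

# Then rebuild and reinstall Score-P against those modules.
./configure --prefix=$HOME/opt/scorep
make -j
make install
```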
Best,
Andreas
Hello Andreas,
I just rebuilt Score-P with the required modules loaded, and I still get the same errors. Running a dummy Python script with the Python bindings works as usual.
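For reference, the smoke test meant here looks roughly like this (the script name is a placeholder; this assumes the Score-P Python bindings are installed):

```bash
# Trace a dummy script with the Score-P Python bindings;
# this runs fine, while the real MPI application does not.
python -m scorep hello.py
```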
Could you please try to install the latest Python bindings? I made a small change to the LD_PRELOAD order.
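One way to pick up the latest bindings, assuming they were installed as the scorep package via pip (adjust if you install from the GitHub repository instead):

```bash
# Upgrade to the latest release of the bindings; they build
# against the Score-P installation found in the environment.
pip install --upgrade scorep
```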
But to be honest, I currently have no clue what's wrong here ...
Best,
Andreas
Here is a discussion of a similar problem: http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2020-September/007126.html, and here is a related known-issues entry: https://apps.fz-juelich.de/jsc/hps/jureca/known-issues.html#segmentation-faults-with-mvapich2
As mentioned, the application behaves as expected when LD_PRELOAD is unset and Score-P is not used.
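Concretely, the two cases compare roughly like this (the script name and the exact launcher invocation are placeholders for what the job script actually does):

```bash
# Plain run, no preloading: behaves as expected.
unset LD_PRELOAD
srun python train.py

# Same application under the Score-P Python bindings: fails as described.
srun python -m scorep --mpp=mpi train.py
```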
I'm currently testing an MPI implementation other than MVAPICH and will let you know how it goes.
Best regards, Anara
Hey,

> I'm currently testing an MPI implementation other than MVAPICH and will let you know how it goes.

this seems the most promising approach. Unsetting LD_PRELOAD would make it very difficult to trace the MPI communication.
Best,
Andreas
Thinking about this issue over the last few days, I had another idea that might help with debugging: LD_DEBUG can be used to debug library issues (http://www.bnikolic.co.uk/blog/linux-ld-debug.html). Setting LD_DEBUG=all will show all library-related information. Can you set this before the call to your application, both with and without the Score-P Python bindings?
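A minimal sketch of the requested runs (the application command and log file names are placeholders; LD_DEBUG output goes to stderr, so redirect it):

```bash
# Without the Score-P Python bindings:
LD_DEBUG=all srun python train.py 2> ld_debug_plain.log

# With the Score-P Python bindings:
LD_DEBUG=all srun python -m scorep --mpp=mpi train.py 2> ld_debug_scorep.log

# Alternatively, setting LD_DEBUG_OUTPUT=<file> makes the loader write
# to <file>.<pid>, giving one log per process, which is handy under MPI.
```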
Best,
Andreas
Thank you Andreas,
The original machine has been under maintenance since November. I am using another machine with an OpenMPI implementation, and it seems to work so far; I still need to run more tests, though. I'll run a test with LD_DEBUG once I get back to the original machine, which should be early 2021 but may be sooner.
Best, Anara
As long as it works for you, no hurry ;-).
Best,
Andreas
Dear team,
I'm getting the following error when I run Score-P with the Python module for tracing Python scripts:
```
2020-10-20 09:24:14.149317: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.00M (1048576 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149357: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 921.8K (943872 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149366: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 829.8K (849664 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149373: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 747.0K (764928 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149380: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 672.5K (688640 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
```
The error file grows very quickly, and I end up killing the job. I use a custom Score-P build. The details of the environment setup are in the attached job script, and the error output is attached as well. Without Score-P, the application runs as expected, even without setting LD_PRELOAD for MPI.
When I run Score-P with LD_PRELOAD set, I get the following error instead:
```
[Score-P] src/adapters/mpi/SCOREP_Mpi_Env.c:230: Warning: MPI environment initialization request and provided level exceed MPI_THREAD_FUNNELED!
2020-10-19 10:56:13.384533: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494285000 Hz
[rc0003:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: rc0003: task 0: Segmentation fault
```
I would appreciate any feedback on this issue. Thanks in advance!
Anara

Attachments: err_example.txt, job-example.txt