score-p / scorep_binding_python

Allows tracing of python code using Score-P

Error with Score-P and TensorFlow #112

Open · anarazh opened 4 years ago

anarazh commented 4 years ago

Dear team,

I'm getting the following error when I run Score-P with a module for tracing python scripts:


2020-10-20 09:24:14.149317: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.00M (1048576 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149357: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 921.8K (943872 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149366: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 829.8K (849664 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149373: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 747.0K (764928 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context
2020-10-20 09:24:14.149380: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 672.5K (688640 bytes) from device: CUDA_ERROR_INVALID_CONTEXT: invalid device context


The error file grows very quickly and I end up killing the job. I use a custom Score-P build. The details of the environment setup are in the attached job script, and the error output is attached as well. Without Score-P, the application runs as expected, even without specifying the LD_PRELOAD for MPI.
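
For context, a minimal sketch of the kind of launch this refers to (the actual commands are in the attached job-example.txt; the module names, the script name train.py, and the --mpp=mpi flag are placeholders/assumptions, not taken from the job script):

```bash
# Hedged sketch only -- the real setup is in the attached job-example.txt.
# Module names and the script name train.py are placeholders.
module load mvapich2 cuda tensorflow

# Enable tracing in the Score-P measurement system.
export SCOREP_ENABLE_TRACING=true

# Launch the script through the Score-P Python bindings; --mpp=mpi assumes
# the MPI-enabled variant of the bindings is wanted for this MPI job.
srun python -m scorep --mpp=mpi train.py
```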

When I run Score-P with the LD_PRELOAD set, I get the following error instead:


[Score-P] src/adapters/mpi/SCOREP_Mpi_Env.c:230: Warning: MPI environment initialization request and provided level exceed MPI_THREAD_FUNNELED!
2020-10-19 10:56:13.384533: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494285000 Hz
[rc0003:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: rc0003: task 0: Segmentation fault


Would appreciate any feedback on this issue. Thanks in advance!

Anara

Attachments: err_example.txt, job-example.txt

AndreasGocht commented 4 years ago

Hi Anara,

thanks for reporting again on GitHub. As already stated: please rebuild Score-P after all the modules you need are loaded, to avoid conflicting MPI versions.
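
For reference, a hedged sketch of such a rebuild, assuming an autotools-style build of Score-P (module names and the install prefix are placeholders):

```bash
# Load the same modules the job will later use, so configure picks up the
# matching MPI and compiler toolchain (module names are placeholders).
module load mvapich2 cuda python

# Inside the unpacked Score-P source tree: configure and install into a
# user-writable prefix (placeholder path).
./configure --prefix=$HOME/opt/scorep
make -j
make install
```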

Best,

Andreas

anarazh commented 3 years ago

Hello Andreas,

I just rebuilt Score-P with the required modules loaded and I still get the same errors. Running a dummy python script with the python bindings works as usual.

AndreasGocht commented 3 years ago

Could you please try to install the latest python bindings? I made a small change to the LD_PRELOAD order.

But to be honest, I currently have no clue what's wrong here ...
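
For reference, a hedged sketch of how the bindings could be updated, assuming they were installed with pip (the PyPI package name scorep and the repository URL are the project's own):

```bash
# Update from PyPI ...
pip install --upgrade --user scorep

# ... or install the current development version directly from GitHub.
pip install --upgrade --user git+https://github.com/score-p/scorep_binding_python.git
```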

Best,

Andreas

anarazh commented 3 years ago

Here is a discussion of a similar problem: http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2020-September/007126.html and another one here: https://apps.fz-juelich.de/jsc/hps/jureca/known-issues.html#segmentation-faults-with-mvapich2

As mentioned, the application behaves as expected when LD_PRELOAD is unset and Score-P is not used.

I'm currently testing an MPI implementation other than MVAPICH and will let you know how it goes.

Best regards, Anara

AndreasGocht commented 3 years ago

Hey,

I'm currently testing an MPI implementation other than MVAPICH and will let you know how it goes.

this seems the most promising approach. Unsetting LD_PRELOAD would make it very difficult to trace the MPI communication.

Best,

Andreas

AndreasGocht commented 3 years ago

Thinking about this issue over the last few days, I had another idea that might help with debugging:

LD_DEBUG can be used to debug dynamic-linker and library issues (http://www.bnikolic.co.uk/blog/linux-ld-debug.html). Setting LD_DEBUG=all will show all library-related information. Could you set this before the call to your application, both with and without the Score-P Python bindings?
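
A hedged sketch of what that could look like (train.py and the log file names are placeholders; LD_DEBUG_OUTPUT is the glibc option that writes the very verbose output to per-process files instead of stderr):

```bash
# Run with the Score-P Python bindings, capturing dynamic-linker debug output.
LD_DEBUG=all LD_DEBUG_OUTPUT=ld_debug_scorep \
    srun python -m scorep --mpp=mpi train.py

# Run the same script without Score-P for comparison.
LD_DEBUG=all LD_DEBUG_OUTPUT=ld_debug_plain \
    srun python train.py
```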

Best,

Andreas

anarazh commented 3 years ago

Thank you Andreas,

The original machine has been under maintenance since November. I am using another machine with an OpenMPI installation, and it seems to work so far, though I still need to run more tests. I'll run a test with LD_DEBUG once I get back to the original machine, which should be in early 2021 but may be sooner.

Best, Anara

AndreasGocht commented 3 years ago

As long as it works for you, there's no hurry ;-).

Best,

Andreas