tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Check failed: ret == 0 (11 vs. 0)Thread tf_data_private_threadpool creation via pthread_create() failed. Aborted (core dumped) #64681

Open Wasim04 opened 5 months ago

Wasim04 commented 5 months ago

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

2.15.0.post1

Custom code

Yes

OS platform and distribution

Rocky Linux 8.9

Mobile device

No response

Python version

3.10.12

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

8.9.2.26

GPU model and memory

NVIDIA A100-SXM4-80GB

Current behavior?

I am trying to run a simple denoising autoencoder. My training data and label data are 900 samples of healpy maps at nside 64 resolution, loaded as NumPy arrays. After normalising the maps, I used tf.data.Dataset.from_tensor_slices to create the dataset. When I generated these maps from random noise and ran the model in a Jupyter notebook, training took a very long time to start after calling model.fit(), but it did run and produced some results. Knowing that the model works, I tried to run it on a GPU with real data, and this is where the issue started. The process stops with the following error: Check failed: ret == 0 (11 vs. 0)Thread tf_data_private_threadpool creation via pthread_create() failed. Aborted (core dumped)
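
For anyone hitting the same check failure: the 11 in "ret == 0 (11 vs. 0)" is the value returned by pthread_create(), and on Linux 11 is EAGAIN, which pthread_create() returns when a thread cannot be created because of insufficient resources or a system-imposed thread limit. The sketch below is only a diagnostic aid, not a confirmed fix for this issue. It uses the Python standard library on a Unix-like system to print the per-user process/thread limit (RLIMIT_NPROC) next to the CPU count. The helper names is_eagain and thread_limits are made up for illustration, and whether this limit is what the reporter's machine is hitting is an assumption; a cgroup or scheduler-imposed cap could produce the same error.

```python
import errno
import os
import resource  # Unix only; not available on Windows
import threading


def is_eagain(code: int) -> bool:
    """Return True if an error number is EAGAIN on this platform."""
    return code == errno.EAGAIN


def thread_limits() -> dict:
    """Collect a few numbers that are useful when thread creation fails.

    RLIMIT_NPROC caps the number of processes (on Linux, threads count
    too) for the current user. resource.RLIM_INFINITY means "no limit".
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "nproc_soft": soft,
        "nproc_hard": hard,
        "cpu_count": os.cpu_count(),
        "python_threads": threading.active_count(),
    }


if __name__ == "__main__":
    # 11 is the value reported in the log line of this issue.
    print("11 is EAGAIN here:", is_eagain(11))
    print("EAGAIN means:", os.strerror(errno.EAGAIN))
    for key, value in thread_limits().items():
        shown = "unlimited" if value == resource.RLIM_INFINITY else value
        print(f"{key}: {shown}")
```

Running this in the same terminal session (or batch job) that crashes, and comparing the output with a session where the code works, may show whether the failing environment has a much lower nproc_soft value relative to cpu_count.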

Standalone code to reproduce the issue

Here is a colab link:

https://colab.research.google.com/drive/1odbf3gQT9h-DI6zdh5e1llmG9zhjY45l?usp=sharing

It runs on Colab, but it doesn't run in the terminal.

Relevant log output

2024-03-27 12:14:07.288975: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_data_private_threadpool creation via pthread_create() failed.
Aborted (core dumped)

tilakrayal commented 5 months ago

@Wasim04, I am trying to execute the mentioned code in both GPU and CPU environments, and it is taking longer to run than expected. Could you please allow us some time to analyse the issue and provide an update? Thank you!

Wasim04 commented 5 months ago

@tilakrayal Thanks for your response. Please take your time.

Wasim04 commented 4 months ago

@tilakrayal Hi, just wondering if you had a chance to look into this issue?

OlivierBelan commented 2 months ago

Same issue here, but not with tensorflow - the error seems to come from jax==0.4.30