I have tested it on a Google Cloud machine with 8x V100, and it gives a similar issue:
2020-08-27 18:34:07.848853: E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:55]
********************************
Very slow compile? If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
For the GPU problem, I found out that JAX doesn't detect the GPUs because of a CUDA 11 problem with TensorFlow.
Still, I can't figure out what the problem with the TPU is.
I think it may just be taking a very long time to compile. (And potentially the tpu_driver has a time-out that's too short here).
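One way to confirm that the hang is compilation rather than execution (a minimal sketch with a toy jitted function, not the Trainer itself) is to time the first and second calls of a jax.jit-compiled function; only the first call pays the XLA compilation cost:

import time

import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    # Stand-in for one training step; compile time grows with model size.
    return jnp.tanh(x @ x.T).sum()

x = jnp.ones((1024, 1024))

t0 = time.time()
step(x).block_until_ready()  # first call: trace + XLA compile + run
print("first call (compile + run):", time.time() - t0)

t0 = time.time()
step(x).block_until_ready()  # second call: reuses the cached executable
print("second call (run only):", time.time() - t0)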
Thanks @jekbradbury for your reply. Is there a workaround for it?
I usually train much bigger models with TensorFlow and PyTorch without any issues. Is this a problem specific to JAX?
I have fixed the TensorFlow issue, but JAX still can't see the GPUs:
Python 3.6.10 |Anaconda, Inc.| (default, May 8 2020, 02:54:21)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-08-27 20:21:00.806963: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> physical_devices = tf.config.list_physical_devices('GPU')
2020-08-27 20:21:04.095189: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-27 20:21:04.548746: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.550269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-08-27 20:21:04.550419: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.551886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-08-27 20:21:04.551995: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.553494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties:
pciBusID: 0000:00:06.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-08-27 20:21:04.553588: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.555113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties:
pciBusID: 0000:00:07.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-08-27 20:21:04.555202: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.556676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 4 with properties:
pciBusID: 0000:00:08.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-08-27 20:21:04.556769: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.558275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 5 with properties:
pciBusID: 0000:00:09.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-08-27 20:21:04.558362: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.559834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 6 with properties:
pciBusID: 0000:00:0a.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-08-27 20:21:04.559919: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.561362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 7 with properties:
pciBusID: 0000:00:0b.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-08-27 20:21:04.561405: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-27 20:21:04.563455: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-08-27 20:21:04.565374: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-08-27 20:21:04.565677: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-08-27 20:21:04.567552: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-08-27 20:21:04.568606: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-08-27 20:21:04.572985: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-08-27 20:21:04.573151: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.574731: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.576168: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.577606: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.579063: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.580476: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.582068: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.583612: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.585144: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.586596: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.588028: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.589484: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.590972: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.592478: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.593962: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.595447: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-27 20:21:04.596795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
>>> from jax.lib import xla_bridge
>>> print(xla_bridge.get_backend().platform)
/home/ahmed/anaconda3/envs/trax/lib/python3.6/site-packages/jax/lib/xla_bridge.py:130: UserWarning: No GPU/TPU found, falling back to CPU.
warnings.warn('No GPU/TPU found, falling back to CPU.')
cpu
>>>
Any news about this issue?
If you just import jax and not TF, does jax see the GPUs?
No, it doesn't see the GPUs, but TensorFlow does.
My main issue here is the JAX compilation speed, which is very, very slow. I killed the process after leaving it for an hour.
Can you tell which line of code is causing the long compile?
For the GPU issue, how are you installing jax + jaxlib?
1) Can you tell which line of code is causing the long compile? The trainer:
trainer = trax.supervised.Trainer(
    model=trax.models.Reformer,
    loss_fn=trax.layers.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adam,
    lr_schedule=trax.lr.multifactor(),
    inputs=trax.data.inputs.Inputs(my_inputs),
    output_dir=output_dir)
I believe this triggers JAX compilation.
2) For the GPU issue, I figured out the problem. I was using: pip install --upgrade jax jaxlib, which doesn't support CUDA. I followed the README and JAX can now see the GPUs. I think the default JAX installation should support both TPUs and GPUs; that would be easier for users.
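As a quick sanity check after installing a CUDA-enabled jaxlib (a minimal sketch; it only lists what the default backend can see):

import jax

# With a CUDA-enabled jaxlib this should list one GPU device per V100;
# with the CPU-only wheel it falls back to a single CPU device.
print(jax.devices())
print("local device count:", jax.local_device_count())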
The very long compilation issue doesn't occur on GPU with the same command mentioned above.
I have tested the above script on GPUs and it works without any issue. The long compilation only occurs with TPUs.
Glad to hear you resolved the jaxlib issue! Unfortunately we cannot support GPUs by default since we need different jaxlibs for different CUDA versions. https://github.com/google/jax/pull/4065 should make this a little simpler at least.
For the long compilation on TPU, like @jekbradbury says, there might be an issue with the underlying TPU driver timing out. We're aware of this issue, but I don't know of any immediate workaround. Your best bet is probably to use GPU for now.
Thanks a lot @skye for your explanation and support. I will use the GPUs for now and I hope the TPU issue will be solved in the near future.
I am also getting the same error while trying to train the standard TransformerLM with slightly modified parameters (vocab_size=50000, max_len=1024). The same code works on the Colab + TPU setup, but when I try it with the GCE VM + Cloud TPU setup I get the same error.
It is now 12.01.22 and the problem still exists.
I have the same issue. I can run a BigBird model with a max_length of 1024, but compilation times out with a max_length of 2048.
We no longer support remote Cloud TPUs, which are the kind of TPU referenced in this issue. We only support Cloud TPU VMs, which have a very different runtime stack. You can get TPU VMs either from GCP or via Kaggle notebooks at the time of writing.
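For anyone checking a Cloud TPU VM today, the equivalent sanity check is the same idea (a minimal sketch, assuming JAX was installed with TPU support as described in the JAX README):

import jax

# On a v2-8 or v3-8 TPU VM this should list 8 TPU devices; if it only shows
# a CPU device, the TPU runtime was not picked up.
print(jax.devices())
print("device count:", jax.device_count())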
Hello,
I am trying to train a Reformer model using Trax and JAX. The training fails on Google Colab because of memory limitations. When I run it on a Google Cloud server + TPU, it hangs on "trax.supervised.Trainer".
The warning is as follows:
2020-08-26 17:46:37.421334: W external/org_tensorflow/tensorflow/compiler/xla/python/tpu_driver/client/tpu_client.cc:601] TPU Execute is taking a long time. This might be due to a deadlock between multiple TPU cores or a very slow program.
The code is very straightforward:
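(The snippet below is a reconstruction based on the Trainer call quoted earlier in the thread; my_inputs, its batch format, and output_dir are placeholders rather than the original code.)

import numpy as np
import trax

def my_inputs(n_devices):
    # Hypothetical placeholder stream; the real run used its own data pipeline.
    # Batch format and shapes must match the actual task.
    while True:
        batch = np.random.randint(0, 256, size=(n_devices, 2048), dtype=np.int32)
        yield (batch, batch, np.ones_like(batch, dtype=np.float32))

output_dir = '/tmp/reformer_output'  # placeholder checkpoint directory

# Constructing the Trainer is the step that triggers JAX/XLA compilation,
# and it is where the TPU run hangs.
trainer = trax.supervised.Trainer(
    model=trax.models.Reformer,
    loss_fn=trax.layers.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adam,
    lr_schedule=trax.lr.multifactor(),
    inputs=trax.data.inputs.Inputs(my_inputs),
    output_dir=output_dir)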
Any idea how I can solve this issue?