Closed danyaljj closed 4 years ago
It looks like it's falling back to CPU for some reason. Do you see anything earlier in the logs that might tell you why?
Hmm ... here is the top command output right before it crashes, which shows that the program is using lots of CPU:
28352 danielk 20 0 40.123g 0.022t 464504 S 789.7 36.3 7:15.07 t5_mesh_transfo
However, there are several GPU threads too, which baffles me (some of these might be stale, from previous tries):
In between the logs I see: Current candidate devices are [ /job:localhost/replica:0/task:0/device:CPU:0] and Assign: CPU.
Extended log:
2020-01-08 16:36:54.382904: W tensorflow/core/common_runtime/colocation_graph.cc:983] Failed to place the graph without changing the devices of some resources. Some of the operations
(that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:1' assigned_device_name_='' resource_device_name_='/device:GPU:1' supported_device_types_=[CPU] possible_devices_=[]
Assign: CPU
RandomUniform: CPU XLA_CPU XLA_GPU
Const: CPU XLA_CPU XLA_GPU
Mul: CPU XLA_CPU XLA_GPU
Sub: CPU XLA_CPU XLA_GPU
Add: CPU XLA_CPU XLA_GPU
Identity: CPU XLA_CPU XLA_GPU
VariableV2: CPU
You can see from the errors you're getting that the ops were placed on CPU instead of GPU. For example:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Assign: CPU
I saw this happen when I was testing and it seemed to be due to issues with my CUDA setup. I was on a VM so I basically started from scratch and got it to work.
I think it would be useful to see what's logged before you get to those colocations.
Log right at the beginning:
$ t5_mesh_transformer --model_dir="danielk-files/models" --t5_tfds_data_dir="danielk-files" --gin_file="dataset.gin" --gin_param="utils.run.mesh_shape = 'model:1,batch:1'" --gin_param="utils.run.mesh_devices = ['gpu:1']" --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin" --gin_param="batch_size=2"
WARNING:tensorflow:From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
INFO:tensorflow:model_type=bitransformer
I0108 16:43:30.913287 139962466866944 utils.py:1664] model_type=bitransformer
INFO:tensorflow:mode=train
I0108 16:43:30.913425 139962466866944 utils.py:1665] mode=train
INFO:tensorflow:sequence_length={'inputs': 512, 'targets': 512}
I0108 16:43:30.913482 139962466866944 utils.py:1666] sequence_length={'inputs': 512, 'targets': 512}
INFO:tensorflow:batch_size=2048
I0108 16:43:30.913529 139962466866944 utils.py:1667] batch_size=2048
INFO:tensorflow:train_steps=1000000000
I0108 16:43:30.913570 139962466866944 utils.py:1668] train_steps=1000000000
INFO:tensorflow:mesh_shape=model:1,batch:1
I0108 16:43:30.913610 139962466866944 utils.py:1669] mesh_shape=model:1,batch:1
INFO:tensorflow:layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
I0108 16:43:30.913662 139962466866944 utils.py:1670] layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
INFO:tensorflow:Building TPUConfig with tpu_job_name=None
I0108 16:43:30.913738 139962466866944 utils.py:1685] Building TPUConfig with tpu_job_name=None
INFO:tensorflow:Using config: {'_model_dir': 'danielk-files/models', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4abdf4b6a0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
I0108 16:43:30.916487 139962466866944 estimator.py:212] Using config: {'_model_dir': 'danielk-files/models', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4abdf4b6a0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
I0108 16:43:30.916929 139962466866944 tpu_context.py:220] _TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
W0108 16:43:30.916996 139962466866944 tpu_context.py:222] eval_on_tpu ignored because use_tpu is False.
WARNING:tensorflow:From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0108 16:43:30.924340 139962466866944 deprecation.py:506] From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0108 16:43:30.924594 139962466866944 deprecation.py:323] From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
I0108 16:43:30.931804 139962466866944 dataset_builder.py:193] Overwrite dataset info from restored data version.
I0108 16:43:31.075816 139962466866944 dataset_builder.py:193] Overwrite dataset info from restored data version.
I0108 16:43:31.082559 139962466866944 dataset_builder.py:273] Reusing dataset glue (danielk-files/glue/mrpc/0.0.2)
I0108 16:43:31.083067 139962466866944 dataset_builder.py:434] Constructing tf.data.Dataset for split train, from danielk-files/glue/mrpc/0.0.2
2020-01-08 16:43:31.941356: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-08 16:43:31.995155: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-08 16:43:31.997136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:02:00.0
2020-01-08 16:43:31.997226: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2020-01-08 16:43:31.997275: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2020-01-08 16:43:31.997402: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2020-01-08 16:43:31.997453: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2020-01-08 16:43:31.997592: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2020-01-08 16:43:31.997642: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2020-01-08 16:43:32.042586: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-08 16:43:32.042640: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
WARNING:tensorflow:From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/mesh_tensorflow/transformer/dataset.py:513: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
W0108 16:43:33.304034 139962466866944 deprecation.py:323] From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/mesh_tensorflow/transformer/dataset.py:513: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
INFO:tensorflow:Calling model_fn.
I0108 16:43:34.641363 139962466866944 estimator.py:1148] Calling model_fn.
INFO:tensorflow:Running train on CPU
I0108 16:43:34.641528 139962466866944 tpu_estimator.py:3124] Running train on CPU
INFO:tensorflow:feature inputs : Tensor("Reshape:0", shape=(1, 2048, 512), dtype=int32)
I0108 16:43:34.642672 139962466866944 utils.py:374] feature inputs : Tensor("Reshape:0", shape=(1, 2048, 512), dtype=int32)
WARNING:tensorflow:From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py:376: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:
W0108 16:43:34.642765 139962466866944 deprecation.py:323] From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py:376: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
I see some warnings like libcudart.so.10.0: cannot open shared object file: No such file or directory. Could that be it?
Yes, this is the problem I had as well. It may be because you have 10.1 installed...
ah okay!
Could you elaborate on this:
It may be because you have 10.1 installed...
Btw:
$ cat /usr/local/cuda/version.txt
CUDA Version 9.0.176
It's looking for libcu*.so.10.0, but (according to your nvidia-smi printout, at least) you have v10.1, which probably names the files libcu*.so.10.1.
Have a look at https://github.com/tensorflow/tensorflow/issues/26182
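To check which CUDA runtime the dynamic loader can actually resolve on your machine, here is a small standard-library sketch. It mirrors what TensorFlow's dso_loader attempts at startup; the two library names are taken from the warnings above (TF 1.15/2.0 builds look for .so.10.0, TF 2.1 for .so.10.1):

```python
import ctypes

# Try to dlopen the CUDA runtime versions TensorFlow links against.
# Whichever of these loads is the runtime TF could bind to; if neither
# loads, TF will skip GPU registration exactly as in the log above.
for name in ("libcudart.so.10.0", "libcudart.so.10.1"):
    try:
        ctypes.CDLL(name)
        print(name, "-> found")
    except OSError:
        print(name, "-> NOT found")
```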
Thanks!
FYI, it looks like TF 2.1.0 is compatible with CUDA 10.1 according to https://www.tensorflow.org/install/source#tested_build_configurations
FYI, it looks like TF 2.1.0 is compatible with CUDA 10.1 according to ...
This is a bit tricky because t5 has an explicit requirement on earlier tensorflow versions.
For those who have the same issue: I used a conda environment and installed the following packages:
conda install cudatoolkit
conda install cudnn
and
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/YOURUSERNAME/anaconda3/pkgs/cudatoolkit-10.X.Y-Z/lib/
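Rather than hard-coding the pkgs/cudatoolkit-10.X.Y-Z path, conda's cudatoolkit usually also places its shared libraries in the active environment's own lib/ directory (this layout can vary between conda versions, so treat it as an assumption). A small sketch that prints the candidate path to put on LD_LIBRARY_PATH:

```python
import os
import sys

# sys.prefix points at the active conda environment; cudatoolkit
# typically installs its shared libraries into <env>/lib.
lib_dir = os.path.join(sys.prefix, "lib")
print("candidate LD_LIBRARY_PATH entry:", lib_dir)

# List any CUDA runtime libraries already visible there.
if os.path.isdir(lib_dir):
    cudart = sorted(f for f in os.listdir(lib_dir) if f.startswith("libcudart"))
    print("libcudart files:", cudart or "none found")
```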
Now after starting the code I see:
2020-01-09 10:56:27.543353: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-09 10:56:27.544165: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-09 10:56:27.544862: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-01-09 10:56:27.545038: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-01-09 10:56:27.545970: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-01-09 10:56:27.546671: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-01-09 10:56:27.548886: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
I tried training the code on a GPU after including the changes made earlier today, and I am running into a memory issue. Just after the
2020-01-08 16:11:33.715292: I tensorflow/compiler/jit/xla_compilation_cache.cc:238] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
message, the program crashes with a Killed message. Here is an extended log for your attention:
My GPU specs:
Memory info:
and pip packages:
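On the Killed crash: a bare "Killed" with no Python traceback usually means the Linux OOM killer terminated the process for exhausting host RAM, not GPU memory (the top output earlier already showed ~22 GB resident). A quick Linux-only sketch to check available host memory before launching training:

```python
# Read host memory stats from /proc/meminfo (Linux only). If
# MemAvailable is far below what training needs, the OOM killer is
# the likely source of the bare "Killed" message.
def meminfo_mib(field):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1]) // 1024  # kB -> MiB
    return None

print("MemTotal:    ", meminfo_mib("MemTotal"), "MiB")
print("MemAvailable:", meminfo_mib("MemAvailable"), "MiB")
```

If available memory is the bottleneck, lowering batch_size or the sequence lengths via gin params (as in the command quoted earlier) is the usual mitigation, though I have not verified which setting dominates here.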