tensorflow / models

Models and examples built with TensorFlow

TPU not working with Object Detection 2.0 #9059

Open ya0002 opened 4 years ago

ya0002 commented 4 years ago

I have used the Object Detection API successfully with a GPU, but I am running into issues with the TPU provided by Colab. I used this Colab notebook.

ISSUE

When I run

!python /content/models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path={pipeline_file} \
    --model_dir={model_dir} \
    --alsologtostderr \
    --num_train_steps={num_steps} \
    --sample_1_of_n_eval_examples=1 \
    --num_eval_steps={num_eval_steps} \
    --use_tpu=True

I get

2020-08-06 09:35:44.046378: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-06 09:35:46.345932: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-06 09:35:46.349674: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-08-06 09:35:46.349713: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (807938737dd9): /proc/driver/nvidia/version does not exist
2020-08-06 09:35:46.356201: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2250000000 Hz
2020-08-06 09:35:46.356482: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2cdb480 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-06 09:35:46.356514: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-08-06 09:35:46.361480: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.19.81.18:8470}
2020-08-06 09:35:46.361526: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30597}
2020-08-06 09:35:46.378403: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.19.81.18:8470}
2020-08-06 09:35:46.378464: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30597}
2020-08-06 09:35:46.378919: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:30597
I0806 09:35:46.379770 139847131113344 remote.py:218] Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0
INFO:tensorflow:Initializing the TPU system: grpc://10.19.81.18:8470
I0806 09:35:46.382065 139847131113344 tpu_strategy_util.py:73] Initializing the TPU system: grpc://10.19.81.18:8470
INFO:tensorflow:Clearing out eager caches
I0806 09:35:58.021672 139847131113344 tpu_strategy_util.py:108] Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.
I0806 09:35:58.023661 139847131113344 tpu_strategy_util.py:131] Finished initializing TPU system.
W0806 09:35:58.024306 139847131113344 tpu_strategy.py:320] `tf.distribute.experimental.TPUStrategy` is deprecated, please use  the non experimental symbol `tf.distribute.TPUStrategy` instead.
INFO:tensorflow:Found TPU system:
I0806 09:35:58.025231 139847131113344 tpu_system_metadata.py:159] Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
I0806 09:35:58.025376 139847131113344 tpu_system_metadata.py:160] *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
I0806 09:35:58.026053 139847131113344 tpu_system_metadata.py:161] *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
I0806 09:35:58.026173 139847131113344 tpu_system_metadata.py:163] *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I0806 09:35:58.026299 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I0806 09:35:58.026766 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I0806 09:35:58.026932 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
I0806 09:35:58.027048 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
I0806 09:35:58.027171 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
I0806 09:35:58.027285 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
I0806 09:35:58.027393 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
I0806 09:35:58.027498 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
I0806 09:35:58.027603 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
I0806 09:35:58.027709 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
I0806 09:35:58.027831 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
I0806 09:35:58.027937 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I0806 09:35:58.028042 139847131113344 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
INFO:tensorflow:Maybe overwriting train_steps: 50000
I0806 09:35:58.032555 139847131113344 config_util.py:552] Maybe overwriting train_steps: 50000
INFO:tensorflow:Maybe overwriting use_bfloat16: True
I0806 09:35:58.032708 139847131113344 config_util.py:552] Maybe overwriting use_bfloat16: True
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W0806 09:35:58.100114 139847131113344 dataset_builder.py:83] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
W0806 09:35:58.104112 139847131113344 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/object_detection/builders/dataset_builder.py:175: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
W0806 09:35:58.118720 139847131113344 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/object_detection/builders/dataset_builder.py:175: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0806 09:36:05.840581 139847131113344 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
`seed2` arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
W0806 09:36:09.165812 139847131113344 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
`seed2` arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/object_detection/inputs.py:259: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0806 09:36:11.185652 139847131113344 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/object_detection/inputs.py:259: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
Traceback (most recent call last):
  File "/content/models/research/object_detection/model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/content/models/research/object_detection/model_main_tf2.py", line 110, in main
    record_summaries=FLAGS.record_summaries)
  File "/usr/local/lib/python3.6/dist-packages/object_detection/model_lib_v2.py", line 561, in train_loop
    unpad_groundtruth_tensors)
  File "/usr/local/lib/python3.6/dist-packages/object_detection/model_lib_v2.py", line 342, in load_fine_tune_checkpoint
    features, labels = iter(input_dataset).next()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1199, in __iter__
    enable_legacy_iterators)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1752, in _create_iterators_per_worker
    worker_devices)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1609, in __init__
    devices)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1448, in __init__
    self._make_iterator()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1619, in _make_iterator
    self._dataset, self._devices, source_device=host_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 547, in __init__
    dataset.element_spec)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 54, in __init__
    init_func_concrete = _init_func.get_concrete_function()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2939, in get_concrete_function
    *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2906, in _get_concrete_function_garbage_collected
    graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3213, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3075, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 991, in func_graph_from_py_func
    expand_composites=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/nest.py", line 635, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/nest.py", line 635, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 942, in convert
    x = ops.convert_to_tensor_or_composite(x)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1622, in convert_to_tensor_or_composite
    value=value, dtype=dtype, name=name, as_ref=False)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1661, in internal_convert_to_tensor_or_composite
    accepted_result_types=(Tensor, composite_tensor.CompositeTensor))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1467, in convert_to_tensor
    return graph.capture(value, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 624, in capture
    return self.capture_eager_tensor(tensor, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 721, in capture_eager_tensor
    graph_const = constant_op.constant(tensor.numpy(), dtype=tensor.dtype,
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1063, in numpy
    maybe_arr = self._numpy()  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1031, in _numpy
    six.raise_from(core._status_to_exception(e.code, e.message), None)  # pylint: disable=protected-access
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme '[local]' not implemented (file: '/content/training/train')
    Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/tpu_strategy.py", line 540, in async_wait
    context.async_wait()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 2319, in async_wait
    context().sync_executors()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 658, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme '[local]' not implemented (file: '/content/training/train')
    Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
2020-08-06 09:36:13.779208: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find the relevant tensor remote_handle: Op ID: 227, Output num: 0
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1596706573.775492704","description":"Error received from peer ipv4:10.19.81.18:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 227, Output num: 0","grpc_status":3}

I need to train on the TPU because I'm hitting the GPU time limit (training takes more than 12 hours).

wangcongcong123 commented 4 years ago

I got the same issue. Any solutions?

ya0002 commented 4 years ago

Nope.

bmd-drepecka commented 3 years ago

I believe this happens because a TPU cannot read from or write to the local file system, so every path has to point to a GCP bucket: https://cloud.google.com/tpu/docs/troubleshooting#cannot_use_local_filesystem
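For anyone checking their own setup, here is a minimal sketch (not part of the Object Detection API) that lists which paths are still local. It assumes the pipeline config uses the usual `input_path`, `label_map_path`, and `fine_tune_checkpoint` fields; the helper name `find_local_paths` and the bucket `my-bucket` are made up for illustration. Note that the error above names `/content/training/train`, which looks like a `train` subdirectory of a local `--model_dir`, so the sketch checks that value as well.

```python
import re

# Path-valued fields commonly found in Object Detection pipeline configs.
# This list is an assumption; extend it if your config uses other fields.
PATH_FIELDS = ("input_path", "label_map_path", "fine_tune_checkpoint")

# Matches lines such as:  input_path: "gs://my-bucket/train.record"
_FIELD_RE = re.compile(
    r'^\s*(%s)\s*:\s*"([^"]*)"' % "|".join(PATH_FIELDS), re.MULTILINE
)


def find_local_paths(config_text, model_dir):
    """Return (name, path) pairs whose path does not start with gs://."""
    candidates = [("model_dir", model_dir)]
    candidates += _FIELD_RE.findall(config_text)
    return [(name, path) for name, path in candidates
            if not path.startswith("gs://")]


if __name__ == "__main__":
    # Hypothetical config excerpt; "my-bucket" is a placeholder bucket name.
    sample = '''
    train_config {
      fine_tune_checkpoint: "gs://my-bucket/ckpt/ckpt-0"
    }
    train_input_reader {
      label_map_path: "/content/label_map.pbtxt"
      tf_record_input_reader {
        input_path: "gs://my-bucket/train.record"
      }
    }
    '''
    for name, path in find_local_paths(sample, "/content/training"):
        print(f"{name}: {path} is not a gs:// path")
```

If this reports `model_dir` or any config field, moving that path to a bucket is probably the first thing to try.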

impaul98 commented 3 years ago

Got the same issue. My data is read from a GCP bucket.