lattice-ai / DeepLabV3-Plus

Tensorflow 2.3.0 implementation of DeepLabV3-Plus
https://keras.io/examples/vision/deeplabv3_plus/
101 stars 23 forks source link

mobilenet V2 train fail #8

Open fanweiya opened 3 years ago

fanweiya commented 3 years ago

i use mobilenet V2 backbone, but train fail

[-] Importing tensorflow...
2021-01-14 13:49:10.317068: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[+] Done! Tensorflow version: 2.5.0-dev20201230
[-] Importing Deeplabv3plus Trainer class...
[-] Importing config files...
2021-01-14 13:49:11.537581: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-14 13:49:11.591072: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-01-14 13:49:11.591101: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (alit-PowerEdge-T640): /proc/driver/nvidia/version does not exist
2021-01-14 13:49:11.591383: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0,/job:localhost/replica:0/task:0/device:GPU:1
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0,/job:localhost/replica:0/task:0/device:GPU:1
Train Images are good to go
[+] Data points in train dataset: 6400
Train Dataset: <PrefetchDataset shapes: ((16, 512, 512, 3), (16, 512, 512, 1)), types: (tf.float32, tf.float32)>
Train Images are good to go
Data points in train dataset: 1600
Val Dataset: <PrefetchDataset shapes: ((16, 512, 512, 3), (16, 512, 512, 1)), types: (tf.float32, tf.float32)>
2021-01-14 13:49:12.045387: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-01-14 13:49:12.045414: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2021-01-14 13:49:12.100790: I tensorflow/core/profiler/lib/profiler_session.cc:158] Profiler session tear down.
2021-01-14 13:49:12.268507: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:656] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_STRING
      type: DT_STRING
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
      }
      shape {
      }
    }
  }
}

2021-01-14 13:49:12.362496: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:127] None of the MLIR optimization passes are enabled (registered 2)
2021-01-14 13:49:12.367114: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300000000 Hz
Epoch 1/100
WARNING:tensorflow:`input_shape` is undefined or non-square, or `rows` is not in [96, 128, 160, 192, 224]. Weights for input shape (224, 224) will be loaded as the default.
WARNING:tensorflow:`input_shape` is undefined or non-square, or `rows` is not in [96, 128, 160, 192, 224]. Weights for input shape (224, 224) will be loaded as the default.
Traceback (most recent call last):
  File "trainer.py", line 47, in <module>
    HISTORY = TRAINER.train()
  File "/data/deeplab/DeepLabV3-Plus/deeplabv3plus/train.py", line 191, in train
    epochs=self.config['epochs'], callbacks=callbacks
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/wandb/integration/keras/keras.py", line 119, in new_v2
    return old_v2(*args, **kwargs)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1135, in fit
    tmp_logs = self.train_function(iterator)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 797, in __call__
    result = self._call(*args, **kwds)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 841, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 695, in _initialize
    *args, **kwds))
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2998, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3390, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3235, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 998, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 603, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 985, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    /data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:840 train_function  *
        return step_function(self, iterator)
    /data/deeplab/DeepLabV3-Plus/deeplabv3plus/model/deeplabv3_plus.py:104 call  *
        tensor = tf.keras.layers.Concatenate(axis=-1)([input_a, input_b])
    /data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:1015 __call__  **
        self._maybe_build(inputs)
    /data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:2709 _maybe_build
        self.build(input_shapes)  # pylint:disable=not-callable
    /data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py:273 wrapper
        output_shape = fn(instance, input_shape)
    /data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/layers/merge.py:519 build
        raise ValueError(err_msg)

    ValueError: A `Concatenate` layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(8, 128, 128, 256), (8, 64, 64, 48)]
jeremy-cv commented 3 years ago

Hi, thanks for this implementation.

I'm having the same issue with Mobilenetv2 backbone model.

@fanweiya did you solve this ?

thanks

jeremy-cv commented 3 years ago

It seems that training runs with factor 8 in ._get_upsample_layer_fn(input_shape, factor=8)

diogosilva30 commented 2 years ago

+1 Same error here

shivarajkarki commented 1 year ago

Same error ValueError: A Concatenate layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(1, 128, 128, 256), (1, 64, 64, 48)]

KozAAAAA commented 7 months ago

Has anyone managed to fix this?