ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Can't build TF saved_model or tflite with new v6 #5147

Closed YoniChechik closed 3 years ago

YoniChechik commented 3 years ago

🐛 Bug

Can't build TF saved_model or tflite with new v6

To Reproduce (REQUIRED)

Run the Google Colab notebook; only the block below needs to be run:

# CI Checks
%%shell
export PYTHONPATH="$PWD"  # to run *.py. files in subdirectories
rm -rf runs  # remove runs/
for m in yolov5s; do  # models
  python train.py --weights $m.pt --epochs 3 --img 320 --device 0  # train pretrained
  python train.py --weights '' --cfg $m.yaml --epochs 3 --img 320 --device 0  # train scratch
  for d in 0; do # cpu; do  # devices
    python detect.py --weights $m.pt --device $d  # detect official
    python detect.py --weights runs/train/exp/weights/best.pt --device $d  # detect custom
    python val.py --weights $m.pt --device $d # val official
    python val.py --weights runs/train/exp/weights/best.pt --device $d # val custom
  done
  python hubconf.py  # hub
  python models/yolo.py --cfg $m.yaml  # build PyTorch model
  python models/tf.py --weights $m.pt  # build TensorFlow model
  python export.py --img 128 --batch 1 --weights $m.pt --include tflite #torchscript onnx  # export
done

output: missing SPPF in tf.py

Environment

google colab + gpu

Some debug:

To start, SPPF is missing from the imports in tf.py:

from models.common import Conv, Bottleneck, SPP, DWConv, Focus, BottleneckCSP, Concat, autopad, C3, SPPF  # SPPF added

...

class TFSPPF(keras.layers.Layer):
    # Spatial pyramid pooling-Fast layer
    def __init__(self, c1, c2, k=5, w=None):
        super(TFSPPF, self).__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = TFConv(c1, c_, 1, 1, w=w.cv1)
        self.cv2 = TFConv(c_ * 4, c2, 1, 1, w=w.cv2)
        self.m = keras.layers.MaxPool2D(pool_size=k, strides=1, padding='SAME')

    def call(self, inputs):
        x = self.cv1(inputs)
        y1 = self.m(x)
        y2 = self.m(y1)
        return self.cv2(tf.concat([x, y1, y2, self.m(y2)], 3))
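
The three chained pools in call can be read as a cheaper equivalent of three parallel pools with kernel sizes 5, 9 and 13 (which, as far as I know, are the defaults of the older SPP block). The sketch below checks that equivalence on a 1-D list in plain Python. The helper max_pool_same_1d and the sample values are made up for illustration, and clipping the window at the edges is my assumption of how stride-1 'SAME' max pooling treats the border.

```python
def max_pool_same_1d(values, k):
    """Stride-1 max pool over a 1-D list; edge windows are clipped, not zero-padded."""
    r = k // 2
    return [max(values[max(0, i - r): i + r + 1]) for i in range(len(values))]


# Arbitrary illustration data (hypothetical, not taken from the model).
signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]

# Chained k=5 pools, mirroring y1, y2 and self.m(y2) in TFSPPF.call.
y1 = max_pool_same_1d(signal, 5)
y2 = max_pool_same_1d(y1, 5)
y3 = max_pool_same_1d(y2, 5)

# Each extra k=5 pool widens the effective window by 4: 5 -> 9 -> 13.
chained_matches_parallel = (
    y2 == max_pool_same_1d(signal, 9)
    and y3 == max_pool_same_1d(signal, 13)
)
print(chained_matches_parallel)
```

This is why a single MaxPool2D layer with k=5 can be reused three times: the channel count going into cv2 is still c_ * 4, the same as with three separate pools.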

After adding the TFSPPF class, we still get a new error:

Eager execution of tf.constant with unsupported shape (value has 131072 elements, shape is (1, 1, 512, 512) with 262144 elements)
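
One plausible reading of the two element counts, offered as a sketch rather than a confirmed diagnosis: the yolov5s model summary later in this thread prints the layer as SPPF [1024, 5], which suggests the raw YAML arguments reach TFSPPF unchanged, so c1 is taken as 1024 and c_ becomes 512. The checkpoint's cv1, on the other hand, would map 512 input channels to 256. The helper conv_kernel_elements is hypothetical, and the channel numbers are assumptions read off that summary.

```python
def conv_kernel_elements(c_in, c_out, k=1):
    """Number of weights in a k x k convolution kernel (bias excluded)."""
    return k * k * c_in * c_out


# Assumed PyTorch side: SPPF receives 512 channels, so cv1 maps 512 -> 512 // 2.
checkpoint_elements = conv_kernel_elements(512, 512 // 2)

# Assumed TF side: c1 is mis-read as 1024, so cv1 gets 1024 // 2 = 512 filters,
# while the real input still has 512 channels -> kernel shape (1, 1, 512, 512).
tf_layer_elements = conv_kernel_elements(512, 1024 // 2)

print(checkpoint_elements, tf_layer_elements)
```

Under these assumptions the two numbers match the "value has ... elements" and "shape is ... with ... elements" parts of the message, which points at argument parsing rather than at the TFSPPF class itself.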

github-actions[bot] commented 3 years ago

👋 Hello @YoniChechik, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python>=3.6.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

$ git clone https://github.com/ultralytics/yolov5
$ cd yolov5
$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

jonatanbarkan commented 3 years ago

I have the same problem: I can't convert the model to TensorFlow.

in colab:

!git clone https://github.com/ultralytics/yolov5  # clone repo
%cd yolov5
%pip install -qr requirements.txt  # install dependencies

import torch
from IPython.display import Image, clear_output  # to display images

clear_output()
print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")

!python models/tf.py --weights yolov5s.pt  # build TensorFlow model

output:

tf: weights=yolov5s.pt, imgsz=[640, 640], batch_size=1, dynamic=False
Model Summary: 270 layers, 7235389 parameters, 0 gradients, 16.5 GFLOPs
2021-10-12 10:37:24.376296: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-12 10:37:24.385962: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-12 10:37:24.386819: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-12 10:37:24.388245: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-12 10:37:24.389061: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-12 10:37:24.389883: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-12 10:37:24.868789: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-12 10:37:24.869670: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-12 10:37:24.870501: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-12 10:37:24.871224: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-10-12 10:37:24.871306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10818 MB memory: -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  1    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  1    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
512 1024
  9                -1  1    535562  models.common.SPPF                      [1024, 5]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512], [640, 640]]
2021-10-12 10:37:25.517617: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8005
Traceback (most recent call last):
  File "models/tf.py", line 474, in <module>
    main(opt)
  File "models/tf.py", line 469, in main
    run(**vars(opt))
  File "models/tf.py", line 447, in run
    y = tf_model.predict(im)  # inference
  File "models/tf.py", line 359, in predict
    x = m(x)  # run
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py", line 1037, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "models/tf.py", line 199, in call
    x = self.cv1(inputs)
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py", line 1037, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "models/tf.py", line 92, in call
    return self.act(self.bn(self.conv(inputs)))
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py", line 1030, in __call__
    self._maybe_build(inputs)
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py", line 2659, in _maybe_build
    self.build(input_shapes)  # pylint:disable=not-callable
  File "/usr/local/lib/python3.7/dist-packages/keras/layers/convolutional.py", line 204, in build
    dtype=self.dtype)
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py", line 663, in add_weight
    caching_device=caching_device)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/tracking/base.py", line 818, in _add_variable_with_custom_getter
    **kwargs_for_getter)
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer_utils.py", line 129, in make_variable
    shape=variable_shape if variable_shape else None)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variables.py", line 266, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variables.py", line 227, in _variable_v1_call
    shape=shape)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variables.py", line 205, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 2626, in default_variable_creator
    shape=shape)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variables.py", line 270, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 1613, in __init__
    distribute_strategy=distribute_strategy)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 1740, in _init_from_args
    initial_value = initial_value()
  File "/usr/local/lib/python3.7/dist-packages/keras/initializers/initializers_v2.py", line 227, in __call__
    self.value, dtype=_get_dtype(dtype), shape=shape)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py", line 272, in constant
    allow_broadcast=True)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/constant_op.py", line 332, in _constant_eager_impl
    (num_t, shape, shape.num_elements()))
TypeError: Eager execution of tf.constant with unsupported shape (value has 131072 elements, shape is (1, 1, 512, 512) with 262144 elements).

jonatanbarkan commented 3 years ago

solution:

In models/tf.py, add SPPF to the import on line 31:

from models.common import Conv, Bottleneck, SPP, SPPF, DWConv, Focus, BottleneckCSP, Concat, autopad, C3

Add SPPF to the module list on line 275 (in the parse_model function):

if m in [nn.Conv2d, Conv, Bottleneck, SPP, SPPF, DWConv, MixConv2d, Focus, CrossConv, BottleneckCSP, C3]:

Add the TFSPPF class:

class TFSPPF(keras.layers.Layer):
    # Spatial pyramid pooling-Fast layer
    def __init__(self, c1, c2, k=5, w=None):
        super(TFSPPF, self).__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = TFConv(c1, c_, 1, 1, w=w.cv1)
        self.cv2 = TFConv(c_ * 4, c2, 1, 1, w=w.cv2)
        self.m = keras.layers.MaxPool2D(pool_size=k, strides=1, padding="SAME")

    def call(self, inputs):
        x = self.cv1(inputs)
        y1 = self.m(x)
        y2 = self.m(y1)
        y3 = self.m(y2)
        return self.cv2(tf.concat([x, y1, y2, y3], 3))
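
The second step (listing SPPF alongside the other modules) seems to be what turns the raw [1024, 5] arguments into [512, 512, 5] before the layer is built. The sketch below is a simplified stand-in for that branch, not the actual tf.py code: resolve_args, make_divisible and the name sets are hypothetical, and the 0.5 width multiple and 512 input channels are assumptions taken from the yolov5s summary above.

```python
import math


def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


def resolve_args(module_name, yaml_args, c_in, width_multiple, channel_aware):
    """Return constructor args; only modules in channel_aware get (c_in, c_out) prepended."""
    if module_name in channel_aware:
        c_out = make_divisible(yaml_args[0] * width_multiple)
        return [c_in, c_out, *yaml_args[1:]]
    return list(yaml_args)  # passed through untouched


before_fix = resolve_args("SPPF", [1024, 5], 512, 0.5, {"Conv", "C3", "SPP"})
after_fix = resolve_args("SPPF", [1024, 5], 512, 0.5, {"Conv", "C3", "SPP", "SPPF"})

print(before_fix)  # raw YAML args: c1 would be read as 1024, c2 as 5
print(after_fix)   # c1=512, c2=512, k=5
```

With the resolved arguments, c_ in TFSPPF becomes 512 // 2 = 256, which agrees with a cv1 weight that maps 512 channels to 256.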

glenn-jocher commented 3 years ago

@YoniChechik @jonatanbarkan good news 😃! Your original issue may now be fixed ✅ in PR #5147 by @YoniChechik. To receive this update:

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!