hhk7734 / tensorflow-yolov4

YOLOv4 Implemented in Tensorflow 2.
MIT License

Significant FPS drop and issues with tracking #26

Open JimBratsos opened 4 years ago

JimBratsos commented 4 years ago

Good evening. I have been trying to use the converted model for object detection with deepsort today, without success. Before that, I tested it as you outlined, using the inference command. However, with videos it takes a very long time to advance each frame and track the changes.

As for deepsort, I used https://github.com/theAIGuysCode/yolov4-deepsort and its tracker script, which only produced the following error:

ValueError: Shapes (1, 19, 19) and (1, 38, 38) are incompatible

After that, I tried running the same script (basically the same as hunglc007's script) with the correction for int8 models specified here: https://github.com/hunglc007/tensorflow-yolov4-tflite/issues/214 (I recall you referenced someone on this in one of the issues). That run gave me the following error:

File "object_tracker.py", line 127, in main
    output_tensors = decode(pred[2], input_size // 8, NUM_CLASS, STRIDES, ANCHORS, i, XYSCALE, 'tflite')
IndexError: list index out of range

If I swap the number 2 for 1 or 0, it eventually brings up an image, but with an extremely inaccurate detection. I haven't tried this with video, for safety purposes :P ...

These errors and the low FPS are the crucial problems for me. Thank you for your great work, though; the conversion is successful and the model is working.

hhk7734 commented 4 years ago

What model did you use? And can you share your script?

hhk7734 commented 4 years ago

I don't know what pred is. yolo.inference() has no return value, and yolo.predict() returns pred_bboxes == Dim(-1, (x, y, w, h, class_id, probability))

If you want (1,19,19,x) shape, use yolo.model.predict()
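
For readers feeding the yolo.predict() result into a tracker, here is a minimal sketch of unpacking those rows. It assumes x and y are the box center and that all four box values are normalized to the range 0..1, which is how the conversion in hhk7734's script later in this thread treats them. The helper name to_pixel_tlwh is hypothetical and not part of the library.

```python
import numpy as np


def to_pixel_tlwh(pred_bboxes, frame_w, frame_h):
    """Split rows of (x, y, w, h, class_id, probability) into tracker inputs.

    Assumption: x, y are the normalized box center, w, h the normalized size.
    Returns pixel (left, top, width, height) boxes, class ids and scores.
    """
    pred = np.asarray(pred_bboxes, dtype=np.float32).reshape(-1, 6)
    tlwh = np.empty((len(pred), 4), dtype=np.float32)
    tlwh[:, 0] = (pred[:, 0] - pred[:, 2] / 2) * frame_w  # left edge
    tlwh[:, 1] = (pred[:, 1] - pred[:, 3] / 2) * frame_h  # top edge
    tlwh[:, 2] = pred[:, 2] * frame_w                     # width
    tlwh[:, 3] = pred[:, 3] * frame_h                     # height
    class_ids = pred[:, 4].astype(int)
    scores = pred[:, 5]
    return tlwh, class_ids, scores
```

An empty prediction list yields arrays of length zero, so the caller does not need a special case for frames without detections.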

hhk7734 commented 4 years ago

And ref: https://github.com/hhk7734/tensorflow-yolov4/issues/23#issuecomment-687859586

To speed up, I'll test it out ASAP.

JimBratsos commented 4 years ago

I use yolov4-tiny with relu activation, converted to tflite. From what I remember from netron, it has 2 outputs. The script I am using expects 3 outputs, which is probably the cause of the second issue I am facing. Here is the script:

import os
# comment out below line to enable tensorflow logging outputs
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import time
import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
from absl import app, flags, logging
from absl.flags import FLAGS
import core.utils as utils
from core.yolov4 import decode,filter_boxes
from tensorflow.python.saved_model import tag_constants
from core.config import cfg
from PIL import Image
import cv2
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

# deep sort imports
from deep_sort import preprocessing, nn_matching
from deep_sort.detection import Detection
from deep_sort.tracker import Tracker
from tools import generate_detections as gdet

flags.DEFINE_string('framework', 'tf', '(tf, tflite, trt)')
flags.DEFINE_string('weights', './checkpoints/yolov4-416',
                    'path to weights file')
flags.DEFINE_integer('size', 416, 'resize images to')
flags.DEFINE_boolean('tiny', False, 'yolo or yolo-tiny')
flags.DEFINE_string('model', 'yolov4', 'yolov3 or yolov4')
flags.DEFINE_string('video', './data/video/test.mp4', 'path to input video or set to 0 for webcam')
flags.DEFINE_string('output', None, 'path to output video')
flags.DEFINE_string('output_format', 'XVID', 'codec used in VideoWriter when saving video to file')
flags.DEFINE_float('iou', 0.45, 'iou threshold')
flags.DEFINE_float('score', 0.50, 'score threshold')
flags.DEFINE_boolean('dont_show', False, 'dont show video output')
flags.DEFINE_boolean('info', False, 'show detailed info of tracked objects')
flags.DEFINE_boolean('count', False, 'count objects being tracked on screen')

def main(_argv):
    # Definition of the parameters
    max_cosine_distance = 0.4
    nn_budget = None
    nms_max_overlap = 1.0

    # initialize deep sort
    model_filename = 'model_data/mars-small128.pb'
    encoder = gdet.create_box_encoder(model_filename, batch_size=1)
    # calculate cosine distance metric
    metric = nn_matching.NearestNeighborDistanceMetric("cosine", max_cosine_distance, nn_budget)
    # initialize tracker
    tracker = Tracker(metric)

    # load configuration for object detector
    config = ConfigProto()
    config.gpu_options.allow_growth = True
    session = InteractiveSession(config=config)
    STRIDES, ANCHORS, NUM_CLASS, XYSCALE = utils.load_config(FLAGS)
    input_size = FLAGS.size
    video_path = FLAGS.video

    # load tflite model if flag is set
    if FLAGS.framework == 'tflite':
        interpreter = tf.lite.Interpreter(model_path=FLAGS.weights)
        interpreter.allocate_tensors()
        input_details = interpreter.get_input_details()
        output_details = interpreter.get_output_details()
        print(input_details)
        print(output_details)
    # otherwise load standard tensorflow saved model
    else:
        saved_model_loaded = tf.saved_model.load(FLAGS.weights, tags=[tag_constants.SERVING])
        infer = saved_model_loaded.signatures['serving_default']

    # begin video capture
    try:
        vid = cv2.VideoCapture(int(video_path))
    except:
        vid = cv2.VideoCapture(video_path)

    out = None

    # get video ready to save locally if flag is set
    if FLAGS.output:
        # by default VideoCapture returns float instead of int
        width = int(vid.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(vid.get(cv2.CAP_PROP_FRAME_HEIGHT))
        fps = int(vid.get(cv2.CAP_PROP_FPS))
        codec = cv2.VideoWriter_fourcc(*FLAGS.output_format)
        out = cv2.VideoWriter(FLAGS.output, codec, fps, (width, height))

    # while video is running
    while True:
        return_value, frame = vid.read()
        if return_value:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            image = Image.fromarray(frame)
        else:
            print('Video has ended or failed, try a different video format!')
            break

        frame_size = frame.shape[:2]
        image_data = cv2.resize(frame, (input_size, input_size))
        image_data = image_data / 255.
        image_data = image_data[np.newaxis, ...].astype(np.float32)
        start_time = time.time()

        # run detections on tflite if flag is set
        if FLAGS.framework == 'tflite':
            interpreter = tf.lite.Interpreter(model_path=FLAGS.weights)
            interpreter.allocate_tensors()
            input_details = interpreter.get_input_details()
            output_details = interpreter.get_output_details()
            print(input_details)
            print(output_details)
            interpreter.set_tensor(input_details[0]['index'], image_data)
            interpreter.invoke()
            pred = [interpreter.get_tensor(output_details[i]['index']) for i in range(len(output_details))]
            # add post process code here
            bbox_tensors = []
            prob_tensors = []
            for i, fm in enumerate(pred):
                if i == 0:
                    output_tensors = decode(pred[2], input_size // 8, NUM_CLASS, STRIDES, ANCHORS, i, XYSCALE, 'tflite')
                elif i == 1:
                    output_tensors = decode(pred[0], input_size // 16, NUM_CLASS, STRIDES, ANCHORS, i, XYSCALE, 'tflite')
                else:
                    output_tensors = decode(pred[1], input_size // 32, NUM_CLASS, STRIDES, ANCHORS, i, XYSCALE, 'tflite')
                bbox_tensors.append(output_tensors[0])
                prob_tensors.append(output_tensors[1])
            pred_bbox = tf.concat(bbox_tensors, axis=1)
            pred_prob = tf.concat(prob_tensors, axis=1)
            pred = (pred_bbox, pred_prob)

            if FLAGS.model == 'yolov3' and FLAGS.tiny == True:
                boxes, pred_conf = filter_boxes(pred[1], pred[0], score_threshold=0.25, input_shape=tf.constant([input_size, input_size]))
            else:
                boxes, pred_conf = filter_boxes(pred[0], pred[1], score_threshold=0.25, input_shape=tf.constant([input_size, input_size]))
        else:
            batch_data = tf.constant(image_data)
            pred_bbox = infer(batch_data)
            for key, value in pred_bbox.items():
                boxes = value[:, :, 0:4]
                pred_conf = value[:, :, 4:]

        boxes, scores, classes, valid_detections = tf.image.combined_non_max_suppression(
            boxes=tf.reshape(boxes, (tf.shape(boxes)[0], -1, 1, 4)),
            scores=tf.reshape(
                pred_conf, (tf.shape(pred_conf)[0], -1, tf.shape(pred_conf)[-1])),
            max_output_size_per_class=50,
            max_total_size=50,
            iou_threshold=FLAGS.iou,
            score_threshold=FLAGS.score
        )

        # convert data to numpy arrays and slice out unused elements
        num_objects = valid_detections.numpy()[0]
        bboxes = boxes.numpy()[0]
        bboxes = bboxes[0:int(num_objects)]
        scores = scores.numpy()[0]
        scores = scores[0:int(num_objects)]
        classes = classes.numpy()[0]
        classes = classes[0:int(num_objects)]

        # format bounding boxes from normalized ymin, xmin, ymax, xmax ---> xmin, ymin, width, height
        original_h, original_w, _ = frame.shape
        bboxes = utils.format_boxes(bboxes, original_h, original_w)

        # store all predictions in one parameter for simplicity when calling functions
        pred_bbox = [bboxes, scores, classes, num_objects]

        # read in all class names from config
        class_names = utils.read_class_names(cfg.YOLO.CLASSES)

        # by default allow all classes in .names file
        allowed_classes = list(class_names.values())

        # custom allowed classes (uncomment line below to customize tracker for only people)
        #allowed_classes = ['person']

        # loop through objects and use class index to get class name, allow only classes in allowed_classes list
        names = []
        deleted_indx = []
        for i in range(num_objects):
            class_indx = int(classes[i])
            class_name = class_names[class_indx]
            if class_name not in allowed_classes:
                deleted_indx.append(i)
            else:
                names.append(class_name)
        names = np.array(names)
        count = len(names)
        if FLAGS.count:
            cv2.putText(frame, "Objects being tracked: {}".format(count), (5, 35), cv2.FONT_HERSHEY_COMPLEX_SMALL, 2, (0, 255, 0), 2)
            print("Objects being tracked: {}".format(count))
        # delete detections that are not in allowed_classes
        bboxes = np.delete(bboxes, deleted_indx, axis=0)
        scores = np.delete(scores, deleted_indx, axis=0)

        # encode yolo detections and feed to tracker
        features = encoder(frame, bboxes)
        detections = [Detection(bbox, score, class_name, feature) for bbox, score, class_name, feature in zip(bboxes, scores, names, features)]

        #initialize color map
        cmap = plt.get_cmap('tab20b')
        colors = [cmap(i)[:3] for i in np.linspace(0, 1, 20)]

        # run non-maxima supression
        boxs = np.array([d.tlwh for d in detections])
        scores = np.array([d.confidence for d in detections])
        classes = np.array([d.class_name for d in detections])
        indices = preprocessing.non_max_suppression(boxs, classes, nms_max_overlap, scores)
        detections = [detections[i] for i in indices]       

        # Call the tracker
        tracker.predict()
        tracker.update(detections)

        # update tracks
        for track in tracker.tracks:
            if not track.is_confirmed() or track.time_since_update > 1:
                continue 
            bbox = track.to_tlbr()
            class_name = track.get_class()

            # draw bbox on screen
            color = colors[int(track.track_id) % len(colors)]
            color = [i * 255 for i in color]
            cv2.rectangle(frame, (int(bbox[0]), int(bbox[1])), (int(bbox[2]), int(bbox[3])), color, 2)
            cv2.rectangle(frame, (int(bbox[0]), int(bbox[1]-30)), (int(bbox[0])+(len(class_name)+len(str(track.track_id)))*17, int(bbox[1])), color, -1)
            cv2.putText(frame, class_name + "-" + str(track.track_id),(int(bbox[0]), int(bbox[1]-10)),0, 0.75, (255,255,255),2)

            # if enable info flag then print details about each track
            if FLAGS.info:
                print("Tracker ID: {}, Class: {},  BBox Coords (xmin, ymin, xmax, ymax): {}".format(str(track.track_id), class_name, (int(bbox[0]), int(bbox[1]), int(bbox[2]), int(bbox[3]))))

        # calculate frames per second of running detections
        fps = 1.0 / (time.time() - start_time)
        print("FPS: %.2f" % fps)
        result = np.asarray(frame)
        result = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)

        if not FLAGS.dont_show:
            cv2.imshow("Output Video", result)

        # if output flag is set, save video file
        if FLAGS.output:
            out.write(result)
        if cv2.waitKey(1) & 0xFF == ord('q'): break
    cv2.destroyAllWindows()

if __name__ == '__main__':
    try:
        app.run(main)
    except SystemExit:
        pass

The error occurs at the tflite area, although I posted the whole script since it might prove useful for others too. Thanks a lot for your help
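
One detail in the script above that may matter for the frame rate: the tflite branch inside the while loop builds a new tf.lite.Interpreter and calls allocate_tensors() for every frame, even though an interpreter was already created before the loop. Loading the model once and reusing it per frame is the usual pattern; how much time it saves here is an assumption that would need measuring. A minimal sketch, where run_tflite is a hypothetical helper name:

```python
def run_tflite(interpreter, input_details, output_details, image_data):
    """Run one frame through an interpreter that is already allocated.

    Nothing is loaded or allocated here, so the per-frame cost is only
    set_tensor + invoke + get_tensor.
    """
    interpreter.set_tensor(input_details[0]["index"], image_data)
    interpreter.invoke()
    return [interpreter.get_tensor(d["index"]) for d in output_details]


# Intended use (needs TensorFlow and a model file, so shown as comments):
#
#   interpreter = tf.lite.Interpreter(model_path=FLAGS.weights)
#   interpreter.allocate_tensors()                  # once, before the loop
#   input_details = interpreter.get_input_details()
#   output_details = interpreter.get_output_details()
#   while True:
#       ...
#       pred = run_tflite(interpreter, input_details, output_details, image_data)
```

The two print() calls on the tensor details would also move out of the loop with this change, since they only need to run once.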

JimBratsos commented 4 years ago

Update: I have looked at this code more over the past few days, and I've noticed that it is written specifically for tflite models with 3 outputs/branches, while my yolov4-tiny model has 2 outputs. I will see how I can modify the above script to run my model, but the speed (FPS) is still extremely low (0.15 FPS with inference). Any idea how to fix that part?
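
The IndexError comes from the hard-coded pred[2], pred[0], pred[1] lookups, which only make sense when the interpreter returns three tensors. A possible way around that is to pick each output by its grid size instead of its position in the list. The sketch below assumes the outputs are laid out as (1, grid, grid, channels), as in the (1,19,19,x) shape mentioned earlier, and that yolov4-tiny uses strides 16 and 32; with an input size of 608 that gives the 38 and 19 grids from the ValueError. pair_outputs_with_strides is a hypothetical helper, not part of either repository.

```python
import numpy as np


def pair_outputs_with_strides(outputs, input_size, strides=(16, 32)):
    """Return [(stride, tensor), ...] ordered by stride, matched on grid size.

    Assumption: each output has shape (1, grid, grid, channels) and
    grid == input_size // stride for exactly one of the given strides.
    """
    by_grid = {out.shape[1]: out for out in outputs}
    paired = []
    for stride in sorted(strides):
        grid = input_size // stride
        if grid not in by_grid:
            raise ValueError(
                "no output with grid {} (stride {}); got grids {}".format(
                    grid, stride, sorted(by_grid)
                )
            )
        paired.append((stride, by_grid[grid]))
    return paired


# Example with dummy tensors in the order a two-output model might return them.
outputs = [np.zeros((1, 19, 19, 255)), np.zeros((1, 38, 38, 255))]
for stride, tensor in pair_outputs_with_strides(outputs, input_size=608):
    print(stride, tensor.shape)
```

Each pair could then be passed to decode with input_size // stride as the grid size. Whether the anchors and strides loaded by utils.load_config match a tiny model depends on the tiny flag, which defaults to False in the script above.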

hhk7734 commented 4 years ago

It's only 0.15? On Coral?

JimBratsos commented 4 years ago

Sorry, the FPS on Coral is 0.45. Still relatively low, not that big of an improvement.

hhk7734 commented 4 years ago

HW: AMD Ryzen 7 2700X, using only CPU
Video: https://github.com/theAIGuysCode/yolov4-deepsort/blob/master/data/video/test.mp4

I think the computation time excluding inference is too long.

How to install scipy on Coral?

Result

FPS: 3.09, inference: 0.13 s, compute: 0.32
FPS: 3.06, inference: 0.14 s, compute: 0.33
FPS: 3.08, inference: 0.13 s, compute: 0.32
FPS: 2.93, inference: 0.13 s, compute: 0.34
FPS: 3.06, inference: 0.13 s, compute: 0.33
FPS: 3.18, inference: 0.13 s, compute: 0.31
FPS: 2.87, inference: 0.13 s, compute: 0.35
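
Note that in the script that follows, the compute value is printed from 1 / fps, so it is the time for the whole frame. With about 0.13 s of inference out of about 0.32 s per frame, roughly 0.19 s goes to everything else (encoding, tracking, drawing). To see where that time goes, each stage can be timed separately. The following is a minimal sketch using only the standard library; StageTimer and the stage names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named stage across frames."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def mean(self, name):
        return self.totals[name] / self.counts[name]

    def report(self):
        return ", ".join(
            "{}: {:.3f} s".format(name, self.mean(name)) for name in self.totals
        )


# Intended use inside the video loop (shown as comments):
#
#   timer = StageTimer()
#   with timer.stage("inference"):
#       _bboxes = yolo.predict(frame)
#   with timer.stage("encode"):
#       features = encoder(frame, bboxes)
#   with timer.stage("track"):
#       tracker.predict()
#       tracker.update(detections)
#   print(timer.report())
```

Reporting the mean over many frames smooths out the first-frame warm-up, which is often much slower than the rest.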

Script

import time

import cv2
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from yolov4.tf import YOLOv4

from deep_sort import preprocessing, nn_matching
from deep_sort.detection import Detection
from deep_sort.tracker import Tracker
from tools import generate_detections as gdet

yolo = YOLOv4(tiny=True)
yolo.classes = "dataset/coco.names"
yolo.make_model(activation1="relu")
yolo.load_weights(
    r"C:\Users\windows\google_drive\Hard_Soft\NN\yolov4\yolov4-tiny-relu.weights",
    weights_type="yolo",
)

# Definition of the parameters
max_cosine_distance = 0.4
nn_budget = None
nms_max_overlap = 1.0

# initialize deep sort
model_filename = "model_data/mars-small128.pb"
encoder = gdet.create_box_encoder(model_filename, batch_size=1)
# calculate cosine distance metric
metric = nn_matching.NearestNeighborDistanceMetric(
    "cosine", max_cosine_distance, nn_budget
)
# initialize tracker
tracker = Tracker(metric)

# load configuration for object detector
input_size = yolo.input_size
video_path = r"C:/Users/windows/Desktop/test.mp4"

# begin video capture
vid = cv2.VideoCapture(video_path)

out = None

# initialize color map
cmap = plt.get_cmap("tab20b")
colors = [cmap(i)[:3] for i in np.linspace(0, 1, 20)]

# while video is running
while True:
    start_time = time.time()

    return_value, frame = vid.read()
    if return_value:
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(frame)
    else:
        print("Video has ended or failed, try a different video format!")
        break

    original_h, original_w, _ = frame.shape

    # (x, y, w, h, class_id, probability)
    _bboxes = yolo.predict(frame)

    mid_time = time.time()

    # convert data to numpy arrays
    # format bounding boxes from normalized center x, center y, width, height ---> pixel xmin, ymin, width, height
    num_objects = len(_bboxes)
    bboxes = [
        [
            (box[0] - box[2] / 2) * original_w,
            (box[1] - box[3] / 2) * original_h,
            box[2] * original_w,
            box[3] * original_h,
        ]
        for box in _bboxes
    ]
    bboxes = np.array(bboxes)
    scores = np.array([box[5] for box in _bboxes])
    classes = np.array([int(box[4]) for box in _bboxes])

    # store all predictions in one parameter for simplicity when calling functions
    pred_bbox = [bboxes, scores, classes, num_objects]

    # read in all class names from config
    class_names = yolo.classes

    # by default allow all classes in .names file
    # allowed_classes = list(class_names.values())

    # custom allowed classes (uncomment line below to customize tracker for only people)
    allowed_classes = ["person", "bicycle"]

    # loop through objects and use class index to get class name, allow only classes in allowed_classes list
    names = []
    deleted_indx = []
    for i in range(num_objects):
        class_indx = classes[i]
        class_name = class_names[class_indx]
        if class_name not in allowed_classes:
            deleted_indx.append(i)
        else:
            names.append(class_name)
    names = np.array(names)
    count = len(names)

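    # delete detections that are not in allowed_classes
    # (scores must be filtered too, so they stay aligned with bboxes and names)
    bboxes = np.delete(bboxes, deleted_indx, axis=0)
    scores = np.delete(scores, deleted_indx, axis=0)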

    # encode yolo detections and feed to tracker
    features = encoder(frame, bboxes)
    detections = [
        Detection(bbox, score, class_name, feature)
        for bbox, score, class_name, feature in zip(
            bboxes, scores, names, features
        )
    ]

    # run non-maxima supression
    boxs = np.array([d.tlwh for d in detections])
    scores = np.array([d.confidence for d in detections])
    classes = np.array([d.class_name for d in detections])
    indices = preprocessing.non_max_suppression(
        boxs, classes, nms_max_overlap, scores
    )
    detections = [detections[i] for i in indices]

    # Call the tracker
    tracker.predict()
    tracker.update(detections)

    # update tracks
    for track in tracker.tracks:
        if not track.is_confirmed() or track.time_since_update > 1:
            continue
        bbox = track.to_tlbr()
        class_name = track.get_class()

        # draw bbox on screen
        color = colors[int(track.track_id) % len(colors)]
        color = [i * 255 for i in color]
        cv2.rectangle(
            frame,
            (int(bbox[0]), int(bbox[1])),
            (int(bbox[2]), int(bbox[3])),
            color,
            2,
        )
        cv2.rectangle(
            frame,
            (int(bbox[0]), int(bbox[1] - 30)),
            (
                int(bbox[0])
                + (len(class_name) + len(str(track.track_id))) * 17,
                int(bbox[1]),
            ),
            color,
            -1,
        )
        cv2.putText(
            frame,
            class_name + "-" + str(track.track_id),
            (int(bbox[0]), int(bbox[1] - 10)),
            0,
            0.75,
            (255, 255, 255),
            2,
        )

    # calculate frames per second of running detections
    fps = 1.0 / (time.time() - start_time)
    print(
        "FPS: {:.2f}, inference: {:.2f} s, compute: {:.2f}".format(
            fps, mid_time - start_time, 1 / fps
        )
    )
    result = np.asarray(frame)
    result = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)

    cv2.imshow("Output Video", result)

    # press q to quit
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cv2.destroyAllWindows()

JimBratsos commented 4 years ago

Hey, I might have made things a bit unclear. Initially I was trying to run the above code on my PC, to test how the model would run with deepsort. I could not run it, because the model has 2 outputs instead of the 3 expected by the script I sent you.

I tested the model with the inference you provided on my PC, which gave 0.15 FPS, and then on Coral, which gave 0.45 FPS. It is 3x better but still extremely low. For the PC test I used my GPU (GTX 1660 Super).

Sorry for the misunderstanding. As for Coral, I do not think there is a way to install scipy on it at the moment, so I might just go with Kalman trackers or basic centroid tracking. What bothers me a bit, though, is the aforementioned low FPS issue.
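
For reference, basic centroid tracking can be done with numpy alone. The sketch below is one simple variant under stated assumptions: it matches detections to existing tracks greedily by nearest centroid (not an optimal assignment and not what deepsort does), takes boxes as pixel (left, top, width, height), and uses hypothetical names and thresholds (CentroidTracker, max_distance, max_missed) that would need tuning.

```python
import numpy as np


class CentroidTracker:
    """Greedy nearest-centroid tracker; a sketch, not a deepsort replacement."""

    def __init__(self, max_distance=50.0, max_missed=10):
        self.max_distance = max_distance  # pixels; farther pairs never match
        self.max_missed = max_missed      # frames a track may go unmatched
        self.next_id = 0
        self.centroids = {}               # track id -> (cx, cy)
        self.missed = {}                  # track id -> consecutive misses

    def update(self, boxes_tlwh):
        """Return {detection index: track id} for the boxes of one frame."""
        boxes = np.asarray(boxes_tlwh, dtype=np.float32).reshape(-1, 4)
        centers = boxes[:, :2] + boxes[:, 2:] / 2
        free_tracks = set(self.centroids)
        free_dets = set(range(len(centers)))
        assigned = {}

        # All (distance, track, detection) pairs, closest first.
        pairs = sorted(
            (float(np.linalg.norm(centers[d] - self.centroids[t])), t, d)
            for t in free_tracks
            for d in free_dets
        )
        for dist, t, d in pairs:
            if dist > self.max_distance:
                break
            if t in free_tracks and d in free_dets:
                free_tracks.discard(t)
                free_dets.discard(d)
                self.centroids[t] = centers[d]
                self.missed[t] = 0
                assigned[d] = t

        # Unmatched tracks age and are dropped after too many misses.
        for t in free_tracks:
            self.missed[t] += 1
            if self.missed[t] > self.max_missed:
                del self.centroids[t]
                del self.missed[t]

        # Unmatched detections start new tracks.
        for d in sorted(free_dets):
            self.centroids[self.next_id] = centers[d]
            self.missed[self.next_id] = 0
            assigned[d] = self.next_id
            self.next_id += 1
        return assigned
```

Because it uses no appearance features, this kind of tracker swaps identities easily when objects cross, so it is only a fallback where deepsort's dependencies are unavailable.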

JimBratsos commented 4 years ago

I modified the script above, and it now works with tflite models, which is a positive result. The drawback is that the FPS is still extremely low, to the point that the window stops responding:

FPS: 0.03, inference: 35.98 s, compute: 36.15
FPS: 0.03, inference: 36.02 s, compute: 36.12
FPS: 0.03, inference: 36.01 s, compute: 36.13

Thanks for the script ( Tested on GPU )

asen16 commented 3 years ago

Update: I have looked at this code more over the past few days, and I've noticed that it is written specifically for tflite models with 3 outputs/branches, while my yolov4-tiny model has 2 outputs. I will see how I can modify the above script to run my model, but the speed (FPS) is still extremely low (0.15 FPS with inference). Any idea how to fix that part?

Could you explain how you solved this problem? I got the same error: ValueError: Shapes (1, 19, 19) and (1, 38, 38) are incompatible