isl-org / MiDaS

Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2022"
MIT License

Getting very less FPS #97

Open AryanSethi opened 3 years ago

AryanSethi commented 3 years ago

I'm using this tflite model and running it on my PC with this script:

import cv2
import tensorflow as tf
import urllib.request
import matplotlib.pyplot as plt
import numpy as np

interpreter = tf.lite.Interpreter(model_path="lite-model_midas_v2_1_small_1_lite_1.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_shape = input_details[0]['shape']

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

def startEst():
    cap = cv2.VideoCapture(0)  # open the webcam once, not on every loop iteration
    running = True
    while running:
        _, frame = cap.read()
        img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) / 255.0
        processed = preprocessing(img)

        # run inference on the 1x256x256x3 input tensor
        interpreter.set_tensor(input_details[0]['index'], processed)
        interpreter.invoke()
        output = interpreter.get_tensor(output_details[0]['index'])
        output = output.reshape(256, 256)

        # resize the inverse-depth map back to the original frame size and scale it to 0-255 for display
        prediction = cv2.resize(output, (img.shape[1], img.shape[0]), interpolation=cv2.INTER_CUBIC)
        depth_min = prediction.min()
        depth_max = prediction.max()
        img_out = (255 * (prediction - depth_min) / (depth_max - depth_min)).astype("uint8")
        cv2.imshow('ss', img_out)

        if cv2.waitKey(1) == ord('q'):
            running = False

    cap.release()
    cv2.destroyAllWindows()

def preprocessing(img):
    # resize to the model's 256x256 input and normalize with the ImageNet mean/std
    img_resized = tf.image.resize(img, [256, 256], method='bicubic', preserve_aspect_ratio=False)
    img_input = img_resized.numpy()
    img_input = (img_input - mean) / std
    reshape_img = (img_input.reshape(1, 256, 256, 3)).astype(np.float32)
    return reshape_img

startEst()

But I'm getting a very low frame rate. I'm sure I'm doing something wrong; can anyone help me out here?

3dsf commented 3 years ago

Hey, you haven't listed your FPS or device, or confirmed that the framework is actually using your device's specialized processing unit.

Besides that, the resize operation can significantly affect timing. The example image they used is 768 × 576. Two factors that I know add time to the resize operation (which your code calls twice; I don't know whether the second call matters here) are the size of the image and the complexity of the image. You should confirm your FPS using the example from the link.
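
If it helps, here is a rough, untested sketch of how you could time the interpreter.invoke() call on its own, separately from camera capture and pre/post-processing. It reuses the interpreter, input_details and output_details from your script; the helper name timed_inference is just something I made up:

import time

def timed_inference(processed, warmup=3, runs=20):
    # average the time of interpreter.invoke() alone, discarding warm-up runs
    for _ in range(warmup):
        interpreter.set_tensor(input_details[0]['index'], processed)
        interpreter.invoke()

    start = time.time()
    for _ in range(runs):
        interpreter.set_tensor(input_details[0]['index'], processed)
        interpreter.invoke()
    elapsed = (time.time() - start) / runs

    print(f"average inference time: {elapsed:.3f} s (~{1.0 / elapsed:.1f} FPS, inference only)")
    return elapsed

Calling it once with a single preprocessed frame should tell you how much of your 2 FPS is the model itself versus everything around it.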

AryanSethi commented 3 years ago

@3dsf
1. I'm running the script on my PC (i5 8th gen), but for some reason the GPU (NVIDIA GTX 1050 Ti) usage is shown as 0, so I'm assuming the script is running entirely on the CPU.
2. Even then, I think the FPS is very low. I'm getting less than 2 FPS.
3. You mentioned that they used 768 × 576 as the example image. The input frames I'm using have shape (480, 640, 3); I convert them to (256, 256, 3) and then feed that to the model.

Here's the new script that I'm using; I'm still getting less than 2 FPS:

import cv2
import tensorflow as tf
import numpy as np

interpreter = tf.lite.Interpreter(model_path="lite-model_midas_v2_1_small_1_lite_1.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_shape = input_details[0]['shape']

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

def startEst():
    cap = cv2.VideoCapture(0)  # open the webcam once, not on every loop iteration
    running = True
    while running:
        _, frame = cap.read()  # size of this frame is (480, 640, 3)

        processed = preprocessing(frame)

        interpreter.set_tensor(input_details[0]['index'], processed)
        interpreter.invoke()
        output = interpreter.get_tensor(output_details[0]['index'])
        prediction = output.reshape(256, 256, 1)

        # normalize the inverse-depth map to 0-255 for display
        depth_min = prediction.min()
        depth_max = prediction.max()
        img_out = (255 * (prediction - depth_min) / (depth_max - depth_min)).astype("uint8")
        cv2.imshow('img', img_out)

        if cv2.waitKey(1) == ord('q'):
            running = False

    cap.release()
    cv2.destroyAllWindows()

def preprocessing(img_input):
    # resize to 256x256, convert BGR -> RGB, and scale to [0, 1]
    img_input = cv2.resize(img_input, (256, 256))
    img_input = cv2.cvtColor(img_input, cv2.COLOR_BGR2RGB)
    img_input = img_input.astype(np.float32) / 255
    reshape_img = np.expand_dims(img_input, axis=0)

    return reshape_img

startEst()

And as you suggested, I also ran the code example from the link and got the same results.

3dsf commented 3 years ago

You are definitely not using the GPU. Are you using a virtual environment? How did you install TensorFlow?
Which CUDA version are you using?

AryanSethi commented 3 years ago

@3dsf If you're asking to check whether I have tensorflow-gpu set up with the CUDA drivers: yes, I've been using tensorflow-gpu for over a year now, and all my models are trained on the GPU. If I run these lines in Python

gpus = tf.config.list_physical_devices('GPU')
print(len(gpus))

I get 1 as output. My tensorflow-gpu is properly set up.

But for some weird reason, the GPU is not being used with a '.tflite' model.
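
As far as I can tell, the plain tf.lite.Interpreter in the Python API executes on the CPU by default, and GPU execution requires explicitly attaching a delegate, which the desktop pip package doesn't really give you. A minimal sketch of what attaching one would look like, assuming a GPU delegate shared library were actually available on the desktop (the library path below is hypothetical):

import tensorflow as tf

# Hypothetical path to a TF Lite GPU delegate built for this platform;
# the stock desktop pip package does not ship one, so this is illustrative only.
DELEGATE_PATH = "libtensorflowlite_gpu_delegate.so"

try:
    gpu_delegate = tf.lite.experimental.load_delegate(DELEGATE_PATH)
    interpreter = tf.lite.Interpreter(
        model_path="lite-model_midas_v2_1_small_1_lite_1.tflite",
        experimental_delegates=[gpu_delegate],
    )
except (ValueError, OSError):
    # fall back to the default CPU interpreter if no delegate can be loaded
    interpreter = tf.lite.Interpreter(model_path="lite-model_midas_v2_1_small_1_lite_1.tflite")

interpreter.allocate_tensors()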

3dsf commented 3 years ago

Can you verify the time you get by running this code and checking the inference block? It's the same as your top link.

I get ~0.53 seconds for the inference block (2060S; 3.1 GHz processor) and about 22 seconds for the whole thing. As a note, when running models in other frameworks, the first inference usually took significantly longer. Also, 0.53 s feels slow when compared against the PyTorch model; I'll test the image later.

AryanSethi commented 3 years ago

It took 0.5265460014343262 seconds to run the inference block and 74.80340027809143 seconds to run the entire script due to my slow internet.

This is quite slow. Prediction of a single frame is taking over half a second, which means a maximum of about 2 FPS. That's far too slow.

And about running the model for the first time: I ran mine a couple of times but got the same results.

3dsf commented 3 years ago

Sorry, I should rephrase: with PyTorch and some other frameworks, the first inference always takes longer.

ex// run.py (pytorch)

start processing
  processing input/dog.jpg (1/21)
--- 0.5810694694519043 seconds ---
  processing input/dog.1.png (2/21)
--- 0.03547954559326172 seconds ---
  processing input/dog.to.png (3/21)
--- 0.03509330749511719 seconds ---
  processing input/dog.10.png (4/21)
--- 0.03508877754211426 seconds ---

ex// run_pb.py (tensorflow)

--- 44.24601650238037 seconds ---
  processing input/dog.2.png (2/10)
--- 0.11023926734924316 seconds ---
  processing input/dog.3.png (3/10)
--- 0.10619044303894043 seconds ---
  processing input/dog.4.png (4/10)
--- 0.10412073135375977 seconds ---
  processing input/dog.5.png (5/10)
--- 0.0918574333190918 seconds ---
  processing input/dog.6.png (6/10)
--- 0.08530664443969727 seconds ---
  processing input/dog.7.png (7/10)
--- 0.10563278198242188 seconds ---
  processing input/dog.8.png (8/10)
--- 0.09832310676574707 seconds ---
  processing input/dog.9.png (9/10)
--- 0.09035515785217285 seconds ---
  processing input/dog.10.png (10/10)
--- 0.09112739562988281 seconds ---

For reference, same image, different name



Now when I try it with tflite

--- 0.5367670059204102 seconds ---
 Write image to: output.png
  processing input/dog.2.png (2/21)
--- 0.42414355278015137 seconds ---
 Write image to: output.png
  processing input/dog.3.png (3/21)
--- 0.42281079292297363 seconds ---
 Write image to: output.png
  processing input/dog.4.png (4/21)
--- 0.42469120025634766 seconds ---
 Write image to: output.png
  processing input/dog.5.png (5/21)
--- 0.42205190658569336 seconds ---
 Write image to: output.png
  processing input/dog.6.png (6/21)
--- 0.4326660633087158 seconds ---
 Write image to: output.png
  processing input/dog.7.png (7/21)
--- 0.42022132873535156 seconds ---

I see improved timings, but they are still poor. Note that I've been sloppy in implementing the timing around the inference blocks in this example and have excluded some pre/post-processing, but I would still stand by it as a way to compare relative performance between models.

Maybe consider switching frameworks if possible, or consult the tflite community for further guidance on a potentially faster implementation.
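
If you do try the PyTorch route, here is a rough sketch of a webcam loop with the small MiDaS model from torch.hub, based on the usage documented on the MiDaS PyTorch Hub page (the webcam loop itself is my addition, so treat it as a starting point rather than a reference implementation):

import cv2
import torch

# load the small MiDaS model and its matching transform from torch.hub
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas.to(device).eval()

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    with torch.no_grad():
        batch = transform(img).to(device)
        prediction = midas(batch)
        # upsample the inverse-depth map back to the original frame size
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze()

    depth = prediction.cpu().numpy()
    depth = (255 * (depth - depth.min()) / (depth.max() - depth.min())).astype("uint8")
    cv2.imshow("depth", depth)
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()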

AryanSethi commented 3 years ago

The same model runs on Android and iOS at over 20 FPS; I don't understand why it's running so poorly on my PC.

3dsf commented 3 years ago

It is probably because of how tflite was compiled.

That is part of the reason I asked how it was installed. On stdout, there were info lines suggesting there could be optimizations. I installed in a conda virtual environment using pip. I would imagine the results would be better if TF Lite were compiled from source with attention to the build flags relevant to the hardware.

TFLite was probably designed with mobile processors in mind. Additionally, in the case of the Apple chip, the data probably isn't shuttled between devices (CPU -> GPU -> CPU), which can have performance implications. I can't comment on Android; that could be anything.
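
One CPU-side setting that might be worth testing on a desktop, independent of how the wheel was built, is the interpreter's thread count. I haven't verified that it helps with this particular model, so treat it as something to try:

import tensorflow as tf

# let the TF Lite CPU kernels use multiple threads; the default can be a
# single thread depending on the build
interpreter = tf.lite.Interpreter(
    model_path="lite-model_midas_v2_1_small_1_lite_1.tflite",
    num_threads=4,  # tune to the number of physical cores
)
interpreter.allocate_tensors()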

3dsf commented 3 years ago

Also, maybe something like this will be able to help you: https://coral.ai/products/accelerator
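
For completeness, attaching one of those accelerators would look roughly like the sketch below. It assumes the model has been quantized and recompiled for the Edge TPU (the recompiled filename here is hypothetical) and that the libedgetpu runtime is installed:

import tflite_runtime.interpreter as tflite

# hypothetical Edge TPU build of the model; the stock float tflite file
# would need quantization and the edgetpu_compiler first
interpreter = tflite.Interpreter(
    model_path="midas_small_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()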