sacmehta / ESPNet

ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation
https://sacmehta.github.io/ESPNet/
MIT License

The inference speed on Jetson TX2 #47

Closed MrLinNing closed 5 years ago

MrLinNing commented 5 years ago

Hello, @sacmehta. I ran ESPNet on a Jetson TX2 with JetPack SDK 4.1.1 and PyTorch 0.4. I find that with a 360x640 input image, the inference time is about 0.112s, which means the FPS is less than 10 (this excludes image loading and writing time). In your paper, the inference speed is more than 16 FPS for 360x640 images. Can you give me more details about this? Besides, I used the ERFNet code to measure the inference time of ESPNet: https://github.com/Eromera/erfnet_pytorch/blob/master/eval/eval_forwardTime.py

sacmehta commented 5 years ago

Could you please ensure that CUDA and cuDNN are installed properly? Also, are you discarding the first iteration?

Also, set the GPU frequency to maximum.

P.S.: we used PyTorch v0.3
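
For reference, a minimal timing loop that discards the first iteration could look like this (a sketch in the style of the eval_forwardTime.py script linked above; Variable(..., volatile=True) is the pre-0.4 idiom for inference, and `model` is assumed to be already built):

import time
import torch
from torch.autograd import Variable

# assumes `model` is already constructed, on the GPU, and in eval() mode
images = torch.randn(1, 3, 360, 640).cuda()
for i in range(20):
    inputs = Variable(images, volatile=True)  # disable autograd bookkeeping (PyTorch <= 0.3)
    torch.cuda.synchronize()
    start = time.time()
    outputs = model(inputs)
    torch.cuda.synchronize()  # CUDA kernels run asynchronously; wait for them to finish
    if i > 0:  # the first iteration pays the CUDA/cuDNN initialization cost, so skip it
        elapsed = time.time() - start
        print('forward time: %.4f s (%.1f FPS)' % (elapsed, 1.0 / elapsed))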

MrLinNing commented 5 years ago

@sacmehta Wow, thank you! It was the GPU frequency!

MrLinNing commented 5 years ago

Hi, did you run into this problem when testing ENet and ERFNet? I cannot fix this bug:

RuntimeError: cuda runtime error (7) : too many resources requested for launch at /pytorch/torch/lib/THCUNN/im2col.h:120
sacmehta commented 5 years ago

No, I didn’t encounter this issue.

PS: I encountered issues with bilinear interpolation on TX2, so you might want to use deconvolution for upsampling.
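
For example, a fixed 2x bilinear upsampling layer could be swapped for a learned deconvolution along these lines (a sketch; the channel count of 19 is illustrative, matching Cityscapes class maps):

from torch import nn

# fixed 2x bilinear upsampling, which caused issues on TX2
up_bilinear = nn.Upsample(scale_factor=2, mode='bilinear')

# learned 2x upsampling via deconvolution (transposed convolution) instead
up_deconv = nn.ConvTranspose2d(19, 19, kernel_size=2, stride=2, bias=False)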

tokyokuma commented 5 years ago

I trained ESPNetv2 on my own dataset. I modified gen_cityscapes.py to work with my data (768x432), but when I ran it, I only got about 5 FPS on the Jetson TX2, even though I ran jetson_clocks.sh to maximize the GPU frequency. According to the paper, an image of this size should run at more than 10 FPS, so I do not know where the problem is. Help me!

JetPack 3.1, Python 2.7, CUDA 8.0, cuDNN 6.0

Modified source code:

from __future__ import division
from __future__ import print_function
import numpy as np
import torch
import glob
import SegmentationModel as net
import time
import cv2
import os
from argparse import ArgumentParser
from torch import nn

pallete = [[153,153,153],
           [170,234,150],
           [220,220,  0],
           [107,142, 35],
           [152,251,152],
           [ 70,130,180],
           [220, 20, 60],
           [  0, 60,100],
           [150,250,250],
           [  0,  0,  0],
           [  0,  0,  0]]

def relabel(img):
    return img

def evaluateModel(args, model, image_list):
    # global mean and std values
    mean = [131.84157, 145.38597, 135.16437]
    std =  [76.013596, 67.85283,  70.89791 ]

    model.eval()
    for i, imgName in enumerate(image_list):
        img = cv2.imread(imgName)
        if args.overlay:
            img_orig = np.copy(img)

        start = time.time()
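        # note: this timer covers the CPU-side preprocessing below and the
        # GPU-to-CPU copy of the result, not just the network forward pass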
        img = img.astype(np.float32)
        # per-channel standardization with the global mean and std above
        for j in range(3):
            img[:, :, j] -= mean[j]
            img[:, :, j] /= std[j]

        img = cv2.resize(img, (args.inWidth, args.inHeight))
        if args.overlay:
            img_orig = cv2.resize(img_orig, (args.inWidth, args.inHeight))

        img /= 255
        img = img.transpose((2, 0, 1))
        img_tensor = torch.from_numpy(img)
        img_tensor = torch.unsqueeze(img_tensor, 0)  # add a batch dimension
        if args.gpu:
            img_tensor = img_tensor.cuda()
        img_out = model(img_tensor)

        classMap_numpy = img_out[0].max(0)[1].byte().cpu().data.numpy()
        # upsample the feature maps to the same size as the input image using Nearest neighbour interpolation
        # upsample the feature map from 1024x512 to 2048x1024
        #classMap_numpy = cv2.resize(classMap_numpy, (args.inWidth*2, args.inHeight*2), interpolation=cv2.INTER_NEAREST)
        if i % 100 == 0 and i > 0:
            print('Processed [{}/{}]'.format(i, len(image_list)))

        elapsed_time = time.time() - start
        print('time: {:.4f} s'.format(elapsed_time))

        name = imgName.split('/')[-1]
        if args.colored:
            classMap_numpy_color = np.zeros((img.shape[1], img.shape[2], img.shape[0]), dtype=np.uint8)
            for idx in range(len(pallete)):
                [r, g, b] = pallete[idx]
                classMap_numpy_color[classMap_numpy == idx] = [b, g, r]
            cv2.imwrite(args.savedir + os.sep + 'c_' + name.replace(args.img_extn, 'png'), classMap_numpy_color)
            if args.overlay:
                overlayed = cv2.addWeighted(img_orig, 0.5, classMap_numpy_color, 0.5, 0)
                cv2.imwrite(args.savedir + os.sep + 'over_' + name.replace(args.img_extn, 'jpg'), overlayed)

        if args.cityFormat:
            classMap_numpy = relabel(classMap_numpy.astype(np.uint8))

        cv2.imwrite(args.savedir + os.sep + name.replace(args.img_extn, 'png'), classMap_numpy)

def main(args):
    # read all the images in the folder
    image_list = glob.glob(args.data_dir + os.sep + '*.' + args.img_extn)

    modelA = net.EESPNet_Seg(args.classes, s=args.s)
    if not os.path.isfile(args.pretrained):
        print('Pre-trained model file does not exist. Please check ./pretrained_models folder')
        exit(-1)
    modelA = nn.DataParallel(modelA)
    modelA.load_state_dict(torch.load(args.pretrained))
    if args.gpu:
        modelA = modelA.cuda()

    # set to evaluation mode
    modelA.eval()

    if not os.path.isdir(args.savedir):
        os.mkdir(args.savedir)

    evaluateModel(args, modelA, image_list)

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--model', default="ESPNetv2", help='Model name')
    parser.add_argument('--data_dir', default="./izunuma", help='Data directory')
    parser.add_argument('--img_extn', default="png", help='RGB Image format')
    parser.add_argument('--inWidth', type=int, default=768, help='Width of RGB image')
    parser.add_argument('--inHeight', type=int, default=432, help='Height of RGB image')
    parser.add_argument('--savedir', default='./results', help='directory to save the results')
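    # note: argparse's type=bool treats any non-empty string as True, so the boolean
    # flags below cannot be switched off from the command line; edit the defaults instead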
    parser.add_argument('--gpu', default=True, type=bool, help='Run on CPU or GPU. If TRUE, then GPU.')
    parser.add_argument('--pretrained', default='../models/izunuma_dataset9_0.5/model_best.pth', help='Pretrained weights directory.')
    parser.add_argument('--s', default=0.5, type=float, help='scale')
    parser.add_argument('--cityFormat', default=True, type=bool, help='If you want to convert to cityscape '
                                                                       'original label ids')
    parser.add_argument('--colored', default=True, type=bool, help='If you want to visualize the '
                                                                   'segmentation masks in color')
    parser.add_argument('--overlay', default=True, type=bool, help='If you want to visualize the '
                                                                   'segmentation masks overlayed on top of RGB image')
    parser.add_argument('--classes', default=11, type=int, help='Number of classes in the dataset. 20 for Cityscapes')

    args = parser.parse_args()
    if args.overlay:
        args.colored = True # This has to be true if you want to overlay
    main(args)
sacmehta commented 5 years ago

Are you accounting for image reading and writing time? If so, discard that.

tokyokuma commented 5 years ago

Thank you! After excluding image reading and writing from the measured time, the processing speed improved to 8 FPS. However, it still has not reached 10 FPS.

What else have I missed?

sacmehta commented 5 years ago

I think the GPU frequencies are not set properly. Could you run the command below and then test the speed again?

sudo nvpmodel -m 0

tokyokuma commented 5 years ago

Sorry, I have a lot of questions. I tried sudo nvpmodel -m 0, but the processing speed still remained at 8.5 FPS.

I checked the status of the GPU with tegrastats during processing:

sudo ~/tegrastats
RAM 2969/7851MB (lfb 880x4MB) cpu [1%@2028,72%@2034,26%@2036,1%@2024,2%@2026,2%@2027] EMC 12%@1866 APE 150 GR3D 47%@1300

The GPU utilization (GR3D) is about 50%, so the GPU is not being used to the maximum. Does this indicate that my Python code is defective?

sacmehta commented 5 years ago

PyTorch has a one-time initialization cost that is very high, so you need to discard the execution time of your first frame. If you are not doing so, please discard it and check again.

The other thing worth trying is to pass just a random tensor to the model and measure the time, similar to the script mentioned at the beginning of this thread.
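
For example (hypothetical, matching the 768x432 input used here):

img_tensor = torch.randn(1, 3, 432, 768).cuda()  # random input with the shape of a 768x432 RGB frame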

tokyokuma commented 5 years ago

I tried passing just a random tensor to the model and measuring the time. The processing speed was almost the same for random tensors and real images. Perhaps this is the only part that should be measured?

start = time.time()
img_out = model(img_tensor)
classMap_numpy = img_out[0].max(0)[1].byte().cpu().data.numpy()
elapsed_time = time.time() - start

If so, it comes out to about 12 FPS.

sacmehta commented 5 years ago

That way you actually measure the GPU time, so you only need to time the model execution:

img_out = model(img_tensor)
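
Note that CUDA execution is asynchronous, so a reliable way to time just the forward pass is to synchronize before and after (a minimal sketch):

torch.cuda.synchronize()  # finish any pending GPU work before starting the clock
start = time.time()
img_out = model(img_tensor)
torch.cuda.synchronize()  # wait until the forward pass has actually completed on the GPU
elapsed_time = time.time() - start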

tokyokuma commented 5 years ago

Thank you! I had misunderstood and thought we should measure not only the inference itself but also the surrounding processing.