naisy / realtime_object_detection

Plug and Play Real-Time Object Detection App with Tensorflow and OpenCV. No Bugs No Worries. Enjoy!

Query on OD algo memory utilization in Xavier Developer kit #66

Open Niran89 opened 5 years ago

Niran89 commented 5 years ago

Dear Naisy,

Thank you for the great work !!!

I am currently working on an application that runs an SSD MobileNet based object detector on the NVIDIA AGX Xavier Developer Kit, flashed with JetPack 4.1.

The software specifications are as follows: Ubuntu 18.04, TensorFlow 1.13.0 (installed via "pip install --pre --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v411 tensorflow-gpu"), CUDA 10.0, cuDNN 7.3.1.20.

I have cloned the master branch of your repo and ran it on the AGX Xavier Developer Kit without any modifications to the code or the config file. The algorithm runs fine, but I noticed that it takes approximately 10GB (which is very large) of the 15.5GB RAM available on Xavier (memory increased from 2.4GB to 11.1GB while running the algorithm). I confirmed this by running "tegrastats" in the terminal. The clocks are set and the power mode is set to max on Xavier.

The same algorithm allocates only about 1.5-2GB on a GPU (GeForce GTX 1060) based laptop.

Can you help me understand why there is such a large increase in memory on Xavier? Is this expected behavior? Are there any changes to be made in the code specific to Xavier?

Also, is it possible to run 2 different instances of the code (with 2 cameras) on Xavier?

Thanks in Advance !!!

Regards, Niran

naisy commented 5 years ago

Hi @Niran89,

Jetson's (maybe AARCH64's) TensorFlow uses a lot of memory to prepare for CUDA. Since Xavier has enough memory, using 10GB is not a problem. In config.yml you can set worker_threads for Mask R-CNN. Please try values between 1 and 4; you can then see how much memory the Mask R-CNN detection part requires.

I do not use two cameras, so you will need to write the code.

Niran89 commented 5 years ago

Dear Naisy,

Thank you for the response.

The model I am using is ssd_mobilenet. Anyway, I will check the Mask R-CNN setting as you suggested.

Regarding the memory issue, I understand that using 10GB of memory is fine for a single instance. Since Xavier has almost 15GB of memory, my plan is to create 2 separate workspaces of the same algorithm and run 2 instances of it with 2 different cameras as input. I tried running 2 instances of the algorithm. The first instance runs fine but occupies almost 10GB out of 15GB. When I try to run the second instance with the remaining memory, the first instance gets killed or the entire system hangs due to memory pressure.

From various discussion forums, I found that using "config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)" together with "config.gpu_options.allow_growth = True" helps control memory usage. These two lines of code are already present in your code, but I am not sure they have a real effect in the current algorithm (I checked this by setting "allow_memory_growth=False" in "config.yml"). Also, the same algorithm takes only about 1.5-2GB of memory on GPU laptops.
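
For reference, the memory-control snippet I was referring to looks roughly like this (the per_process_gpu_memory_fraction line is an additional option I am considering, and 0.4 is only an illustrative value, not code from this repo):

    import tensorflow as tf

    config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    # Grow GPU memory on demand instead of grabbing it all at startup.
    config.gpu_options.allow_growth = True
    # Alternatively, hard-cap this process at a fraction of GPU memory so that
    # two instances could coexist (0.4 is just an illustrative value).
    config.gpu_options.per_process_gpu_memory_fraction = 0.4

    sess = tf.Session(config=config)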

As my objective is to run 2 instance of the OD algo as explained above, can you guide me how to control the memory allocation effectively in tensorflow ?

Thanks in advance !!!

Regards, Niran

naisy commented 5 years ago

Hi @Niran89,

Prediction input can take multiple images at once. See the sample code: https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb The input is described as follows.

      # Run inference
      output_dict = sess.run(tensor_dict,
                             feed_dict={image_tensor: np.expand_dims(image, 0)})

The np.expand_dims(image, 0) means [image], so the input is an array of images. Here you can pass multiple images at once, like [image1, image2, image3]. This is a remnant of mini-batch training.
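
As a rough sketch (frame1, frame2, frame3 are placeholder frames of the same size; sess, tensor_dict and image_tensor come from the tutorial code above), the batched input looks like this:

    import numpy as np

    # Stack same-sized frames into one batch of shape (N, height, width, 3).
    batch = np.stack([frame1, frame2, frame3], axis=0)

    # One sess.run() call now runs prediction on every frame in the batch.
    output_dict = sess.run(tensor_dict, feed_dict={image_tensor: batch})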

The output is described as follows.

      # all outputs are float32 numpy arrays, so convert types as appropriate
      output_dict['num_detections'] = int(output_dict['num_detections'][0])
      output_dict['detection_classes'] = output_dict[
          'detection_classes'][0].astype(np.uint8)
      output_dict['detection_boxes'] = output_dict['detection_boxes'][0]
      output_dict['detection_scores'] = output_dict['detection_scores'][0]

The [0] in output_dict['num_detections'][0] is the first element of the result array. If the input is [image1, image2, image3], the output is as follows.

    output1 = output_dict['num_detections'][0]
    output2 = output_dict['num_detections'][1]
    output3 = output_dict['num_detections'][2]

Do the same thing for num_detections, detection_classes, detection_boxes and detection_scores.
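
As a sketch, splitting the batched result back into per-image dictionaries can look like this (the results list and loop are only an illustration, not code from this repo):

    results = []
    # Every output tensor has the batch dimension first; index i selects image i.
    for i in range(batch.shape[0]):
        results.append({
            'num_detections': int(output_dict['num_detections'][i]),
            'detection_classes': output_dict['detection_classes'][i].astype(np.uint8),
            'detection_boxes': output_dict['detection_boxes'][i],
            'detection_scores': output_dict['detection_scores'][i],
        })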

As far as memory permits, you can use many images at once for input.

Niran89 commented 5 years ago

Dear Naisy,

Thank you for your inputs.

I have tried passing images in a batch to sess.run(). For now, I am feeding and processing 2 images per batch and was able to extract the output and display it as well. RAM consumption is as expected. On the other hand, I see a drop in overall algorithm performance (almost a 10-15 fps drop compared to passing a single image as input).

Is such a large drop in performance expected when processing images in batches? Should the GPU and CPU be handled differently for this approach?

Regards, Niran

naisy commented 5 years ago

Hi @Niran89,

Prediction with multiple inputs should be faster than predicting one by one. If it is slow, I think there is a bottleneck in some other part. Can you check only the time of sess.run()?
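
For example, something like this around the inference call only (the timing wrapper is just an illustration, reusing the batched input from the earlier snippet):

    import time

    start = time.time()
    output_dict = sess.run(tensor_dict, feed_dict={image_tensor: batch})
    elapsed = time.time() - start
    print('sess.run() took %.1f ms for %d images' % (elapsed * 1000.0, batch.shape[0]))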