leggedrobotics / wild_visual_navigation

Wild Visual Navigation: A system for fast traversability learning via pre-trained models and online self-supervision
https://sites.google.com/leggedrobotics.com/wild-visual-navigation
MIT License

Decoupling Feature Extraction from Rest of Pipeline #229

Closed JonasFrey96 closed 6 months ago

JonasFrey96 commented 1 year ago

Maybe also run the small MLP, whose weights are synced from our WVN learning code, allowing for low latency.

wild_visual_navigation_ros

Responsibility: training and graph handling

Input: segment masks, features (fed to the MLP)
Output: MLP weights

wild_visual_navigation_runtime

Responsibility: feature extraction and output publishing

Input: 3x camera images, MLP weights and threshold
Output: segment masks, features, traversability and confidence
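To make the weight sync concrete, here is a minimal sketch of how the runtime node could receive the MLP weights from the learning node, assuming the weights are shipped as a flattened `std_msgs/Float32MultiArray` and the small MLP is a plain torch module; the topic, message type, and layer sizes are illustrative assumptions, not the actual WVN interface.

```python
# Hypothetical weight-sync sketch between the learning node and the runtime node:
# the learning node flattens the MLP parameters into a Float32MultiArray, the
# runtime node copies them back into its local MLP before inference.
import torch
from std_msgs.msg import Float32MultiArray

class SmallMLP(torch.nn.Module):
    # Layer sizes are placeholders, not the real WVN architecture.
    def __init__(self, dim_in=90, dim_hidden=256, dim_out=1):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim_in, dim_hidden), torch.nn.ReLU(),
            torch.nn.Linear(dim_hidden, dim_out),
        )

    def forward(self, x):
        return self.net(x)

def pack_weights(model):
    # Learning node: flatten all parameters into one vector for publishing.
    with torch.no_grad():
        flat = torch.cat([p.flatten() for p in model.parameters()])
    return Float32MultiArray(data=flat.cpu().tolist())

def unpack_weights(model, msg):
    # Runtime node: copy the received flat vector back into its own MLP copy.
    flat = torch.tensor(msg.data)
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_(flat[offset:offset + n].view_as(p).to(p.device))
            offset += n
```

The learning node would publish `pack_weights(model)` after each training step, and the runtime node would call `unpack_weights` in its subscriber callback, so inference always uses the latest weights without blocking on training.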

JonasFrey96 commented 1 year ago

This may be the biggest performance gain we can achieve. Create a demo Python node in which we just run DINO and nothing else, and measure the throughput. This would show the maximum inference rate we can achieve, independent of all the learning parts.
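As a starting point, a standalone probe like the one below would give that number; it uses the DINO ViT-S/8 weights from torch.hub as a stand-in backbone and dummy 224x224 inputs, so treat it as a sketch of the measurement rather than the real WVN feature extractor.

```python
# Offline throughput probe: run only the DINO backbone on dummy inputs and
# report forward passes per second (no ROS, no learning, no image decoding).
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.hub.load("facebookresearch/dino:main", "dino_vits8").to(device).eval()

x = torch.rand(1, 3, 224, 224, device=device)
with torch.no_grad():
    for _ in range(10):  # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    n, t0 = 100, time.perf_counter()
    for _ in range(n):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    dt = time.perf_counter() - t0

print(f"{n / dt:.1f} forward passes / s ({1000 * dt / n:.2f} ms each)")
```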

JonasFrey96 commented 1 year ago

Did a first experiment. I wrote a feature extraction node that is capable of running at 22 Hz on my laptop. The forward pass (shown on the left) takes roughly 30 ms of the callback; the output frequency, when just publishing a single float value to measure the node's performance, is a stable 22.5 Hz. The input frequency of the images is set to 115 Hz, currently using compressed images.

(screenshot attached)
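For reference, the probe node roughly follows this pattern (a sketch with hypothetical topic names, and the torch.hub DINO again standing in for the actual extractor): decode the compressed image, run the backbone, and publish a single float so that `rostopic hz` reflects the achievable output rate.

```python
# Minimal ROS probe node: subscribe to the compressed camera stream, run only
# the feature extractor, time the forward pass, and publish one float per frame.
# Topic names and the backbone are assumptions, not the actual WVN node.
import time
import cv2
import numpy as np
import rospy
import torch
import torch.nn.functional as F
from sensor_msgs.msg import CompressedImage
from std_msgs.msg import Float32

def main():
    rospy.init_node("feature_extraction_probe")
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.hub.load("facebookresearch/dino:main", "dino_vits8").to(device).eval()
    pub = rospy.Publisher("~forward_time", Float32, queue_size=1)

    def callback(msg):
        # Decode the compressed image, move it to the GPU, resize, run the backbone.
        img = cv2.imdecode(np.frombuffer(msg.data, np.uint8), cv2.IMREAD_COLOR)
        x = torch.from_numpy(img).to(device).permute(2, 0, 1).float()[None] / 255.0
        t0 = time.perf_counter()
        with torch.no_grad():
            model(F.interpolate(x, (224, 224), mode="bilinear"))
        if device == "cuda":
            torch.cuda.synchronize()
        # Publishing a single float lets `rostopic hz` report the output rate.
        pub.publish(Float32(data=time.perf_counter() - t0))

    rospy.Subscriber("/camera/image_raw/compressed", CompressedImage, callback, queue_size=1)
    rospy.spin()

if __name__ == "__main__":
    main()
```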

JonasFrey96 commented 1 year ago

Interestingly, the bottleneck is moving the image onto the GPU and resizing it. Could we reconfigure the Alphasense driver to publish a downscaled version of the image? (screenshot attached) If we integrate this change, we can easily run all 3 cameras, provided we split into a hot-path feature extraction node and a learning node.
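A quick way to sanity-check that is a micro-benchmark comparing the two orderings; the snippet below is only an illustration (it assumes CUDA is available and uses a random 1080x1440 frame), but it isolates exactly the upload-plus-resize cost that a driver-side downscale would remove.

```python
# Compare: (a) upload the full 1080x1440 frame and resize on the GPU vs.
# (b) resize on the CPU first and upload only the 224x224 image.
# Assumes a CUDA device; timings are illustrative and hardware-dependent.
import time
import cv2
import numpy as np
import torch
import torch.nn.functional as F

img = (np.random.rand(1080, 1440, 3) * 255).astype(np.uint8)

def to_gpu_tensor(arr):
    return torch.from_numpy(arr).cuda().permute(2, 0, 1).float()[None] / 255.0

def full_res_upload_then_gpu_resize():
    F.interpolate(to_gpu_tensor(img), (224, 224), mode="bilinear")

def cpu_resize_then_upload():
    to_gpu_tensor(cv2.resize(img, (224, 224)))

def timeit(fn, n=100):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        fn()
    torch.cuda.synchronize()
    return 1000 * (time.perf_counter() - t0) / n

print(f"full-res upload + GPU resize: {timeit(full_res_upload_then_gpu_resize):.2f} ms")
print(f"CPU resize + small upload:    {timeit(cpu_resize_then_upload):.2f} ms")
```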

mmattamala commented 1 year ago

That's interesting and looks really promising. Is the code somewhere I can take a look?

A few comments:

JonasFrey96 commented 1 year ago

I will push it to a branch inheriting from devel in a moment. This currently uses the full resolution of the std_msgs/Image, which is 1080 x 1440; we first have to convert it to OpenCV, then to torch (GPU), and then rescale to (224, 224).

  1. I will now try to rescale first and then move to GPU/Torch.
  2. The bag is sped up. I wanted to test what happens to the throughput if we overload the node.
  3. We are using 10 Hz - yes, the Orin should take about the same feature extraction time.
  4. Yes, fully right, we have to load torch twice, but Maurice should just buy you a new laptop :)

Okay so now the idea would be:

Feature extraction node:

Input:

Output:

Learning node

Input:

- Proprioception to create the supervision graph, as we had
- Extracted features and segmentation mask (synchronized; or this could be a CustomMessage consisting of the MultiArray + an Int32 image)

Output:

- Visualization of the path
- MLP weights
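If we go with separate messages instead of the custom one, the learning node could synchronize the feature array and the segmentation mask with `message_filters`; the sketch below assumes hypothetical topic names and a `Float32MultiArray` + `Image` pair rather than whatever message we finally define.

```python
# Learning-node input side (sketch): approximately synchronize the per-segment
# feature array and the segmentation mask published by the feature extraction node.
# Topic names and message types are assumptions, not the final interface.
import message_filters
import rospy
from sensor_msgs.msg import Image
from std_msgs.msg import Float32MultiArray

def features_and_mask_callback(feature_msg, mask_msg):
    # feature_msg: (num_segments x feature_dim) features, flattened
    # mask_msg:    per-pixel segment index image
    rospy.loginfo("received %d feature values and a %dx%d mask",
                  len(feature_msg.data), mask_msg.height, mask_msg.width)

def main():
    rospy.init_node("wvn_learning_node")
    feat_sub = message_filters.Subscriber("/wvn/features", Float32MultiArray)
    mask_sub = message_filters.Subscriber("/wvn/segmentation_mask", Image)
    # allow_headerless is needed because Float32MultiArray has no timestamp;
    # a custom message with a header would avoid this.
    sync = message_filters.ApproximateTimeSynchronizer(
        [feat_sub, mask_sub], queue_size=10, slop=0.05, allow_headerless=True)
    sync.registerCallback(features_and_mask_callback)
    rospy.spin()

if __name__ == "__main__":
    main()
```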

JonasFrey96 commented 1 year ago

Code under development here: https://github.com/leggedrobotics/wild_visual_navigation/tree/dev/two_node_solution

TODOs

JonasFrey96 commented 6 months ago

done