katie-hughes opened 1 year ago
Decided on using the onboard cameras for this!
Current plan: Use the onboard cameras on the Unitree along with Marno's people-detection YOLO code. The pipeline would be: unitree_camera package -> published images -> YOLO -> people bounding boxes. An issue with this is that the cameras do not have full 360-degree coverage, but to start I will just use the front-facing camera.
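Roughly what that node would look like (the topic names are placeholders since I haven't pinned down exactly what unitree_camera publishes, and the stock YOLOv5 hub model is just a stand-in here for the actual detector):

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from geometry_msgs.msg import PointStamped
from cv_bridge import CvBridge
import torch


class PersonDetector(Node):
    def __init__(self):
        super().__init__('person_detector')
        self.bridge = CvBridge()
        # Stand-in detector: YOLOv5s from torch hub. COCO class 0 is "person".
        self.model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
        # Topic names are guesses; adjust to whatever unitree_camera publishes.
        self.sub = self.create_subscription(
            Image, '/camera/left/image_raw', self.on_image, 10)
        self.pub = self.create_publisher(PointStamped, '/person_pixel', 10)

    def on_image(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')
        dets = self.model(frame).xyxy[0]          # rows of [x1, y1, x2, y2, conf, cls]
        for x1, y1, x2, y2, conf, cls in dets.tolist():
            if int(cls) != 0:                     # keep only "person" detections
                continue
            out = PointStamped()
            out.header = msg.header
            out.point.x = (x1 + x2) / 2.0         # centroid pixel column (u)
            out.point.y = (y1 + y2) / 2.0         # centroid pixel row (v)
            self.pub.publish(out)


def main():
    rclpy.init()
    rclpy.spin(PersonDetector())
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```

Publishing one PointStamped per detection is just a placeholder; a custom message carrying the bounding box and which camera it came from would probably be cleaner.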
The above video (and the testing I'm doing so far) is using the COCO dataset. This is definitely overkill for the final application, and my goal is to train a new YOLOv7 model for people detection only. The COCO-trained model takes on average 0.5 seconds to evaluate an image, and I think I need to evaluate both the left and right images. The model trained on Marno's guide dog dataset takes around 0.08 seconds per image, which is a speed I can actually work with, but it is less accurate.
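A quick way to compare the two models (not necessarily how the numbers above were produced, but the same idea) is just to average wall-clock time around the detector call:

```python
import time
import statistics


def mean_latency(model, images, warmup=5):
    """Average per-image inference time in seconds.
    `model` is any callable that takes a single image (e.g. a YOLO wrapper)."""
    for img in images[:warmup]:
        model(img)                                # warm-up runs (GPU init, caches)
    samples = []
    for img in images:
        t0 = time.perf_counter()
        model(img)
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples)
```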
Another thing I am currently working on is how to get the depth properly (and, as a result, real-world coordinates). The depth image from the Unitree is really messed up and has lots of gaps. Additionally, I have image streams from the left and right cameras. My naive approach was to take the pixel location of the centroid of a person's bounding box from, say, the left image only and look up that same pixel in the depth image. The issue is that the two are not rectified, and the origins may differ because the depth image incorporates both the left and right camera feeds.

Another option is to detect the person in both the left and right images and do the stereo calculation myself to get the depth. I don't think this would be computationally difficult, but I don't know if it will be robust enough.

The final thing I am thinking about is how to get the x, y coordinates of the person once I have the depth (z). This might take some work, as I don't have the camera calibration parameters explicitly and it is unclear how to extract them from UnitreeCameraSDK. I might have to do the calibration manually to get these.
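If I go the do-it-myself route, the math is not bad once the images are rectified and I have the intrinsics, which is exactly the part I'm missing. A sketch (the fx/fy/cx/cy/baseline values below are made up; the real ones would come from calibration):

```python
import numpy as np

# Made-up intrinsics/baseline just so the snippet runs.
FX, FY, CX, CY = 460.0, 460.0, 464.0, 400.0   # focal lengths / principal point (px)
BASELINE = 0.05                               # left-right camera separation (m)


def stereo_depth(u_left, u_right, fx=FX, baseline=BASELINE):
    """Depth from a rectified stereo match: Z = fx * B / disparity."""
    disparity = float(u_left - u_right)
    return fx * baseline / disparity


def backproject(u, v, z, fx=FX, fy=FY, cx=CX, cy=CY):
    """Pixel (u, v) at depth z -> (x, y, z) in the camera frame (pinhole model)."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])


# e.g. the person's centroid lands at u=512 in the left image and u=489 in the right
z = stereo_depth(512.0, 489.0)
point = backproject(512.0, 300.0, z)
```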
I completed the camera calibration (and created a separate issue for that workflow). My idea is to publish the pixel locations corresponding to people, and then a C++ node can process these and convert them to real-world coordinates using the image_geometry package.
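The conversion step itself is only a few lines with image_geometry (sketched in Python here for brevity; the C++ PinholeCameraModel has the same methods). The CameraInfo would come from the calibration I just finished:

```python
from image_geometry import PinholeCameraModel


def pixel_to_point(camera_info, u, v, depth):
    """Turn a person's centroid pixel plus a depth reading into a 3D point
    in the camera's optical frame."""
    model = PinholeCameraModel()
    model.fromCameraInfo(camera_info)        # sensor_msgs/CameraInfo from calibration
    ray = model.projectPixelTo3dRay((u, v))  # ray from the camera through the pixel
    scale = depth / ray[2]                   # stretch the ray so its z equals the depth
    return (ray[0] * scale, ray[1] * scale, ray[2] * scale)
```

The result is in the camera's optical frame, so it would still need to be transformed into whatever frame the rest of the stack uses.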
I have the whole pipeline in place (YOLO object detection -> publish centroids -> convert to 3D position, which can work for all 3 cameras). The problem is that it is just super slow. There is no way to smooth out the trajectory because I receive points at such a low frequency.
One method is to use the Unitree's cameras to detect people. Here is a method I found in OpenCV to place bounding boxes around people: https://thedatafrog.com/en/articles/human-detection-video/ . I could also use an ML model, for example YOLO: https://medium.com/@luanaebio/detecting-people-with-yolo-and-opencv-5c1f9bc6a810
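The OpenCV route needs no training at all; it is the built-in HOG + linear SVM pedestrian detector from the first link. A minimal version, assuming any cv2 video source:

```python
import cv2

# OpenCV's built-in HOG + linear SVM pedestrian detector (no training needed).
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

cap = cv2.VideoCapture(0)          # any video source; 0 = default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8))
    for (x, y, w, h) in boxes:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow('people', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
```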
Issues with using the cameras:
I can also use the LIDAR to detect people. This package (https://github.com/ethz-asl/dynablox, recommended by Muchen) seems close to what I would need, but it is for ROS 1. Here is another code set I found: https://pcl.readthedocs.io/projects/tutorials/en/latest/gpu_people.html
Issues with using the LIDAR:
At the end of the day, I want to publish a topic that contains the real-world locations of the people detected around the robot.