katie-hughes opened 1 year ago
Decided on using the onboard cameras for this!
Current plan: Use the onboard cameras on the Unitree along with Marno's people-detection YOLO code. The pipeline would be: unitree_camera package -> published images -> YOLO -> people bounding boxes. An issue with this is that the cameras do not have full 360-degree coverage, but to start I will just use the front-facing camera.
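Roughly what that node would look like (the topic names are placeholders since I haven't pinned down exactly what unitree_camera publishes, and the stock YOLOv5 hub model is just a stand-in here for the actual detector):

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from geometry_msgs.msg import PointStamped
from cv_bridge import CvBridge
import torch


class PersonDetector(Node):
    def __init__(self):
        super().__init__('person_detector')
        self.bridge = CvBridge()
        # Stand-in detector: YOLOv5s from torch hub. COCO class 0 is "person".
        self.model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
        # Topic names are guesses; adjust to whatever unitree_camera publishes.
        self.sub = self.create_subscription(
            Image, '/camera/left/image_raw', self.on_image, 10)
        self.pub = self.create_publisher(PointStamped, '/person_pixel', 10)

    def on_image(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')
        dets = self.model(frame).xyxy[0]          # rows of [x1, y1, x2, y2, conf, cls]
        for x1, y1, x2, y2, conf, cls in dets.tolist():
            if int(cls) != 0:                     # keep only "person" detections
                continue
            out = PointStamped()
            out.header = msg.header
            out.point.x = (x1 + x2) / 2.0         # centroid pixel column (u)
            out.point.y = (y1 + y2) / 2.0         # centroid pixel row (v)
            self.pub.publish(out)


def main():
    rclpy.init()
    rclpy.spin(PersonDetector())
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```

Publishing one PointStamped per detection is just a placeholder; a custom message carrying the bounding box and which camera it came from would probably be cleaner.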
The above video (and the testing I'm doing so far) is using the COCO dataset. This is definitely overkill for the final application, and my goal is to train a new YOLOv7 model for people detection only. The COCO-trained model takes on average 0.5 seconds to evaluate an image, and I think I need to evaluate both the left and right images. The model trained on Marno's guide dog dataset takes around 0.08 seconds per image, which is a speed I can actually work with, but it is less accurate.
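A quick way to compare the two models (not necessarily how the numbers above were produced, but the same idea) is just to average wall-clock time around the detector call:

```python
import time
import statistics


def mean_latency(model, images, warmup=5):
    """Average per-image inference time in seconds.
    `model` is any callable that takes a single image (e.g. a YOLO wrapper)."""
    for img in images[:warmup]:
        model(img)                                # warm-up runs (GPU init, caches)
    samples = []
    for img in images:
        t0 = time.perf_counter()
        model(img)
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples)
```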
Another thing I am currently working on is how to get the depth properly (and, as a result, real-world coordinates). The depth image from the Unitree is really messed up and has lots of gaps. Additionally, I have image streams from the left and right cameras. My naive approach was to take the pixel location of the centroid of a person's bounding box from, say, the left image only and look up that same pixel in the depth image. The issue is that the two are not rectified, and the origins may differ because the depth image incorporates both the left and right camera feeds.

Another option is to detect the person in both the left and right images and do the stereo calculation myself to get the depth. I don't think this would be computationally difficult, but I don't know if it will be robust enough.

The final thing I am thinking about is how to get the x, y coordinates of the person once I have the depth (z). This might take some work, as I don't have the camera calibration parameters explicitly and it is unclear how to extract them from UnitreeCameraSDK. I might have to do the calibration manually to get these.
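If I go the do-it-myself route, the math is not bad once the images are rectified and I have the intrinsics, which is exactly the part I'm missing. A sketch (the fx/fy/cx/cy/baseline values below are made up; the real ones would come from calibration):

```python
import numpy as np

# Made-up intrinsics/baseline just so the snippet runs.
FX, FY, CX, CY = 460.0, 460.0, 464.0, 400.0   # focal lengths / principal point (px)
BASELINE = 0.05                               # left-right camera separation (m)


def stereo_depth(u_left, u_right, fx=FX, baseline=BASELINE):
    """Depth from a rectified stereo match: Z = fx * B / disparity."""
    disparity = float(u_left - u_right)
    return fx * baseline / disparity


def backproject(u, v, z, fx=FX, fy=FY, cx=CX, cy=CY):
    """Pixel (u, v) at depth z -> (x, y, z) in the camera frame (pinhole model)."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])


# e.g. the person's centroid lands at u=512 in the left image and u=489 in the right
z = stereo_depth(512.0, 489.0)
point = backproject(512.0, 300.0, z)
```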
I completed the camera calibration (and created a separate issue for that workflow). My idea is to publish the pixel locations corresponding to people, and then a C++ node can process these and convert them to real-world coordinates using the image_geometry package.
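The conversion step itself is only a few lines with image_geometry (sketched in Python here for brevity; the C++ PinholeCameraModel has the same methods). The CameraInfo would come from the calibration I just finished:

```python
from image_geometry import PinholeCameraModel


def pixel_to_point(camera_info, u, v, depth):
    """Turn a person's centroid pixel plus a depth reading into a 3D point
    in the camera's optical frame."""
    model = PinholeCameraModel()
    model.fromCameraInfo(camera_info)        # sensor_msgs/CameraInfo from calibration
    ray = model.projectPixelTo3dRay((u, v))  # ray from the camera through the pixel
    scale = depth / ray[2]                   # stretch the ray so its z equals the depth
    return (ray[0] * scale, ray[1] * scale, ray[2] * scale)
```

The result is in the camera's optical frame, so it would still need to be transformed into whatever frame the rest of the stack uses.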
I have the whole pipeline in place (YOLO object detection -> publish centroids -> convert to 3D position, which can work for all 3 cameras). The problem is that it is just super slow. There is no way to smooth out the trajectory because I receive points at such a low frequency.
One method is to use the Unitree's cameras to detect people. Here is a method I found in OpenCV to place bounding boxes around people: https://thedatafrog.com/en/articles/human-detection-video/ . I could also use an ML model, for example YOLO: https://medium.com/@luanaebio/detecting-people-with-yolo-and-opencv-5c1f9bc6a810
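The OpenCV route needs no training at all; it is the built-in HOG + linear SVM pedestrian detector from the first link. A minimal version, assuming any cv2 video source:

```python
import cv2

# OpenCV's built-in HOG + linear SVM pedestrian detector (no training needed).
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

cap = cv2.VideoCapture(0)          # any video source; 0 = default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8))
    for (x, y, w, h) in boxes:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow('people', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
```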
Issues with using the cameras:
I can also use the LIDAR to detect people. This package (https://github.com/ethz-asl/dynablox, recommended by Muchen) seems close to what I would need, but it is for ROS 1. Here is another code set I found: https://pcl.readthedocs.io/projects/tutorials/en/latest/gpu_people.html
Issues with using the LIDAR:
At the end of the day, I want to publish a topic that contains the real-world locations of the people detected around the robot.