markisus / pytagmapper

Python Mapping library for Fiducial Tags
MIT License

Build_map.py is stuck, meaning of get_avg_detection_error #5

Closed AndrasZabo closed 3 months ago

AndrasZabo commented 5 months ago

Hello markisus, thanks for your nice project! I started working with your code (pytagmapper_tools/build_map.py and show_map.py) and have run into some difficulties. Hopefully you can help with some answers. Here I explain what I'm doing:

Do you have any idea where the problem lies? Also, could you please explain what get_avg_detection_error does and what this error means?

Thanks a lot!

markisus commented 5 months ago

The program basically works by making a guess of every camera location and every tag location. Once it has these guesses, it creates a virtual camera for every image and makes virtual detections of every tag, and checks how well those virtual detections match the actual detections you provided. This is a classic SLAM pose graph architecture.

get_avg_detection_error is the average pixel difference between the virtual detections and the actual detections. The optimizer works by jiggling all the guesses so that the virtual detections line up more closely with the actual detections.
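
For illustration, a minimal sketch of what such an average detection error could look like (this shows the idea only, not the exact implementation in pytagmapper; the array shapes are assumptions):

```python
import numpy as np

def avg_detection_error(virtual_corners, detected_corners):
    """Mean pixel distance between predicted (virtual) and detected tag corners.

    Both arguments are (N, 2) arrays: one row of (u, v) pixel coordinates per
    tag corner, in the same order.
    """
    virtual_corners = np.asarray(virtual_corners, dtype=float)
    detected_corners = np.asarray(detected_corners, dtype=float)
    # Euclidean distance per corner, averaged over all corners
    per_corner = np.linalg.norm(virtual_corners - detected_corners, axis=1)
    return per_corner.mean()

# Example: a 2-pixel offset on every corner gives an average error of 2.0
print(avg_detection_error([[10, 10], [20, 10]], [[12, 10], [22, 10]]))
```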

Recently, some users have been reporting large errors and a lack of convergence like yours. Could you tell me what type of camera you are using? It is very important for the inputs to pytagmapper to be calibrated and undistorted. It is also important that the camera does not have any kind of autofocus that could change the camera parameters while you are taking your images.
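
For reference, a minimal undistortion sketch along these lines, using OpenCV (the intrinsics below are placeholders; use the camera matrix and distortion coefficients from your own calibration):

```python
import cv2
import numpy as np

# Placeholder intrinsics -- replace with the values from your own calibration
camera_matrix = np.array([[800.0,   0.0, 320.0],
                          [  0.0, 800.0, 240.0],
                          [  0.0,   0.0,   1.0]])
distortion_coefficients = np.array([0.1, -0.05, 0.0, 0.0, 0.0])  # k1, k2, p1, p2, k3

cap = cv2.VideoCapture(0)  # e.g. the webcam used for capturing the tag images
ok, frame = cap.read()
if ok:
    # Remove lens distortion before running tag detection / map building
    undistorted = cv2.undistort(frame, camera_matrix, distortion_coefficients)
    cv2.imwrite("undistorted.png", undistorted)
cap.release()
```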

AndrasZabo commented 5 months ago

Thanks for the quick reply, and sorry for the late answer on my side; I was busy with other projects. So:

  1. I'm using a Logitech C270 webcam. I did the calibration (based on that link), so I have the camera matrix, and I also undistort the frames before doing any further analysis, detection, etc. I do that using cv2.undistort(frame, camera_matrix, distortion_coefficients). At the moment I'm testing other cameras.

  2. Regarding the first section of your answer, am I right that it works like this:

    • on every image it detects the markers (real detections)
    • based on these detections it guesses the location of the camera for every image (virtual camera)
    • from the virtual camera it predicts where the tags should appear (virtual detections), again for every image
    • finally it compares the virtual and real detections, iterates from that towards the best camera positions, and also links the different images together based on markers that are seen in both images (I sketch my understanding of these steps after this list). This final step is not so clear to me; if you have any recommendations on what to read or check in order to understand your code better, please send them. Thank you!
  3. Finally, could you please give me some insight into the crucial points when selecting frames from a video stream to analyze? E.g. how different or similar the viewpoints should be (frames from similar or different angles), how many markers should be detected in one frame to successfully build the map, how many frames I need per marker, etc.?
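
To make sure I understand, here is a rough sketch of how I picture the "virtual camera" and "virtual detection" steps for a single tag in a single image, using OpenCV's solvePnP and projectPoints (just my own illustration with made-up numbers, not your code):

```python
import cv2
import numpy as np

# Assumed inputs: the four tag corners in the tag's own frame (metres) and the
# same four corners as detected in one undistorted image (pixels).
tag_size = 0.05
tag_corners_3d = np.array([[-1, -1, 0], [1, -1, 0], [1, 1, 0], [-1, 1, 0]],
                          dtype=float) * (tag_size / 2)
detected_corners = np.array([[310, 250], [370, 252], [368, 310], [308, 308]],
                            dtype=float)

camera_matrix = np.array([[800.0,   0.0, 320.0],
                          [  0.0, 800.0, 240.0],
                          [  0.0,   0.0,   1.0]])
dist_coeffs = np.zeros(5)  # frames are assumed to be undistorted already

# "Virtual camera": guess the camera pose relative to the tag from one detection
ok, rvec, tvec = cv2.solvePnP(tag_corners_3d, detected_corners,
                              camera_matrix, dist_coeffs)

# "Virtual detections": reproject the tag corners using the guessed pose
virtual_corners, _ = cv2.projectPoints(tag_corners_3d, rvec, tvec,
                                       camera_matrix, dist_coeffs)
virtual_corners = virtual_corners.reshape(-1, 2)

# Compare virtual vs. real detections (same idea as get_avg_detection_error)
error = np.linalg.norm(virtual_corners - detected_corners, axis=1).mean()
print("average reprojection error [px]:", error)
```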

Thanks a lot!

markisus commented 4 months ago
  1. This sounds right. If you attach your data set to this thread in a zip file, I can take a look and try to investigate what is going wrong.
  2. This is a correct high-level understanding. To understand the specifics, you will need to get more familiar with the pose graph framework commonly used in SLAM. This paper https://dellaert.github.io/files/Dellaert06ijrr.pdf gives an overview; the basics are covered up to page 8, after which the authors go into their custom solver (square root SAM), which differs from my implementation (Gaussian belief propagation). You can find out more about Gaussian belief propagation here: https://gaussianbp.github.io/ . The final piece of the puzzle is the mathematics that allows you to take derivatives with respect to a 3D pose. This is differential geometry, and the relevant details are covered here: https://arxiv.org/abs/1812.01537.

To match up with common terminology in SLAM literature, you just need to know that in pytagmapper, the "robot poses" are camera positions, and the "landmarks" are tag positions. The "factors" correspond to tag detections, one for each detected tag in each image.
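
As a toy illustration of the pose update that those derivatives feed into (this is not pytagmapper's actual solver, just the idea from the Lie theory reference above): a small se(3) increment applied to a 4x4 camera pose via the matrix exponential.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

def se3_hat(xi):
    """Map a 6-vector (rotation part w, translation part v) to a 4x4 twist matrix."""
    wx, wy, wz, vx, vy, vz = xi
    return np.array([[0.0, -wz,  wy, vx],
                     [ wz, 0.0, -wx, vy],
                     [-wy,  wx, 0.0, vz],
                     [0.0, 0.0, 0.0, 0.0]])

def apply_increment(pose, xi):
    """Left-multiply a 4x4 pose by the exponential of a small se(3) increment."""
    return expm(se3_hat(xi)) @ pose

# Example: nudge an identity camera pose by a small rotation about z and a
# small translation along x, the kind of step an optimizer takes each iteration.
camera_pose = np.eye(4)
delta = np.array([0.0, 0.0, 0.01, 0.005, 0.0, 0.0])  # (wx, wy, wz, vx, vy, vz)
camera_pose = apply_increment(camera_pose, delta)
print(camera_pose)
```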

  3. If you have understood the pose graph concept: for the map to be built correctly, it is crucial that the final pose graph is connected (a simple connectivity check is sketched below). Mathematically, the more angles you have, the better the final estimates will be. However, too many detections / viewpoints mean the computer has to do a lot more work, which slows down solving. Furthermore, too many detections complicate the optimization landscape and may create local minima that cause the optimizer to get stuck, so this is more an art than a science. It will also help if all your markers lie on the same plane, in which case you should optimize using --mode 2d.
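
A minimal sketch of that connectivity check, assuming you have the set of tag ids detected in each image (the image names and detections below are made up; this is not part of pytagmapper):

```python
# Two images are linked if they share at least one detected tag; the map can
# only be built when every image ends up in the same connected component.
detections = {
    "image_0": {0, 1},
    "image_1": {1, 2},
    "image_2": {7},  # shares no tag with the others -> disconnected
}

def connected_components(detections):
    remaining = set(detections)
    components = []
    while remaining:
        # Grow one component from an arbitrary seed image
        seed = remaining.pop()
        component, tags = {seed}, set(detections[seed])
        changed = True
        while changed:
            changed = False
            for image in list(remaining):
                if detections[image] & tags:
                    component.add(image)
                    tags |= detections[image]
                    remaining.discard(image)
                    changed = True
        components.append(component)
    return components

components = connected_components(detections)
if len(components) > 1:
    print("Warning: the pose graph is disconnected:", components)
```
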
AndrasZabo commented 4 months ago

Thanks for the reply! At the moment I'm testing different cameras and they seem to work better. I'll reply in detail once I have results. Thanks!

AndrasZabo commented 3 months ago

Hello markisus! Using a better camera solved part of the problem. In addition, I added a loop that searches for images that make the optimization get stuck; after removing them from the input to the map building, it executes nicely and builds the map. Thanks for your help, and for the nice project!