sair-lab / AirLine

AirLine: Efficient Learnable Line Detection with Local Edge Voting (IROS 2023)
https://sairlab.org/airline
BSD 3-Clause "New" or "Revised" License

How are predictions matched to the ground truth before calculating LP? #4

Closed ogencoglu closed 1 year ago

ogencoglu commented 1 year ago

Thanks for the great work, especially introducing the LP evaluation metric.

How are predicted line segments matched to the ground truth segments before calculating the LP evaluation metrics? What happens when the number of predicted segments is significantly higher or lower than the number of ground truth segments? Do you utilize some bipartite matching such as the Hungarian algorithm?

In fact, this question is relevant for any metric (sAP, LP, etc.).

Lx017 commented 1 year ago

Thanks for your insight. You are right about the influence of the number of predicted line segments; this was one of the most significant issues we encountered when evaluating methods.

We use LP, which by definition discards the number of line segments and emphasizes how precisely the predicted lines cover the ground truth, so it is essentially a pixel-level evaluation, not a line-level one. The reason we give up on segment counts is that, given the differing conventions and annotation definitions across datasets, a single GT line could match multiple detected lines, and vice versa.

So under our settings, nothing special happens when there is a large difference in line count between GT and prediction, whether in evaluation or training. Our essential idea is that the number of lines is neither necessary for training nor informative for quantitative evaluation when comparing various methods such as LSD.

Not sure if this answers your concerns; I would be happy to follow up!

ogencoglu commented 1 year ago

Thanks for your swift reply.

So if I understand correctly, LP essentially works with image-level segmentation masks: it is some variant of pixel accuracy or IoU (Intersection over Union) where the predictions are dilated.
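For concreteness, here is a minimal sketch of what I understand LP to compute, assuming it is the fraction of ground-truth line pixels covered by the dilated prediction mask (the function name, mask convention, and dilation radius `k` are my assumptions, not from the paper):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def lp_score(pred_mask: np.ndarray, gt_mask: np.ndarray, k: int = 2) -> float:
    """Fraction of ground-truth line pixels covered by the prediction
    after dilating it by radius k (in pixels). Both inputs are boolean
    HxW masks of rasterized line segments. k = 0 would demand an exact
    pixel match, i.e. the strict LP0 variant mentioned later in the thread."""
    if k > 0:
        pred_mask = binary_dilation(pred_mask, iterations=k)
    gt_pixels = gt_mask.sum()
    if gt_pixels == 0:
        return 1.0  # no ground-truth lines: trivially covered
    return float((pred_mask & gt_mask).sum() / gt_pixels)
```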

> Our essential idea is that the number of lines is neither necessary for training nor informative for quantitative evaluation

When the structural AP metric was introduced by Zhou et al., they included a section arguing that heatmap-based evaluation metrics have two main drawbacks:

  1. They do not account for overlapping lines, so if there are numerous extra or heavily overlapping predicted lines, there is no penalty for that.
  2. They do not account for connectivity, meaning that if a long line segment is divided into several smaller parts, there is no penalty for that either.


I think these aspects can be quite important for certain use cases. I understand and agree that the number of lines is not necessary for actual training, though.

Did I understand LP correctly?

Lx017 commented 1 year ago

You are totally correct about sAP and LP, but we have reasons to "fall back" to heatmap-based metrics.

Regarding the overlapping issue, the algorithms of AirLine and LSD guarantee that there will be no overlapping lines, while LCNN, the Hough transform, and LETR can produce them. So AirLine avoids overlap by design.

Connectivity is quite a complicated topic. You will see cases that are labeled as a single line but are more useful when detected as separate segments (like grids, where separate segments are better for localization). Even if you really need merged segments, I think that should be the job of post-processing, which is actually easy; line detectors should just detect segments as precisely as possible. Sticking to the connectivity defined in a dataset is not necessary, at least from a practical perspective.

Another big problem with sAP (mainly caused by its treatment of connectivity) is that it fails to support a fair comparison of different methods, as stated in our paper. If you take a look at LSD's sAP score in Zhou's paper, it is "/". It is hardly reasonable that the most popular line detector scores nothing under a metric.

These reasons brought us back to image-based metrics and led us to design LP for a fair comparison across distinct detectors.

PS: emphasizing connectivity can be problematic; LCNN and LETR usually yield overly connected lines in our generalization experiments. I would personally recommend merging lines after detection, as sketched below.
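A rough sketch of the kind of post-detection merging I mean, purely illustrative (the greedy strategy, tolerances, and segment format are assumptions, not AirLine's code):

```python
import numpy as np

def merge_segments(segs, angle_tol=np.deg2rad(3.0), gap_tol=5.0):
    """Greedily merge nearly parallel segments whose endpoints nearly
    touch, replacing each mergeable pair with its farthest-apart
    endpoint pair. segs: list of (x1, y1, x2, y2) tuples."""
    def angle(s):
        return np.arctan2(s[3] - s[1], s[2] - s[0]) % np.pi

    segs = [tuple(map(float, s)) for s in segs]
    merged = True
    while merged:
        merged = False
        for i in range(len(segs)):
            for j in range(i + 1, len(segs)):
                a, b = segs[i], segs[j]
                diff = abs(angle(a) - angle(b))
                if min(diff, np.pi - diff) > angle_tol:
                    continue  # not close to parallel
                pa = np.array(a).reshape(2, 2)
                pb = np.array(b).reshape(2, 2)
                # rough proximity test: closest endpoint-to-endpoint gap
                if np.linalg.norm(pa[:, None] - pb[None], axis=-1).min() > gap_tol:
                    continue
                # the merged segment spans the two farthest endpoints
                pts = np.vstack([pa, pb])
                d = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
                p, q = np.unravel_index(d.argmax(), d.shape)
                segs[i] = tuple(pts[p]) + tuple(pts[q])
                del segs[j]
                merged = True
                break
            if merged:
                break
    return segs
```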

ogencoglu commented 1 year ago

Thank you for the detailed response. Those are valid arguments indeed!

One more question regarding the dilation. What is the reason for thickening/dilating only the predicted line segments when calculating LP? It would be interesting to hear your insights on the differences between LP and "dilating both the ground truth and the predictions and calculating Intersection over Union".

ogencoglu commented 1 year ago

The reason for my question is that LP does not seem to penalize false positives that are far away from any ground-truth segment (unlike IoU, which takes that into account with the union operator).

So would an algorithm that predicts millions of random line segments get a perfect LP score, since it would eventually cover all the ground-truth pixels?
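For concreteness, the alternative I have in mind would look something like this (a sketch under my own assumptions, reusing the mask convention from my earlier sketch; not something from the paper):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def dilated_iou(pred_mask: np.ndarray, gt_mask: np.ndarray, k: int = 2) -> float:
    """IoU after dilating both boolean HxW masks by radius k. Unlike a
    coverage-only score, the union term penalizes predicted pixels far
    from any ground truth, so spamming random segments drives the score
    toward zero instead of toward one."""
    if k > 0:
        pred_mask = binary_dilation(pred_mask, iterations=k)
        gt_mask = binary_dilation(gt_mask, iterations=k)
    union = (pred_mask | gt_mask).sum()
    if union == 0:
        return 1.0  # both masks empty
    return float((pred_mask & gt_mask).sum() / union)
```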

Lx017 commented 1 year ago

You got really deep into the paper lol, I am really happy to have a reader like you! It is true that LP does not count false positives, which is indeed a pain point, but I personally do not think it is a mistake.

Our observation is that most false positives that AirLine and LSD produce are actually real lines that were not labeled in the images, so we choose to treat them as reasonable predictions given the limitations of hand-crafted datasets. This is also the reason the loss function applies only a weak penalty to them. Although this yields more "false" positives, it produces better practical results and much better generalization ability. We attribute this treatment to the imperfect quality of the datasets.

It is possible for a random-million-line generator to achieve a high LP score, but qualitative evaluation will not miss that; after all, quantitative evaluation is just one aspect. We also introduced LP at different tolerance levels, such as LP0, which requires an exact match to the GT. If a messy line detector achieved an LP0 score similar to AirLine's, it would still be very obviously bad qualitatively.

If you are looking for a rigorous and perfect metric for line detection that reflects actual performance, I think neither LP nor sAP may be the answer until a well-defined dataset shows up (most probably a synthesized dataset rather than a hand-crafted one) that does not miss any lines in the images.

Let me know if you have further questions!

ogencoglu commented 1 year ago

Thanks for the reply once again. Appreciated!

For my use case, I think I would perform segment-level matching of predictions and ground truths with the Hungarian algorithm, where the cost is IoU or a structural distance, and then calculate LP over the matched segments.
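A rough sketch of that matching step (the endpoint-distance cost is a stand-in; an IoU over rasterized segment masks would slot into the same cost matrix):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_segments(pred_segs, gt_segs):
    """One-to-one matching of predicted and GT segments, each given as
    (x1, y1, x2, y2), via the Hungarian algorithm with summed endpoint
    distance as the cost. Returns (pred_idx, gt_idx, cost) triples;
    unmatched leftovers on either side can then be counted as false
    positives or false negatives."""
    pred = np.asarray(pred_segs, dtype=float).reshape(-1, 2, 2)
    gt = np.asarray(gt_segs, dtype=float).reshape(-1, 2, 2)
    cost = np.zeros((len(pred), len(gt)))
    for i, p in enumerate(pred):
        for j, g in enumerate(gt):
            # endpoints are unordered, so take the cheaper orientation
            d1 = np.linalg.norm(p - g, axis=1).sum()
            d2 = np.linalg.norm(p - g[::-1], axis=1).sum()
            cost[i, j] = min(d1, d2)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols, cost[rows, cols]))
```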