Open iMayuqi opened 1 month ago
Thank you for pointing out this issue! You are correct, there was a problem with the evaluation script for OWL detectors. The issue occurs because the script removes from the ground truth any image containing only annotations not processed by OWL, but retains unprocessed annotations within images that also contain annotations processed by OWL.
We did not encounter this problem during the evaluations mentioned in the paper because we used a JSON containing only the annotations processed by OWL. We created these JSON subsets from the original benchmark to conduct the evaluations on OWL annotations for all the detectors detailed in the supplementary material.
I have fixed this issue by adding the following lines at line 241 of the script:
# In case the ground truth for the image includes captions not processed by the detector, we remove them
relevant_cats = [pred['category_id'] for pred in preds_per_image[imm['file_name']]]
mask = torch.isin(target['labels'], torch.tensor(relevant_cats))
target['labels'] = target['labels'][mask]
target['boxes'] = target['boxes'][mask]
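For reference, the masking in that fix can be sketched in isolation. The shapes of `target` and the hard-coded `relevant_cats` below are illustrative assumptions, not the actual contents of the script:

```python
import torch

# Hypothetical ground truth for one image: category labels and matching boxes.
target = {
    'labels': torch.tensor([3, 7, 3, 9]),
    'boxes': torch.tensor([[0., 0., 10., 10.],
                           [5., 5., 20., 20.],
                           [1., 1., 4., 4.],
                           [8., 2., 12., 9.]]),
}

# Categories actually processed by the detector for this image.
relevant_cats = [3, 9]

# Keep only ground-truth entries whose category the detector processed;
# the entry with label 7 is dropped, along with its box.
mask = torch.isin(target['labels'], torch.tensor(relevant_cats))
target['labels'] = target['labels'][mask]
target['boxes'] = target['boxes'][mask]
```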
Thanks for your answer! I have corrected the code, and the accuracy is indeed much closer to the numbers in the paper. By the way, since the number of annotations I obtain after filtering does not match the number reported in the paper, could you share the data subsets or the filtering code used for testing the OWL model?
I uploaded a zip file benchmarks_owl-subset.zip in the v1 release. I hope this helps with your research!
Thank you very much for sharing! Strangely, this subset still contains captions with a token size greater than 16. Did you apply any truncation?
Sorry for the late response!
No, we did not perform any truncation operation. I believe I understand the issue now. The subset of OWL we used (and uploaded) includes annotations that allow inference for a vocabulary with zero negatives. However, in some cases, the negative captions are longer than the positive ones. For example, a simple color like "red" might be substituted with a more complex one like "light blue." Thus, annotations with these vocabularies are processed for a certain number of hard negatives but are excluded when the vocabulary includes excessively long negatives.
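A minimal sketch of the kind of filter described above. Whitespace splitting stands in for the model's tokenizer (the real token count would come from the detector's own tokenizer and can be larger than the word count), and the threshold of 16 is taken from this thread:

```python
MAX_TOKENS = 16  # token budget discussed in this thread

def token_len(caption: str) -> int:
    # Stand-in tokenizer: whitespace split. The actual filter would use
    # the detector's tokenizer, which may produce more tokens than words.
    return len(caption.split())

def keep_annotation(positive: str, negatives: list[str]) -> bool:
    # An annotation survives only if the positive caption and every
    # negative in its vocabulary fit within the token budget.
    return all(token_len(c) <= MAX_TOKENS for c in [positive, *negatives])

# A short negative like "red" may be replaced by a longer one like
# "light blue", which can push a vocabulary over the limit.
print(keep_annotation("a red car", ["a light blue car"]))  # True
```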
I apologize for any inconvenience this may have caused. We could have been more precise in addressing this issue in the paper. I hope this clarifies your doubts on the matter!
Hello author, I have reproduced the inference of the OWL model, but the accuracy on the benchmark dataset is 2-4% lower than in the paper. I suspect the problem lies in the filtering of data with a token size greater than 16. Taking 1_attribute.json as an example, the original caption count is 2349; after filtering I am left with 1785, whereas the paper reports 1816. The accuracy reported for OWL (L/14) in the paper is 26.5, while I reproduce 23.7 (after NMS). I used the default inference settings in owl_inference.py, i.e. --disentangled_inferences=False, --nms=False, --n_hardnegatives=5; NMS is applied when calling evaluate_map.py for evaluation. What could explain this drop in accuracy? I hope to receive your answer, as it would greatly help my work.
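For context, the post-hoc NMS mentioned above (applied during evaluation rather than at inference time) can be sketched as plain greedy NMS. The IoU threshold of 0.5 is an assumption for illustration, not necessarily the script's actual setting, and a real evaluation would typically apply it per class:

```python
def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thr=0.5):
    # Greedy NMS: visit boxes by descending score, keep a box only if it
    # does not overlap an already-kept box above the threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```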