tusharsangam / TransVisDrone


Some confusion about the results of the experiment #9

Open 1430329743 opened 8 months ago

1430329743 commented 8 months ago

Firstly, why are the validation results better than the training results? For FLD, your validation (val) run reports an mAP of 0.754, but the best value logged during training is only 0.72456. As you mention in your paper, the FLD dataset is split in half, with one part used for training and the other for testing (val), so I would expect the best mAP logged during training to match the best val result. Indeed, the best result in both cases occurs at epoch 19, as seen in https://github.com/tusharsangam/TransVisDrone/blob/main/runs/train/FL/image_size_1280_temporal_YOLO5L_5_frames_FL_end/results.csv, yet the mAP values at epoch 19 differ: 0.72456 in the training log versus 0.754 in val. Does the training log correspond to the reported value, or were different data augmentations or evaluation settings used to produce the final results?
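For reference, this is roughly how I read the number out of the training log (a minimal sketch only; I am assuming the file follows the standard YOLOv5 results.csv layout, where the space-padded column "metrics/mAP_0.5" holds the per-epoch val mAP):

```python
import pandas as pd

# Assumption: YOLOv5-style results.csv, where "metrics/mAP_0.5" is the mAP
# computed on the val split at the end of each epoch.
df = pd.read_csv(
    "runs/train/FL/image_size_1280_temporal_YOLO5L_5_frames_FL_end/results.csv"
)
df.columns = [c.strip() for c in df.columns]  # strip the space-padded header names

best = df.loc[df["metrics/mAP_0.5"].idxmax()]
print(f"best epoch: {int(best['epoch'])}, mAP@0.5 logged during training: {best['metrics/mAP_0.5']:.5f}")
# For me this shows 0.72456 at epoch 19, while the reported val result is 0.754,
# which is exactly the gap I am asking about.
```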

Secondly, I would like to confirm that all the models in Table 1 of your paper were compared on the NPS and FLD datasets. Is that correct? In other words, all methods were trained and evaluated with the same image annotations, right?

Thirdly, regarding the exact number of annotated images used in your paper: you mention using the Dogfight annotations (https://github.com/mwaseema/Drone-Detection/tree/main/annotations). Since Dogfight annotates significantly fewer frames than the original dataset contains, I'd like to ask whether you used only the frames re-annotated by Dogfight, or whether you also included the remaining original frames that Dogfight did not annotate in detail. For example, the FLD dataset originally has 38,948 frames, but Dogfight provides detailed annotations for only 20,017 of them. Did you use the union of the original and corrected versions, or only the frames with detailed annotations?
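To make the question concrete, this is roughly how I count the frames on my side (a sketch only; the paths and the one-annotation-file-per-frame layout are my assumptions, not something taken from the repo):

```python
from pathlib import Path

# Hypothetical paths -- adjust to wherever the FLD frames and the Dogfight
# annotation files actually live on disk.
frames_dir = Path("FLD/images")   # all extracted video frames
labels_dir = Path("FLD/labels")   # one annotation .txt per annotated frame

n_frames = sum(1 for p in frames_dir.rglob("*") if p.suffix.lower() in {".png", ".jpg", ".jpeg"})
n_labeled = sum(1 for _ in labels_dir.rglob("*.txt"))

print(f"total frames:   {n_frames}")    # 38,948 for the original FLD videos
print(f"labeled frames: {n_labeled}")   # vs 20,017 frames with Dogfight annotations
```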

Fourthly, I'm curious whether detection is run only on the labeled frames or on all frames, regardless of whether they have detailed annotations.

Yipzcc commented 6 months ago

I have the same question. For FLD, why is the training mAP 0.72456 while the val mAP is 0.754? Have you found the reason?