Closed: ashnair1 closed this issue 2 years ago
@ashnair1 @isaaccorley any thoughts on 2 and 3? I would like to fix these issues in 0.3.1 (coming in approximately 1 week).
Could 3 explain 2? In example 2, the missing patch seems to be right on the edge of the image.
I think fixing the coordinates would likely fix the issue. I would bet that our bbox plotting method doesn't support boxes that fall outside the image bounds.
Point 2 (missing polygons) is caused by the absence of some boxes in `train_data.csv`, which means they have no classification. For example, the missing box here has ID OSBS00011, which is not present in `train_data.csv`. `itc_rsFile.csv` contains 1180 records, while `train_data.csv` only contains 1166.
At any rate, this is unrelated to Point 3.
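The count mismatch above can be confirmed by diffing the ID columns of the two CSVs. A minimal sketch with pandas; the column name `indvdID` and the toy IDs here are assumptions, not the actual CSV schema:

```python
import pandas as pd

def missing_label_ids(itc: pd.DataFrame, labels: pd.DataFrame,
                      id_col: str = "indvdID") -> list:
    # IDs that have a crown box in the ITC file but no row (hence no
    # species label) in the training label file
    return sorted(set(itc[id_col]) - set(labels[id_col]))

# toy stand-ins for itc_rsFile.csv and train_data.csv
itc = pd.DataFrame({"indvdID": ["OSBS00010", "OSBS00011", "OSBS00012"]})
labels = pd.DataFrame({"indvdID": ["OSBS00010", "OSBS00012"]})
print(missing_label_ids(itc, labels))  # ['OSBS00011']
```

Running the same set difference on the real files should list the 14 unlabeled crowns.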
@ethanwhite do you know anything about points 2 and 3?
Are these expected? For 2, should we just ignore those polygons and pretend that they don't exist? For 3, I'm not sure how a model would be able to predict those full polygons. We could either ignore them or crop them to the bounds of the image.
I wrote the dataset so I can provide some insight.
Task 1 of the dataset is simply to predict the bbox of the tree crown. Task 2 is to do both: predict the bbox and classify the species. For task 1 we shouldn't worry about filtering the boxes without labels. For task 2 we should filter them.
I think cropping to the image bounds would be better because technically it is still a tree crown, just a partial one, and partial crowns would definitely be encountered during inference when tiling a larger ortho if the model were deployed. When writing the dataset I assumed all the geometries would be within the image bounds, but it looks like this was not the case.
@isaaccorley sounds good to me, can you make a PR with those changes?
@adamjstewart - I've pinged the folks on our team who can best address this. Hopefully they'll chime in after the holiday and once we're all on the same page we can fix things in the core data source as well.
Quick update:
I've clipped the boxes to the image bounds, and it seems that the boxes with negative pixel coordinates only touch the image edge, so after clipping they have an area of zero.
I think the best option is to clip and remove any boxes with an area of zero. Let me know if this works for everyone.
Edit: I did a quick analysis, and this affects 52/1129 (~5%) of boxes in the train set.
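The proposed clip-and-drop step can be sketched in plain Python. The corner-tuple box convention, image size, and function name here are hypothetical, not the torchgeo implementation:

```python
def clip_box(xmin, ymin, xmax, ymax, width, height):
    """Clip a pixel-space box to the image extent; drop degenerate results."""
    # clamp each corner to [0, width] x [0, height]
    xmin, xmax = max(xmin, 0), min(xmax, width)
    ymin, ymax = max(ymin, 0), min(ymax, height)
    # boxes that only touch an edge collapse to zero width or height; drop them
    if xmax - xmin <= 0 or ymax - ymin <= 0:
        return None
    return (xmin, ymin, xmax, ymax)

print(clip_box(-5, 10, 0, 30, 200, 200))   # None: touches the left edge only
print(clip_box(-5, 10, 40, 30, 200, 200))  # (0, 10, 40, 30)
```

The first call models the 52 zero-area boxes discussed above; the second shows a genuinely overlapping box being cropped rather than removed.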
Hmm, that's quite a lot of images. Maybe wait for @ethanwhite's team to get back to us.
It's 52 boxes. It affects 19 images. These are the images with boxes that only touch the image bounds.
[
'train/RemoteSensing/RGB/MLBS_25.tif',
'train/RemoteSensing/RGB/MLBS_27.tif',
'train/RemoteSensing/RGB/MLBS_32.tif',
'train/RemoteSensing/RGB/MLBS_34.tif',
'train/RemoteSensing/RGB/MLBS_35.tif',
'train/RemoteSensing/RGB/MLBS_37.tif',
'train/RemoteSensing/RGB/MLBS_39.tif',
'train/RemoteSensing/RGB/MLBS_41.tif',
'train/RemoteSensing/RGB/MLBS_43.tif',
'train/RemoteSensing/RGB/MLBS_45.tif',
'train/RemoteSensing/RGB/MLBS_5.tif',
'train/RemoteSensing/RGB/OSBS_12.tif',
'train/RemoteSensing/RGB/OSBS_15.tif',
'train/RemoteSensing/RGB/OSBS_18.tif',
'train/RemoteSensing/RGB/OSBS_2.tif',
'train/RemoteSensing/RGB/OSBS_24.tif',
'train/RemoteSensing/RGB/OSBS_29.tif',
'train/RemoteSensing/RGB/OSBS_36.tif',
'train/RemoteSensing/RGB/OSBS_38.tif'
]
@isaaccorley does #760 only fix 3 or does it also fix 2? Did you want to submit a separate PR to fix 2 or should we do it in the same PR? No preference from me.
The dataset already handles (2) by filtering out boxes without labels. So this isn't technically a bug; it was just a matter of deciding not to refactor to load boxes that don't have labels.
The dataset already performs (2) by filtering boxes without labels.
So you're saying the dataset already does:
For task1 we shouldn't worry about filtering the boxes without labels. For task2 we should filter them.
So issue (2) only exists for task2, but that's okay, and issue (2) doesn't exist for task1 because we aren't filtering them?
Yes. As it's implemented right now, the train set doesn't have separate setups for task1/task2. It simply loads all boxes and labels and filters out the unlabeled boxes. The test set, however, does support task1/task2 separately.
Hi, sorry for the delay. I can help with understanding the data and figuring how to fix (2) and (3).
Point 2 (Missing polygons) is caused by the absence of some boxes in train_data.csv which means they have no classification.
Yes, I can confirm that for some individual trees we could not provide a species label. That is because the field sampling was performed in 20x20m quadrants, and stems outside the area of interest were not identified in the field. This allowed multiple crown objects to make it into the images while their stems did not make it into the field sampling area. Since we were testing for both detection and classification, we kept these crown objects but assigned them no label.
For OSBS_27, though, there seems to be another issue. The field data only cover the top-left quadrant (~10x10m), when in fact they should cover the entire image. I just checked the field stem position data retrieved from the NEON dataset, and it seems to confirm the presence of stems only where the boxes are. I am looking into whether there was an issue in retrieving stem coordinates for that specific plot. For now, I'd suggest removing plot OSBS_27.
So issue (2) only exists for task2, but that's okay, and issue (2) doesn't exist for task1 because we aren't filtering them?
This sounds correct to me.
It's 52 boxes. It affects 19 images. These are the images with boxes that only touch the image bounds.
I'll try to look into it today and report back. I need to dive into the function you are using, but I believe it has to do with the fact that in some cases the images may have been clipped at the exact border of some crowns.
Right now as it's implemented, the train set doesn't have separate setup for task1/task2
Why is that the case? I can imagine that if your only goal is performing well on test task1, you would want to keep all boxes for train even if they don't have labels. Doesn't having different logic for train and test actually make the dataset more complicated?
You are right. However, when I initially wrote the dataset, I didn't realize that a few boxes lacked labels or that some boxes fell outside the bounds of their images. I'll make an issue to rewrite the dataset to support separate task1 and task2 setups for the train set as well.
Cool, I think that can wait until 0.4 since that's an actual API change. Since you're making a new issue for that, I think we can consider this particular issue solved by #683 and #760. Thanks for all of the hard work everyone! Datasets are complicated!!
Description
While using IDTReeS to train an object detector, I noticed quite a few bugs within this dataset.
1. Coordinates are swapped
Fixed by #683
The above image is a comparison of the actual ground truth of OSBS_27 and the plot produced by torchgeo.
2. Missing polygon
The shapefile contains 5 polygons, whereas the plot only shows 4. This issue seems to be at the dataset level.
3. Negative box coordinates in pixel space
This is a weird one. I noticed that some boxes had negative pixel coordinates. I observed this in MLBS_5, but it's possible this could be present in others as well.
Comparing QGIS and the corrected torchgeo plot for MLBS_5 didn't reveal much
As you can see, the top 4 polygons are missing. But unlike the previous issue, these polygons do exist. However they seem to be outside the bounds of the image.
This could explain the negative coordinates but I'm not sure as to why there is a difference between the plots.
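One way to see how negative pixel coordinates arise: converting a geographic corner to row/col indices with the image's geotransform yields negative values whenever the point lies above or to the left of the image origin. A minimal sketch with a made-up origin and the 0.1 m pixel size typical of NEON RGB imagery; the numbers are illustrative only:

```python
# toy geotransform: top-left corner at (x0, y0), square pixels of size `res`
x0, y0, res = 100.0, 200.0, 0.1

def to_pixel(x, y):
    # row grows downward from the top edge, col grows rightward from the left
    return ((y0 - y) / res, (x - x0) / res)  # (row, col)

# a crown corner just left of and above the image extent maps to
# negative pixel indices, as seen for some MLBS_5 boxes
print(to_pixel(99.5, 200.5))  # (-5.0, -5.0)
```

So any geometry extending past the top or left edge of the raster produces negative pixel coordinates after the geographic-to-pixel conversion, which matches boxes lying outside the image bounds.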
Steps to reproduce
Most of these issues can be seen by downloading the IDTReeS dataset and comparing the torchgeo plots vs the shapefiles.
Version
0.4.0.dev0