microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
https://www.osgeo.org/projects/torchgeo/
MIT License
2.69k stars 333 forks source link

IDTReeS bugs #684

Closed ashnair1 closed 2 years ago

ashnair1 commented 2 years ago

Description

While using IDTReeS to train an object detector, I noticed quite a few bugs within this dataset.

1. Coordinates are swapped

Fixed by #683

OSBS_27_before

The above image is a comparison of the actual ground truth of OSBS_27 and the plot produced by torchgeo.

2. Missing polygon

Shapefile contains 5 polygons whereas the plot only shows 4. This issue seems to be at the dataset level.

OSBS_27_missingpoly

3. Negative box coordinates in pixel space

This is a weird one. I noticed that some boxes had negative pixel coordinates. Observed this in MLBS_5 but its possible this could be present in others as well.

Comparing QGIS and the corrected torchgeo plot for MLBS_5 didn't reveal much

MLBS_5_initial

As you can see, the top 4 polygons are missing. But unlike the previous issue, these polygons do exist. However they seem to be outside the bounds of the image.

MLBS_5_second

This could explain the negative coordinates but I'm not sure as to why there is a difference between the plots.

Steps to reproduce

Most of these issues can be seen by downloading the IDTReeS dataset and comparing the torchgeo plots vs the shapefiles.

Version

0.4.0.dev0

adamjstewart commented 2 years ago

@ashnair1 @isaaccorley any thoughts on 2 and 3? I would like to fix these issues in the 0.3.1 (coming in approximately 1 week).

Could 3 explain 2? In example 2, the missing patch seems to be right on the edge of the image.

isaaccorley commented 2 years ago

I think fixing the coordinates would likely fix the issue. I would bet that our bbox plotting method doesn't support boxes that are out of image bounds

ashnair1 commented 2 years ago

Point 2 (Missing polygons) is caused by the absence of some boxes in train_data.csv which means they have no classification.

For example, the missing box here has ID OSBS00011 which is not present in train_data.csv. itc_rsFile.csv contains 1180 records while train_data.csv only contains 1166.

At any rate, this is unrelated to Point 3.

adamjstewart commented 2 years ago

@ethanwhite do you know anything about points 2 and 3?

  1. It seems that not all polygons have a class label
  2. It seems that some polygons have coordinates outside of the bounds of the image

Are these expected? For 2, should we just ignore those polygons and pretend that they don't exist? For 3, I'm not sure how a model would be able to predict those full polygons. We could either ignore them or crop them to the bounds of the image.

isaaccorley commented 2 years ago

I wrote the dataset so I can provide some insight.

  1. Task1 of the dataset is simply to predict the bbox of the tree crown. Task 2 is to perform both (predict + classify the species). For task1 we shouldn't worry about filtering the boxes without labels. For task2 we should filter them.

  2. I think cropping to the image bounds would be better because technically it is still a tree crown it would just be a partial which would definitely be encountered during inference when tiling a larger ortho if the model was deployed. When writing the dataset I assumed all the geometries would be within the image bounds but looks like this was not the case.

adamjstewart commented 2 years ago

@isaaccorley sounds good to me, can you make a PR with those changes?

ethanwhite commented 2 years ago

@adamjstewart - I've pinged the folks on our team who can best address this. Hopefully they'll chime in after the holiday and once we're all on the same page we can fix things in the core data source as well.

isaaccorley commented 2 years ago

Quick update:

I've clipped the boxes to the image bounds and seems that the boxes with negative pixel coordinates touch the image bounds but have an area of zero.

Screen Shot 2022-09-05 at 3 50 56 PM

I think the best option is to clip and remove any boxes with an area of zero. Let me know if this works for everyone.

Edit: Did a quick analysis and this affects 52/1129=~5% of boxes in the train set

adamjstewart commented 2 years ago

Hmm, that's quite a lot of images. Maybe wait for @ethanwhite's team to get back to us.

isaaccorley commented 2 years ago

It's 52 boxes. It affects 19 images. These are the images with boxes that only touch the image bounds.

[
 'train/RemoteSensing/RGB/MLBS_25.tif',
 'train/RemoteSensing/RGB/MLBS_27.tif',
 'train/RemoteSensing/RGB/MLBS_32.tif',
 'train/RemoteSensing/RGB/MLBS_34.tif',
 'train/RemoteSensing/RGB/MLBS_35.tif',
 'train/RemoteSensing/RGB/MLBS_37.tif',
 'train/RemoteSensing/RGB/MLBS_39.tif',
 'train/RemoteSensing/RGB/MLBS_41.tif',
 'train/RemoteSensing/RGB/MLBS_43.tif',
 'train/RemoteSensing/RGB/MLBS_45.tif',
 'train/RemoteSensing/RGB/MLBS_5.tif',
 'train/RemoteSensing/RGB/OSBS_12.tif',
 'train/RemoteSensing/RGB/OSBS_15.tif',
 'train/RemoteSensing/RGB/OSBS_18.tif',
 'train/RemoteSensing/RGB/OSBS_2.tif',
 'train/RemoteSensing/RGB/OSBS_24.tif',
 'train/RemoteSensing/RGB/OSBS_29.tif',
 'train/RemoteSensing/RGB/OSBS_36.tif',
 'train/RemoteSensing/RGB/OSBS_38.tif'
]
adamjstewart commented 2 years ago

@isaaccorley does #760 only fix 3 or does it also fix 2? Did you want to submit a separate PR to fix 2 or should we do it in the same PR? No preference from me.

isaaccorley commented 2 years ago

The dataset already performs (2) by filtering boxes without labels. So this isn't technically a bug, it was just a matter of deciding to not refactor to load boxes that don't have labels.

760 cleans up (3) though.

adamjstewart commented 2 years ago

The dataset already performs (2) by filtering boxes without labels.

So you're saying the dataset already does:

For task1 we shouldn't worry about filtering the boxes without labels. For task2 we should filter them.

So issue (2) only exists for task2, but that's okay, and issue (2) doesn't exist for task1 because we aren't filtering them?

isaaccorley commented 2 years ago

Yes. Right now as it's implemented, the train set doesn't have separate setup for task1/task2. It simply loads all boxes and labels and filters the rest. The test set does however support separate task1/task2.

MarconiS commented 2 years ago

Hi, sorry for the delay. I can help with understanding the data and figuring how to fix (2) and (3).

Point 2 (Missing polygons) is caused by the absence of some boxes in train_data.csv which means they have no classification.

Yes, I can confirm that for some individual trees we could not provide a species label. That is because the field sampling was performed in 20x20m quadrants and stems outside the area of interest were not identified from the field. This allowed for multiple crown objects to make it into the images while their stems did not make into the field sampling area. Since we were testing for both detection and classification, we kept these crown objects, but assigned them to no label.

For OSBS_27 although, there seems to be another issue. The field data only cover the top-left quadrant (~10x10m), while in fact it should cover the entire image. I just checked with the field stem position data retrieved from NEON dataset, and it seems to confirm the presence of stems only where the boxes are. I am looking into it to assess whether there was an issue in retrieving stem coordinates for that specific plot. For now, I'd probably suggest to remove plot OSBS_27.

So issue (2) only exists for task2, but that's okay, and issue (2) doesn't exist for task1 because we aren't filtering them?

This sounds correct to me.

It's 52 boxes. It affects 19 images. These are the images with boxes that only touch the image bounds.

I'll try to look into it today and touch back. I need to dive into the function you are using but I believe it has to with the fact that in some cases the images may have been clipped at the exact border of some crowns.

adamjstewart commented 2 years ago

Right now as it's implemented, the train set doesn't have separate setup for task1/task2

Why is that the case? I can imagine that if you're only goal is performing well on test task1, you would want to keep all boxes for train even if they don't have labels. Doesn't having different logic for train and test actually make the dataset more complicated?

isaaccorley commented 2 years ago

You are right. However when I initially wrote the dataset I assumed that there weren't a few boxes without labels and that boxes wouldn't be outside the bounds of some images. I'll make an issue to rewrite the dataset to have separate task1 and task2 for the train set as well.

adamjstewart commented 2 years ago

Cool, I think that can wait until 0.4 since that's an actual API change. Since you're making a new issue for that, I think we can consider this particular issue solved by #683 and #760. Thanks for all of the hard work everyone! Datasets are complicated!!