microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
https://www.osgeo.org/projects/torchgeo/
MIT License
2.32k stars 298 forks

Update metrics for issue 2121 #2130

Open robmarkcole opened 1 week ago

robmarkcole commented 1 week ago

Addresses https://github.com/microsoft/torchgeo/issues/2121 for segmentation. Mostly copied from @isaaccorley's implementation linked here; he additionally passes on_epoch=True, which could be desirable?

Output metrics for a binary task with labels=['background', 'target']:

[{'test_loss': 0.008943737484514713,
  'test_multiclassaccuracy_background': 0.9986226558685303,
  'test_multiclassaccuracy_target': 0.6913207173347473,
  'test_AverageAccuracy': 0.8449711203575134,
  'test_AverageF1Score': 0.8546953201293945,
  'test_AverageJaccardIndex': 0.7742567658424377,
  'test_multiclassfbetascore_background': 0.9984697103500366,
  'test_multiclassfbetascore_target': 0.7109203338623047,
  'test_multiclassjaccardindex_background': 0.996947705745697,
  'test_multiclassjaccardindex_target': 0.5515653491020203,
  'test_OverallAccuracy': 0.9969589710235596,
  'test_OverallF1Score': 0.9969589710235596,
  'test_OverallJaccardIndex': 0.9939362406730652,
  'test_multiclassprecision_background': 0.9983214139938354,
  'test_multiclassprecision_target': 0.7317557334899902,
  'test_multiclassrecall_background': 0.9986226558685303,
  'test_multiclassrecall_target': 0.6913207173347473}]
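
For reference, a minimal sketch (not the PR's actual code) of how a torchmetrics collection producing keys like those above could be assembled; the key names ("AverageAccuracy", "OverallAccuracy", etc.) and exact arguments are assumptions chosen to match the output:

```python
# Hypothetical sketch: averaged, overall, and classwise metrics via torchmetrics.
from torchmetrics import MetricCollection
from torchmetrics.classification import (
    MulticlassAccuracy,
    MulticlassF1Score,
    MulticlassJaccardIndex,
)
from torchmetrics.wrappers import ClasswiseWrapper

num_classes = 2
labels = ["background", "target"]

metrics = MetricCollection({
    # average="macro" takes the unweighted mean of per-class scores -> "Average*"
    "AverageAccuracy": MulticlassAccuracy(num_classes, average="macro"),
    "AverageF1Score": MulticlassF1Score(num_classes, average="macro"),
    "AverageJaccardIndex": MulticlassJaccardIndex(num_classes, average="macro"),
    # average="micro" pools all pixels before scoring -> "Overall*"
    "OverallAccuracy": MulticlassAccuracy(num_classes, average="micro"),
    "OverallF1Score": MulticlassF1Score(num_classes, average="micro"),
    "OverallJaccardIndex": MulticlassJaccardIndex(num_classes, average="micro"),
    # ClasswiseWrapper expands average=None scores into one key per label,
    # e.g. "multiclassaccuracy_background"
    "ClasswiseAccuracy": ClasswiseWrapper(
        MulticlassAccuracy(num_classes, average=None), labels=labels
    ),
})
test_metrics = metrics.clone(prefix="test_")
```

Note that micro-averaged accuracy and F1 coincide, which is consistent with test_OverallAccuracy and test_OverallF1Score being identical above.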
DimitrisMantas commented 1 week ago

Given that most metrics of interest are broken (e.g., all of them when average="macro" and ignore_index is specified (https://github.com/Lightning-AI/torchmetrics/pull/2443), and JaccardIndex, which outputs NaN when average="macro" even when you try to take absent and ignored classes into account with zero_division (https://github.com/Lightning-AI/torchmetrics/issues/2535)), should we make an effort to see if and how we could add our own?

I'm saying this because these are only the issues I've found so far; I've also noticed other suspicious things, like the fact that my classwise recall values do not match those in the confusion matrix when it is normalized with respect to the ground truth (I haven't checked whether the same holds for precision, i.e., when the matrix is normalized column-wise). I'm also pretty confident that if all of this is wrong, then micro averaging is probably wrong as well.

It should be pretty easy to compute all these metrics straight from the confusion matrix (assuming it, at least, is correct), and I've actually tried to reimplement them this way, but it hasn't really been a priority because I've found that all these wrong (?) values are basically a lower bound of the actual ones. If you look at the official implementations, this is actually what they are doing, and my guess is that they have a bug in their logic later on. But indeed, all these metrics inherit from StatScores, which is basically the confusion matrix.
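
(A minimal sketch of this confusion-matrix approach; the helper name is hypothetical:)

```python
# Hypothetical sketch: deriving classwise precision, recall, and IoU directly
# from a confusion matrix. torchmetrics' MulticlassConfusionMatrix uses rows
# for true classes and columns for predictions.
import torch

def metrics_from_confmat(confmat: torch.Tensor) -> dict[str, torch.Tensor]:
    tp = confmat.diag().float()
    fp = confmat.sum(dim=0).float() - tp  # predicted as class i but wrong
    fn = confmat.sum(dim=1).float() - tp  # true class i but missed
    # Note: an absent class yields 0/0 = NaN here, which is exactly where
    # zero_division handling (torchmetrics issue 2535) matters.
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "iou": tp / (tp + fp + fn),
    }

cm = torch.tensor([[90, 10], [5, 45]])  # 2-class example, rows = ground truth
print(metrics_from_confmat(cm))
```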

I'm actually pretty dumbfounded that these issues are not a top priority for the TorchMetrics team and that they instead focus on adding to their docs, but to each their own…

robmarkcole commented 1 week ago

@DimitrisMantas good call on my ignoring the ignore_index! In fairness, they do address issues, but they have a long backlog; when I made some noise they addressed https://github.com/Lightning-AI/torchmetrics/pull/2198. My opinion is that it is better to work with torchmetrics to address the issues than to implement from scratch here. I see your comment at https://github.com/Lightning-AI/torchmetrics/issues/2535#issuecomment-2143514389, so perhaps a pragmatic approach is to not add new metrics that we have concerns about, but also to create specific issues which track these concerns.

DimitrisMantas commented 1 week ago

Sure, that makes sense; please excuse the rant haha.

robmarkcole commented 1 week ago

Applied on_epoch=True to all steps for consistency. This results in both per-epoch and per-step values being reported for train only; perhaps this is why @isaaccorley did not apply it to train?

train_loss_epoch | 0.028535427525639534
train_loss_step | 0.00008003244874998927
train_AverageAccuracy_epoch | 0.9101453423500061
train_AverageAccuracy_step | 0.9124529361724854

Note that Val is unaffected:

val_AverageAccuracy | 0.8227439522743225

For a task with 2 classes, a grand total of 52 metrics are being reported between train and val.

isaaccorley commented 1 week ago

I just set it to be explicit, but I think PyTorch Lightning or torchmetrics automatically sets on_epoch to False for training and to True for everything else.

DimitrisMantas commented 1 week ago

You need to set both on_step and on_epoch explicitly to get logs only per step or only per epoch; see the sketch below.
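
(A minimal Lightning sketch, not the PR's code, illustrating the flag combinations; the module is hypothetical:)

```python
# Hypothetical LightningModule showing on_step/on_epoch combinations.
# Lightning's defaults differ per hook (training: on_step=True, on_epoch=False;
# validation/test: the reverse), so setting both flags makes logging explicit.
import torch
import lightning.pytorch as pl

class Demo(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Per-step only: with both flags set, no "train_loss_epoch" is logged.
        self.log("train_loss", loss, on_step=True, on_epoch=False)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Per-epoch only (the validation default, made explicit here).
        self.log("val_loss", loss, on_step=False, on_epoch=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())
```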

robmarkcole commented 1 week ago

@DimitrisMantas now logging only on_step for the train loss, so a more manageable 36 metrics.

robmarkcole commented 1 week ago

Not sure about this failing test: ValueError: Problem with given class_path 'torchgeo.trainers.SemanticSegmentationTask'

isaaccorley commented 1 week ago

Must be an issue with one of the minimum versions of the packages, since it's passing for the other tests.