mlcommons / GaNDLF

A generalizable application framework for segmentation, regression, and classification using PyTorch
https://gandlf.org
Apache License 2.0

[BUG] Classification - forced to calculate all metrics #953

Open · Linardos opened this issue 4 days ago

Linardos commented 4 days ago

Describe the bug

The current codebase effectively forces the calculation of all possible classification metrics, even if I explicitly request only one metric in my config file.

To Reproduce

Steps to reproduce the behavior:

  1. In config.yaml, set problem_type to "classification"
  2. Define your own metrics in config.yaml (a minimal sketch follows this list)
  3. Run GaNDLF
  4. You will likely see all possible metrics being calculated, even if you did not define them in the config
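For context, my config looks roughly like this (a minimal sketch, not my actual file; the metric name is just an example and all other keys are omitted):

```yaml
# Illustrative fragment of config.yaml, not a complete GaNDLF config
problem_type: classification
metrics:
  - accuracy   # the only metric I want computed
```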

Expected behavior

I expected only the metrics I defined to be calculated.

Environment information

GaNDLF version is the current master, 0.1.0

Solution

How I solved it in my case was by setting the calculate_overall_metrics parameter to False in both scripts where it appears (training_loop and forward_pass; see https://github.com/search?q=repo%3Amlcommons%2FGaNDLF%20calculate_overall_metrics&type=code).
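Concretely, the change amounts to something like this (a sketch of the edit only, not the exact surrounding GaNDLF source):

```python
# In training_loop.py and forward_pass.py, wherever calculate_overall_metrics is
# set, force it off so the per-cohort (epoch-level) metrics are skipped entirely.
# The surrounding context is omitted here.
calculate_overall_metrics = False
```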

sarthakpati commented 4 days ago

So, there are 2 types of metrics:

  1. Per-subject metrics (all of the metrics here [ref]) - these are calculated at the batch level [ref]
  2. Per-cohort metrics [ref] - these are calculated at the epoch level (since we need all targets and predictions) [ref]

Currently, only the per-subject metrics are user-controlled, and the per-cohort metrics are always calculated. I keep hearing that for regression and classification problems, the per-subject metrics are kind of meaningless (since we always end up averaging them across the epoch anyway), and from a statistical perspective, the per-cohort metrics are much more important.
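To make the distinction concrete: averaging a metric over batches is not, in general, the same as computing it once over the pooled epoch. A toy illustration (plain PyTorch, not GaNDLF code), using F1 as the example:

```python
import torch

# Toy epoch: three batches of binary targets and predicted labels.
targets = [torch.tensor([0, 0, 1, 1]), torch.tensor([0, 1, 1, 1]), torch.tensor([1, 1, 1, 1])]
preds   = [torch.tensor([0, 1, 1, 1]), torch.tensor([0, 0, 1, 1]), torch.tensor([1, 1, 0, 1])]

# "Per-subject" style: compute the metric per batch, then average over batches.
per_batch_f1 = []
for t, p in zip(targets, preds):
    tp = ((p == 1) & (t == 1)).sum().item()
    fp = ((p == 1) & (t == 0)).sum().item()
    fn = ((p == 0) & (t == 1)).sum().item()
    per_batch_f1.append(2 * tp / (2 * tp + fp + fn))
print(sum(per_batch_f1) / len(per_batch_f1))  # mean of the batch-wise F1 scores

# "Per-cohort" style: pool every target/prediction first, compute the metric once.
t_all, p_all = torch.cat(targets), torch.cat(preds)
tp = ((p_all == 1) & (t_all == 1)).sum().item()
fp = ((p_all == 1) & (t_all == 0)).sum().item()
fn = ((p_all == 0) & (t_all == 1)).sum().item()
print(2 * tp / (2 * tp + fp + fn))  # generally differs from the averaged value
```

With equal batch sizes the two agree for plain accuracy, but for metrics like F1, balanced accuracy, or AUC they generally diverge, which is why the per-cohort computation needs the full set of targets and predictions.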

That being said, how would you propose we update this so that this can be accommodated?

Linardos commented 4 days ago

Hmm, I understand per-cohort metrics are considered more important, but even so I'd leave the flexibility to the user in case they want to run sanity checks on per-subject metrics or other preliminary experiments before getting to the per-cohort ones. In my case, the per-cohort metrics were causing a bug in the FeTS Challenge code that remains to be traced, so I needed to disable them, at least for now.

I see two options:

The main thing here is for the user not to be left in the dark about this. It seems unintuitive for the code to override settings that the user passes in the config, and it is a cause for confusion.
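One way this could look (a hypothetical sketch, not existing GaNDLF behavior; the config key name is made up here):

```python
# Hypothetical: expose an explicit opt-out flag in config.yaml and read it from
# the parsed parameters, defaulting to True so the current behavior is unchanged.
params = {"problem_type": "classification"}  # stand-in for the parsed config.yaml
calculate_overall_metrics = params.get("calculate_overall_metrics", True)
print(calculate_overall_metrics)  # True unless the user explicitly disables it
```

Either way, the key point is that the behavior is visible and controllable from the config rather than hard-coded.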

sarthakpati commented 4 days ago

> The main thing here is for the user not to be left in the dark about this. It seems unintuitive for the code to override settings that the user passes in the config, and it is a cause for confusion.

I agree with you 💯%. I am fine with either suggestion. Do you want to put the PR together and I can review?