Bump torchmetrics from 0.9.2 to 0.10.1

Bumps torchmetrics from 0.9.2 to 0.10.1.

Release notes

Minor patch release

[0.10.1] - 2022-10-21

Fixed

Fixed broken clone method for classification metrics (#1250)

Fixed unintentional downloading of nltk.punkt when lsum not in rouge_keys (#1258)

Fixed type casting in MAP metric between bool and float32 (#1150)

Contributors

@dreaquil, @SkafteNicki, @stancld

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Large changes to classifications

TorchMetrics v0.10 is now out, significantly changing the whole classification package. This blog post will go over the reasons why the classification package needs to be refactored, what it means for our end users, and finally, what benefits it gives. A guide on how to upgrade your code to the recent changes can be found near the bottom.

Why the classification metrics need to change

We have known for a long time that there were some underlying problems with how we initially structured the classification package. Essentially, classification tasks can be divided into binary, multiclass, or multilabel, and determining which task a user is trying to run a given metric on is hard based on the input alone. A package such as sklearn can do this only because it supports input in very specific formats (no multi-dimensional arrays and no support for both integer and probability/logit formats).

This meant that some metrics, especially for binary tasks, could have been calculating something different than expected if the user provided a shape other than the expected one. This goes against a core value of TorchMetrics: our users should be able to trust that the metric they are evaluating gives the expected result.

Additionally, classification metrics were missing consistency. For some metrics, num_classes=2 meant binary, and for others, num_classes=1 meant binary. You can read more about the underlying reasons for this refactor in this and this issue.
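To make the ambiguity concrete, here is a minimal pure-Python sketch (not TorchMetrics code; the helper names `binary_precision` and `multiclass_micro_precision` are made up for illustration). The same integer predictions and targets yield a different precision depending on whether they are read as a binary task or as a two-class multiclass task with micro averaging:

```python
def binary_precision(preds, target):
    """Precision of the positive class (label 1) only."""
    tp = sum(1 for p, t in zip(preds, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, target) if p == 1 and t == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0


def multiclass_micro_precision(preds, target, num_classes):
    """Micro-averaged precision: pool TP and FP over every class."""
    tp = fp = 0
    for c in range(num_classes):
        tp += sum(1 for p, t in zip(preds, target) if p == c and t == c)
        fp += sum(1 for p, t in zip(preds, target) if p == c and t != c)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Identical input, two plausible readings of the task.
preds = [1, 1, 0, 0]
target = [1, 0, 0, 0]

as_binary = binary_precision(preds, target)
as_multiclass = multiclass_micro_precision(preds, target, num_classes=2)
print(as_binary, as_multiclass)  # 0.5 0.75
```

Nothing in the input itself says which of the two numbers the user wanted, which is why guessing the task from the data is fragile.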

The solution

The solution we went with was to split every classification metric into three separate metrics with the prefixes binary_*, multiclass_*, and multilabel_*. This solves a number of the above problems out of the box because it becomes easier for us to match our users' expectations for any given input shape. It additionally has some other benefits, both for us as developers and for end users:

Maintainability: by splitting the code into three distinctive functions, we are (hopefully) lowering the code complexity, making the codebase easier to maintain in the long term.

Speed: by completely removing the auto-detection of task at runtime, we can significantly increase computational speed (more on this later).

Task-specific arguments: by splitting into three functions, we also make it clearer which input arguments affect the computed result. Take Accuracy as an example: num_classes, top_k, and average are all arguments that influence the result if you are doing multiclass classification but do nothing for binary classification, and vice versa for the thresholds argument. The task-specific versions only contain the arguments that influence the given task.

There are many smaller quality-of-life improvements hidden throughout the refactor. Here are our top 3:
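The idea of task-specific signatures can be sketched in plain Python. The two functions below are hypothetical, simplified stand-ins and not the TorchMetrics implementations; the names `sketch_binary_accuracy` and `sketch_multiclass_accuracy` and the `threshold` parameter are assumptions made for this illustration. Each function exposes only the arguments that can change its result:

```python
def sketch_binary_accuracy(preds, target, threshold=0.5):
    """Binary case: only the decision threshold affects the result."""
    hard = [1 if p >= threshold else 0 for p in preds]
    return sum(h == t for h, t in zip(hard, target)) / len(target)


def sketch_multiclass_accuracy(preds, target, num_classes, top_k=1):
    """Multiclass case: num_classes and top_k matter, a threshold does not."""
    correct = 0
    for scores, t in zip(preds, target):
        if len(scores) != num_classes:
            raise ValueError("each row of preds must have num_classes scores")
        # Indices of the classes ordered from highest to lowest score.
        ranked = sorted(range(num_classes), key=lambda c: scores[c], reverse=True)
        correct += t in ranked[:top_k]
    return correct / len(target)


binary_acc = sketch_binary_accuracy([0.9, 0.4, 0.7, 0.2], [1, 0, 0, 0])

scores = [[0.1, 0.6, 0.3], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [2, 0, 2]
top1 = sketch_multiclass_accuracy(scores, labels, num_classes=3)
top2 = sketch_multiclass_accuracy(scores, labels, num_classes=3, top_k=2)
print(binary_acc, top1, top2)
```

Because no task detection happens at call time, a caller cannot pass an argument that is silently ignored for their task.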

Standardized arguments

The input arguments for the classification package are now much more standardized. Here are a few examples:

Each metric now only supports arguments that influence the final result. This means that num_classes is removed from all binary_* metrics, is now required for all multiclass_* metrics, and is renamed to num_labels for all multilabel_* metrics.

The ignore_index argument is now supported by ALL classification metrics and supports any value, not only values in the [0,num_classes] range (similar to torch loss functions).

We added a new validate_args argument to all classification metrics that lets users skip input validation, making the computations faster. By default, we still validate inputs because that is the safest option for the user. If you are confident that the input to the metric is correct, you can now disable this check for a potential speed-up (more on this later).
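The behaviour of ignore_index and validate_args described above can be illustrated with a small pure-Python sketch. This is not the library's code: `sketch_accuracy` is a hypothetical helper, and its checks are only an assumption about what input validation might look like. It shows a target value outside the class range being skipped, and what can happen when validation is switched off for input that is not actually clean:

```python
def sketch_accuracy(preds, target, num_classes,
                    ignore_index=None, validate_args=True):
    """Multiclass accuracy over integer labels, skipping ignored targets."""
    if validate_args:
        # Safety checks; these cost time and can be skipped for trusted input.
        if len(preds) != len(target):
            raise ValueError("preds and target must have the same length")
        for p in preds:
            if not 0 <= p < num_classes:
                raise ValueError(f"prediction {p} is outside [0, {num_classes})")
        for t in target:
            if t != ignore_index and not 0 <= t < num_classes:
                raise ValueError(f"target {t} is outside [0, {num_classes})")

    # Drop every sample whose target equals ignore_index (any value is allowed).
    kept = [(p, t) for p, t in zip(preds, target) if t != ignore_index]
    if not kept:
        return 0.0
    return sum(p == t for p, t in kept) / len(kept)


preds = [0, 2, 1, 1]
target = [0, 2, -1, 2]  # -1 marks a sample that should not be scored

acc = sketch_accuracy(preds, target, num_classes=3, ignore_index=-1)

# Without ignore_index the validation step rejects the -1 target ...
try:
    sketch_accuracy(preds, target, num_classes=3)
    rejected = False
except ValueError:
    rejected = True

# ... while skipping validation silently scores it as a wrong prediction.
unchecked = sketch_accuracy(preds, target, num_classes=3, validate_args=False)
print(acc, rejected, unchecked)
```

The last call is the trade-off in miniature: disabling validation removes the checks, so it is only safe when the input is already known to be well-formed.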

Constant memory implementations

Some of the most useful metrics for evaluating classification problems are metrics such as ROC, AUROC, AveragePrecision, etc., because they not only evaluate your model for a single threshold but a whole range of thresholds, essentially giving you the ability to see the trade-off between Type I and Type II errors. However, a big problem with the standard formulation of these metrics (which we have been using) is that they require access to all data for their calculation. Our implementation has been extremely memory-intensive for these kinds of metrics.

In v0.10 of TorchMetrics, all these metrics now have an argument called thresholds. By default, it is None and the metric will still save all targets and predictions in memory as you are used to. However, if this argument is instead set to a tensor, for example torch.linspace(0,1,100), the metric will instead use a constant-memory approximation by evaluating the metric under those provided thresholds.

Setting thresholds=None has an approximate memory footprint of O(num_samples) whereas using thresholds=torch.linspace(0,1,100) has an approximate memory footprint of O(num_thresholds). In this particular case, users will save memory when the metric is computed on more than 100 samples. Since evaluation in modern machine learning is often done on thousands to millions of data points, this feature can save a substantial amount of memory.
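A rough pure-Python sketch of the constant-memory idea follows. It is an illustration under stated assumptions, not the TorchMetrics implementation: the class name `BinnedPrecisionRecallSketch`, the `>=` comparison, and the convention of reporting precision as 1.0 when nothing is predicted positive are all choices made for this example. The state is three counters per threshold, so its size does not grow with the number of samples seen:

```python
class BinnedPrecisionRecallSketch:
    """Streaming precision/recall at fixed thresholds for a binary task."""

    def __init__(self, thresholds):
        self.thresholds = list(thresholds)
        n = len(self.thresholds)
        # State is O(num_thresholds): one TP/FP/FN counter per threshold.
        self.tp = [0] * n
        self.fp = [0] * n
        self.fn = [0] * n

    def update(self, preds, target):
        """Fold one batch into the counters; the batch itself is not stored."""
        for p, t in zip(preds, target):
            for i, thr in enumerate(self.thresholds):
                predicted_positive = p >= thr
                if predicted_positive and t == 1:
                    self.tp[i] += 1
                elif predicted_positive and t == 0:
                    self.fp[i] += 1
                elif not predicted_positive and t == 1:
                    self.fn[i] += 1

    def compute(self):
        precision = [tp / (tp + fp) if (tp + fp) else 1.0
                     for tp, fp in zip(self.tp, self.fp)]
        recall = [tp / (tp + fn) if (tp + fn) else 0.0
                  for tp, fn in zip(self.tp, self.fn)]
        return precision, recall


metric = BinnedPrecisionRecallSketch([i / 4 for i in range(5)])  # 0, 0.25, ..., 1
metric.update([0.9, 0.2], [1, 0])
metric.update([0.6, 0.4], [0, 1])
precision, recall = metric.compute()
state_size = len(metric.tp) + len(metric.fp) + len(metric.fn)
print(precision, recall, state_size)
```

The price of the fixed state is resolution: the curve is only known at the chosen thresholds, so the result is an approximation of the exact curve computed from all stored predictions.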

... (truncated)

Changelog

Sourced from torchmetrics's changelog.

[0.10.1] - 2022-10-21

Fixed

Fixed broken clone method for classification metrics (#1250)

Fixed unintentional downloading of nltk.punkt when lsum not in rouge_keys (#1258)

Fixed type casting in MAP metric between bool and float32 (#1150)

[0.10.0] - 2022-10-04

Added

Added a new NLP metric InfoLM (#915)

Added Perplexity metric (#922)

Added ConcordanceCorrCoef metric to regression package (#1201)

Added argument normalize to LPIPS metric (#1216)

Added support for multiprocessing of batches in PESQ metric (#1227)

Added support for multioutput in PearsonCorrCoef and SpearmanCorrCoef (#1200)

Changed

Classification refactor ( #1054, #1143, #1145, #1151, #1159, #1163, #1167, #1175, #1189, #1197, #1215, #1195 )

Changed update in FID metric to be done in online fashion to save memory (#1199)

Improved performance of retrieval metrics (#1242)

Changed SSIM and MSSSIM update to be online to reduce memory usage (#1231)

Deprecated

Deprecated BinnedAveragePrecision, BinnedPrecisionRecallCurve, BinnedRecallAtFixedPrecision (#1163)

BinnedAveragePrecision -> use AveragePrecision with thresholds arg

BinnedPrecisionRecallCurve -> use PrecisionRecallCurve with thresholds arg

BinnedRecallAtFixedPrecision -> use RecallAtFixedPrecision with thresholds arg

Renamed and refactored LabelRankingAveragePrecision, LabelRankingLoss and CoverageError (#1167)

LabelRankingAveragePrecision -> MultilabelRankingAveragePrecision

LabelRankingLoss -> MultilabelRankingLoss

CoverageError -> MultilabelCoverageError

... (truncated)

Commits

2c40575 releasing 0.10.1
e491f47 MAP: change bool to float32 (#1150)
d84736b bugfix: Evaluate pred_lsum only if lsum in rouge_keys (#1258)
aecfcc1 Fix broken clone method for classification metrics (#1250)
9b3fae4 [pre-commit.ci] pre-commit suggestions (#1247)
04d7329 branch: release/stable
7794b03 releasing v0.10
f97a323 docs: Try to fix links to source (#1240)
d97e8b9 Update docs on the compute groups feature in metric collection (#1237)
9b19a92 Online SSIM and MS-SSIM Computation (#1231)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

tordks / image_classification

Bump torchmetrics from 0.9.2 to 0.10.1 #123
