helme / ecg_ptbxl_benchmarking

Public repository associated with "Deep Learning for ECG Analysis: Benchmarks and Insights from PTB-XL"
GNU General Public License v3.0

Unable to reproduce Fmax metric for PTB-XL #31

Closed ambroslins closed 6 months ago

ambroslins commented 6 months ago

I managed to train the xresnet1d101 model on the PTB-XL dataset using the reproduce_results.py script for several experiments. However, when I evaluate the model predictions, I get different results than the ones reported in Table II.

To evaluate the predictions, I used the following script, adapting the code from the utils module to compute the optimal thresholds for the F1 score:

from pathlib import Path

import numpy as np
from sklearn.metrics import roc_auc_score
from tqdm import tqdm
from utils import utils

def find_optimal_cutoff_threshold_for_fbeta(
    target, predicted, beta, n_thresholds=100
):
    # Sweep candidate thresholds in [0, 1] and keep the one that maximizes
    # the macro F_beta returned by the repository's challenge_metrics.
    thresholds = np.linspace(0.00, 1, n_thresholds)
    scores = [
        utils.challenge_metrics(
            target, predicted > t, beta1=beta, beta2=beta, single=True
        )["F_beta_macro"]
        for t in thresholds
    ]
    optimal_idx = np.argmax(scores)
    return thresholds[optimal_idx]

def find_optimal_cutoff_thresholds_for_fbeta(y_true, y_pred, beta):
    # Optimize one threshold per label (column) independently.
    print(f"optimize thresholds with respect to F{beta}")
    return [
        find_optimal_cutoff_threshold_for_fbeta(
            y_true[:, k][:, np.newaxis], y_pred[:, k][:, np.newaxis], beta
        )
        for k in tqdm(range(y_true.shape[1]))
    ]

beta = 1
path = Path("../output")
for exp in ["exp0", "exp1", "exp1.1", "exp1.1.1"]:
    print(f"experiment: {exp}")
    # Ground-truth test labels and xresnet1d101 test-set predictions
    # written by reproduce_results.py.
    y_test = np.load(path / exp / "data" / "y_test.npy", allow_pickle=True)
    y_test_pred = np.load(
        path / exp / "models" / "fastai_xresnet1d101" / "y_test_pred.npy",
        allow_pickle=True,
    )

    # Binarize the predictions with the per-label thresholds, then compute
    # the challenge metrics and the macro AUC on the raw scores.
    thresholds = find_optimal_cutoff_thresholds_for_fbeta(
        y_test, y_test_pred, beta
    )
    y_pred_binary = utils.apply_thresholds(y_test_pred, thresholds)
    metrics = utils.challenge_metrics(
        y_test, y_pred_binary, beta1=beta, beta2=beta
    )
    metrics["macro_auc"] = roc_auc_score(y_test, y_test_pred, average="macro")
    print(metrics)

For the exp0 experiment I also evaluated the provided y_test_pred.npy file from this repository.

Using this setup I get the following results:

Experiment   Level         F1 max   Fmax (paper)
exp0         all           0.396    0.764
exp1         diag.         0.392    0.736
exp1.1       sub-diag.     0.523    0.760
exp1.1.1     super-diag.   0.722    0.815

helme commented 6 months ago

Hi @ambroslins , I haven't looked into it in detail yet, but I have some questions and comments:

  1. Which table are you referring to? Table II in the referenced paper is something different (i.e. not about subtasks as in your issue).
  2. And which metric are you referring to? Fmax? F1 max? Gbeta? Fbeta?
  3. In any case, we used beta1=beta2=2 for all our experiments, as opposed to your beta=1. Probably this is what you are missing? (See the sketch after this list.)
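
For illustration, a minimal sketch of point 3: the same evaluation as in the script above, but with beta set to 2. It reuses the helper functions, variables, and output paths defined in that script, so it is a continuation rather than standalone code:

# Hedged sketch: rerun the threshold search and metrics from the script
# above with beta = 2 (i.e. beta1 = beta2 = 2 as used in the experiments).
beta = 2
thresholds = find_optimal_cutoff_thresholds_for_fbeta(y_test, y_test_pred, beta)
y_pred_binary = utils.apply_thresholds(y_test_pred, thresholds)
metrics = utils.challenge_metrics(y_test, y_pred_binary, beta1=beta, beta2=beta)
print(metrics)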

I don't know if this helps, but in any case I need some clarification here.

Best, Patrick

ambroslins commented 6 months ago

Hi @helme, thanks for the quick response.

  1. My bad, I am referring to Table II from the preprint version.
  2. I was referring to the maximum F1 score as described in the preprint paper (see the sketch after this list):

    To summarize F1(τ) by a single number, the threshold is varied and the maximum score, from now on referred to as Fmax, is reported.

  3. I was only looking at the PTB-XL dataset, and the newer paper only reports the AUC, which is similar to my results.
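
For reference, a minimal sketch of the Fmax definition quoted in point 2: one shared decision threshold is swept over [0, 1] and the maximum F1 score is reported. The use of sklearn's f1_score and macro averaging here are assumptions made for illustration, not details taken from the preprint:

import numpy as np
from sklearn.metrics import f1_score

def f_max(y_true, y_prob, n_thresholds=100):
    # Sweep one shared threshold over all labels and report the best F1
    # score (macro averaging is an assumption of this sketch).
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    scores = [
        f1_score(y_true, y_prob > t, average="macro", zero_division=0)
        for t in thresholds
    ]
    return max(scores)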

I am sorry for the confusion and thanks for your time. Best, Ambros