helme / ecg_ptbxl_benchmarking

Public repository associated with "Deep Learning for ECG Analysis: Benchmarks and Insights from PTB-XL"
GNU General Public License v3.0

Unable to reproduce Fmax metric for PTB-XL #31

Closed ambroslins closed 6 months ago

ambroslins commented 6 months ago

I managed to train the xresnet1d101 model on the PTB-XL dataset using the reproduce_results.py script for several experiments. However, when I evaluate the model predictions, I get different results than the ones reported in Table II.

To evaluate the predictions, I used the following script, adapting the code from the utils module to compute the optimal thresholds for the F1 score:

from pathlib import Path

import numpy as np
from sklearn.metrics import roc_auc_score
from tqdm import tqdm
from utils import utils

def find_optimal_cutoff_threshold_for_fbeta(
    target, predicted, beta, n_thresholds=100
):
    # Sweep candidate thresholds in [0, 1] and keep the one that maximizes
    # the macro F_beta returned by the repository's challenge_metrics.
    thresholds = np.linspace(0.00, 1, n_thresholds)
    scores = [
        utils.challenge_metrics(
            target, predicted > t, beta1=beta, beta2=beta, single=True
        )["F_beta_macro"]
        for t in thresholds
    ]
    optimal_idx = np.argmax(scores)
    return thresholds[optimal_idx]

def find_optimal_cutoff_thresholds_for_fbeta(y_true, y_pred, beta):
    # Optimize one threshold per label (column) independently.
    print(f"optimize thresholds with respect to F{beta}")
    return [
        find_optimal_cutoff_threshold_for_fbeta(
            y_true[:, k][:, np.newaxis], y_pred[:, k][:, np.newaxis], beta
        )
        for k in tqdm(range(y_true.shape[1]))
    ]

beta = 1
path = Path("../output")
for exp in ["exp0", "exp1", "exp1.1", "exp1.1.1"]:
    print(f"experiment: {exp}")
    # Ground-truth test labels and xresnet1d101 test-set predictions
    # written by reproduce_results.py.
    y_test = np.load(path / exp / "data" / "y_test.npy", allow_pickle=True)
    y_test_pred = np.load(
        path / exp / "models" / "fastai_xresnet1d101" / "y_test_pred.npy",
        allow_pickle=True,
    )

    # Binarize the predictions with the per-label thresholds, then compute
    # the challenge metrics and the macro AUC on the raw scores.
    thresholds = find_optimal_cutoff_thresholds_for_fbeta(
        y_test, y_test_pred, beta
    )
    y_pred_binary = utils.apply_thresholds(y_test_pred, thresholds)
    metrics = utils.challenge_metrics(
        y_test, y_pred_binary, beta1=beta, beta2=beta
    )
    metrics["macro_auc"] = roc_auc_score(y_test, y_test_pred, average="macro")
    print(metrics)

For the exp0 experiment I also evaluated the provided y_test_pred.npy file from this repository.

Using this setup I get the following results:

Experiment   Level         F1 max   Fmax (paper)
exp0         all           0.396    0.764
exp1         diag.         0.392    0.736
exp1.1       sub-diag.     0.523    0.760
exp1.1.1     super-diag.   0.722    0.815

helme commented 6 months ago

Hi @ambroslins , I haven't looked into it in detail yet, but I have some questions and comments:

  1. Which table are you referring to? Table II in the referenced paper is something different (i.e. not about subtasks as in your issue).
  2. And which metric are you referring to? Fmax? F1 max? Gbeta? Fbeta?
  3. In any case, we used beta1=beta2=2 for all our experiments, as opposed to your beta=1. Probably this is what you are missing? (See the sketch after this list.)
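
For illustration, a minimal sketch of point 3: the same evaluation as in the script above, but with beta set to 2. It reuses the helper functions, variables, and output paths defined in that script, so it is a continuation rather than standalone code:

# Hedged sketch: rerun the threshold search and metrics from the script
# above with beta = 2 (i.e. beta1 = beta2 = 2 as used in the experiments).
beta = 2
thresholds = find_optimal_cutoff_thresholds_for_fbeta(y_test, y_test_pred, beta)
y_pred_binary = utils.apply_thresholds(y_test_pred, thresholds)
metrics = utils.challenge_metrics(y_test, y_pred_binary, beta1=beta, beta2=beta)
print(metrics)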

I don't know if this helps, but in any case I need some clarification here.

Best, Patrick

ambroslins commented 6 months ago

Hi @helme, thanks for the quick response.

  1. My bad, I am referring to Table II from the preprint version.
  2. I was referring to the maximum F1 score as described in the preprint paper (see the sketch after this list):

    To summarize F1(τ) by a single number, the threshold is varied and the maximum score, from now on referred to as Fmax, is reported.

  3. I was only looking at the PTB-XL dataset, and the newer paper only reports the AUC, which is similar to my results.
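
For reference, a minimal sketch of the Fmax definition quoted in point 2: one shared decision threshold is swept over [0, 1] and the maximum F1 score is reported. The use of sklearn's f1_score and macro averaging here are assumptions made for illustration, not details taken from the preprint:

import numpy as np
from sklearn.metrics import f1_score

def f_max(y_true, y_prob, n_thresholds=100):
    # Sweep one shared threshold over all labels and report the best F1
    # score (macro averaging is an assumption of this sketch).
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    scores = [
        f1_score(y_true, y_prob > t, average="macro", zero_division=0)
        for t in thresholds
    ]
    return max(scores)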

I am sorry for the confusion and thanks for your time. Best, Ambros