helme / ecg_ptbxl_benchmarking

Public repository associated with "Deep Learning for ECG Analysis: Benchmarks and Insights from PTB-XL"
GNU General Public License v3.0

Number of samples and class memberships #14

Closed. vandres98 closed this issue 3 years ago.

vandres98 commented 3 years ago

Hi, I am a little confused about the number of samples in the notebook Finetuning-Example.ipynb. I see 21430 samples in total (train and validation together). However, the paper on PhysioNet states that there are 21837 records, and that's also what I see in the data/ptbxl/records100 folder. Why are 407 records missing?

Another question: how can I interpret the label sets y_train and y_val? Which representation (10000, 01000, 00100, 00010, 00001) corresponds to which class (normal, MI, STTC, CD, HYP)? I cannot map them using the number of samples per class because the counts don't match.

Can you help me and clear that up? Thank you very much!

helme commented 3 years ago

Hi @vandres98 , thanks for your interest in our work.

  1. There are only 21430 samples in the "superdiagnostic" case, since 407 samples have no diagnostic statement at all. In utils.select_data we filter out those samples so that every remaining sample has at least one diagnostic label. I hope this answers your first question. [screenshot: label_counts]

  2. For the corresponding class labels you can use the fourth return value of utils.select_data, which is an instance of MultiLabelBinarizer with an attribute classes_ containing the ordering of the classes as a list. This instance is also stored in output/mlb.pkl as a pickle file.
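
The filtering in point 1 can be illustrated with a minimal sketch. It uses toy data and plain Python, and it is not the code from utils.select_data; the names labels_per_record, label_lens and kept are hypothetical and only serve the illustration.

```python
from collections import Counter

# Toy stand-in for the per-record diagnostic label lists (hypothetical data).
labels_per_record = [
    ["NORM"],
    ["MI", "STTC"],
    [],                      # record without any diagnostic statement
    ["CD"],
    ["HYP", "CD", "STTC"],
    [],                      # another record without a diagnostic statement
]

# Number of labels attached to each record.
label_lens = [len(labels) for labels in labels_per_record]

# Histogram: how many records carry 0, 1, 2, ... labels.
len_counts = Counter(label_lens)

# Keep only records with at least one diagnostic label.
kept = [labels for labels in labels_per_record if len(labels) > 0]

print(dict(sorted(len_counts.items())))
print(f"{len(kept)} of {len(labels_per_record)} records kept")
```

The number of dropped records equals the number of records with zero labels, which is the same relationship as 21837 - 21430 = 407 in the real data.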

I hope this answers your questions ;)

Best helme

vandres98 commented 3 years ago

Hi helme,

thank you for the answer!

1: Thank you, that makes sense! I am confused about your class count, though. The PTB-XL paper states the following memberships: [screenshot from 2021-08-23 16-40-57] Because it's multi-label, the sum of the counts is of course greater than the sample size.

2: I already got the y data with the binarized labels from your Finetuning-Example.ipynb. However, I have trouble assigning them to the (medical) classes as in the table above. Could you please provide the mapping? That would be a huge help!

Thank you and best regards Viktoria

helme commented 3 years ago

Hi @vandres98 ,

  1. This was not about class counts but about the number of labels per sample, i.e. superdiagnostic_len encodes the number of labels associated with each sample (multi-label). So most samples (16272) have one label, 4079 samples have two diagnostic labels, etc.
  2. As already explained, an instance of MultiLabelBinarizer is returned by utils.select_data, and it has an attribute classes_. Alternatively, you can load the pickle file (stored in /output/mlb.pkl). E.g.: [screenshot]

So ['CD', 'HYP', 'MI', 'NORM', 'STTC'] is the ordering in this case, e.g. [0,0,0,1,0] is a sample associated with the diagnostic class 'NORM'.
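
As a rough sketch of how this ordering can be used, the snippet below decodes binarized rows into class names and counts the members of each class. The helpers load_classes and decode_row and the toy matrix y_toy are hypothetical names introduced for illustration. The class list is hard-coded from the ordering above so the example runs without the pickle file; the pickle path in the comment is an assumption and may need adjusting to your setup.

```python
import pickle


def load_classes(mlb_path):
    """Return the class ordering of a pickled MultiLabelBinarizer.

    Unpickling requires scikit-learn to be installed.
    """
    with open(mlb_path, "rb") as f:
        mlb = pickle.load(f)
    return list(mlb.classes_)


def decode_row(row, classes):
    """Map one binarized label row to the class names it marks with 1."""
    return [name for flag, name in zip(row, classes) if flag == 1]


# Ordering quoted in this thread, hard-coded here for a self-contained example.
classes = ["CD", "HYP", "MI", "NORM", "STTC"]
# classes = load_classes("output/mlb.pkl")  # assumed path, adjust as needed

# Toy label matrix standing in for y_train / y_val (hypothetical data).
y_toy = [
    [0, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]

decoded = [decode_row(row, classes) for row in y_toy]

# Per-class membership counts: column sums of the label matrix.
class_counts = {name: sum(row[i] for row in y_toy) for i, name in enumerate(classes)}

print(decoded)
print(class_counts)
```

Because the labels are multi-label, the per-class counts computed this way can sum to more than the number of samples.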

I hope this answers your questions.

Best, @helme