rnajena / bertax

Taxonomic classification of DNA sequences
GNU General Public License v3.0

On the inconsistency of taxids and BERTax taxonomy labels, and the calculation of the AveP evaluation metric #14

Open NickShanyt opened 11 months ago

NickShanyt commented 11 months ago

Hi! I'm interested in your work and I'm trying to reproduce the results on the data you released, but I'm having some problems.

1. The released sequence data contains taxids, and I used NCBI to map these taxids to taxonomic lineages, obtaining the corresponding label at each taxonomic level for every sequence. However, many of the labels obtained this way do not correspond to the labels in the BERTax model (5 superkingdoms, 44 phyla, 156 genera), and I have corrected some of them manually.

Although I have made these corrections in the final dataset, correcting the genus-level labels is difficult in the similar and non-similar datasets. Is this an inherent problem? Is there a possible solution?

2. I would also like to ask whether the Accuracy and AveP metrics mentioned in the paper are accuracy and precision as commonly understood. Is it possible to calculate the same metrics as in the paper with from sklearn.metrics import accuracy_score and from sklearn.metrics import precision_score?

Thank you for your work.

f-kretschmer commented 10 months ago

Sorry for the late answer.

  1. Since we ran our evaluations, the NCBI taxonomy has likely changed. Here is a taxdump for the version we used: https://upload.uni-jena.de/data/656deff28d9cd2.73093822/taxdump.tar.gz. It can be used with ete3 (https://github.com/f-kretschmer/bertax_training/blob/master/utils/tax_entry.py); see the sketch after this list.
  2. The average precision was calculated from micro-averaged Precision-Recall curves (sklearn.metrics.average_precision_score). For the accuracy, we used a balanced version due to the unbalanced data: taking the mean over all superkingdom classes, as described in the paper. Additionally, there are confusion matrices for everything here: https://github.com/f-kretschmer/bertax/tree/master/confusion_matrices.
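A minimal sketch of resolving a taxid against a specific taxdump version with ete3; the taxdump path and the example taxid are illustrative, not prescribed by this thread:

```python
from ete3 import NCBITaxa

# On first use, this builds a local SQLite taxonomy database from the
# downloaded taxdump archive instead of the current NCBI release.
ncbi = NCBITaxa(taxdump_file="taxdump.tar.gz")

taxid = 9606  # example taxid (Homo sapiens), purely for illustration
lineage = ncbi.get_lineage(taxid)           # lineage as a list of taxids
ranks = ncbi.get_rank(lineage)              # taxid -> rank name
names = ncbi.get_taxid_translator(lineage)  # taxid -> scientific name

# Pick out the three ranks that BERTax predicts.
for rank in ("superkingdom", "phylum", "genus"):
    hits = [t for t in lineage if ranks[t] == rank]
    print(rank, names[hits[0]] if hits else None)
```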

Hope this helps!

f-kretschmer commented 10 months ago

I'm sorry, I think the taxdump.tar.gz linked above is the incorrect version; this should be the correct one: https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_archive/new_taxdump_2021-04-01.zip

yongrenr commented 5 months ago

Hello! I'm interested in your work and I'm trying to reproduce the results on the data you released, but I'm having some problems.

"The average precision was calculated from micro-averaged Precision-Recall curves (sklearn.metrics.average_precision_score). For the accuracy, we used a balanced version due to the unbalanced data: taking the mean over all superkingdom classes, as described in the paper. Additionally, there are confusion matrices for everything here: https://github.com/f-kretschmer/bertax/tree/master/confusion_matrices."

I wonder whether this balanced accuracy calculation is only used for the superkingdom classes, or whether it is also applied to the phylum and genus classes?

f-kretschmer commented 5 months ago

Hi!

Both the balanced accuracy calculation (sklearn.metrics.balanced_accuracy_score) and the average precision calculation (sklearn.metrics.average_precision_score) are used for all ranks.
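A minimal sketch of both calculations with scikit-learn; the toy labels and probabilities below are illustrative, assuming one softmax output per rank:

```python
import numpy as np
from sklearn.metrics import average_precision_score, balanced_accuracy_score

# Toy data for a 3-class rank: true class indices and predicted probabilities.
y_true = np.array([0, 1, 2, 1])
y_score = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.2, 0.3, 0.5],
                    [0.3, 0.4, 0.3]])

# AveP from micro-averaged Precision-Recall curves: the ground truth is
# binarized (one column per class) and all class/sample pairs are pooled.
y_true_onehot = np.eye(y_score.shape[1])[y_true]
avep = average_precision_score(y_true_onehot, y_score, average="micro")

# Balanced accuracy: the mean of per-class recalls, on hard predictions.
bacc = balanced_accuracy_score(y_true, y_score.argmax(axis=1))

print(f"AveP (micro): {avep:.3f}, balanced accuracy: {bacc:.3f}")
```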

yongrenr commented 5 months ago

Hi!

"Both the balanced accuracy calculation (sklearn.metrics.balanced_accuracy_score) and the average precision calculation (sklearn.metrics.average_precision_score) are used for all ranks."

Thank you very much for your prompt reply! I'm curious which metrics are used in your PNAS paper. Thank you!

(attached: screenshot of a results table from the paper)

f-kretschmer commented 5 months ago

In this table it is Average Precision (AveP), but we also have Precision-Recall plots, ROC curves, and balanced accuracy.

yongrenr commented 5 months ago

"In this table it is Average Precision (AveP), but we also have Precision-Recall plots, ROC curves, and balanced accuracy."

So comprehensive! I have one more small question. On the Closely and Distantly related datasets, the phylum-level performance is only average, yet phylum prediction works very well on the Final dataset. Why is that? I would like to ask whether you did anything else besides changing the number of attention heads. Thank you very much!

f-kretschmer commented 5 months ago

The "final" dataset has a lot more data and also an additional output layer for "genus" prediction. Everything is detailed in the section "Performance of Final BERTax Model" of the PNAS paper. See especially SFig. 2, which visualizes why adding the genus layer leads to better performance.
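A minimal, hypothetical sketch (not the actual BERTax code) of a shared encoder feeding one softmax head per taxonomic rank, to illustrate what "an additional output layer for genus prediction" means; the layer sizes are placeholders, and the class counts reuse the label counts mentioned earlier in this thread:

```python
import tensorflow as tf

# Placeholder "encoder output": in BERTax this representation would come
# from the BERT model rather than a raw input vector.
seq_repr = tf.keras.Input(shape=(512,), name="sequence_representation")
shared = tf.keras.layers.Dense(256, activation="relu")(seq_repr)

# One classification head per rank (5 superkingdoms, 44 phyla, 156 genera).
superkingdom = tf.keras.layers.Dense(5, activation="softmax", name="superkingdom")(shared)
phylum = tf.keras.layers.Dense(44, activation="softmax", name="phylum")(shared)
genus = tf.keras.layers.Dense(156, activation="softmax", name="genus")(shared)

model = tf.keras.Model(seq_repr, [superkingdom, phylum, genus])
model.compile(
    optimizer="adam",
    # One loss per head; gradients from all ranks shape the shared layers,
    # which is one way an added genus head can also help coarser ranks.
    loss={"superkingdom": "sparse_categorical_crossentropy",
          "phylum": "sparse_categorical_crossentropy",
          "genus": "sparse_categorical_crossentropy"},
)
model.summary()
```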