linsalrob / ProphagePredictionComparisons

Comparisons of multiple different prophage predictions
MIT License

Accuracy of your genbank annotations #15

Open leannmlindsey opened 6 months ago

leannmlindsey commented 6 months ago

Hello, I wanted to test your compare_predictions_to_phages.py to make sure that it was working, so I ran it with the TSV file containing the reference locations for phages in NC_002655.

I was expecting to get perfect results, since I was using the reference intervals from the Casjens 2003 paper as reported on the PHASTER website statistics page. Instead I got these results:

(base) [u1323098@notch164:scripts]$ python3 compare_predictions_to_phages.py -t /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/BENCHMARKING/Philympics_dataset/Escherichia_coli_O157-H7_EDL933.gb -r reference.tsv --fp --fn -v
Reading reference.tsv
Reading /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/BENCHMARKING/Philympics_dataset/Escherichia_coli_O157-H7_EDL933.gb again to get the phage regions
Getting from 1879335 to 1897622
Getting from 3551577 to 3565707
Getting from 2966382 to 3015014
Getting from 2668339 to 2688870
Getting from 2285976 to 2330172
Getting from 300073 to 310251
Getting from 1897625 to 1908911
Getting from 1702185 to 1725748
Getting from 310756 to 323112
Getting from 1250521 to 1295458
Getting from 1330857 to 1391923
Getting from 1678706 to 1693737
Getting from 1849488 to 1879269
Getting from 1909139 to 1930250
Getting from 892845 to 930943
Getting from 1730065 to 1756006
Getting from 1626722 to 1673485
Getting from 1655548 to 1696145
Getting from 2743223 to 2788348
Getting from 2118738 to 2165694
Getting from 3263064 to 3270404
Getting from 1521574 to 1530771
Found 789 predicted prophage features
Reading /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/BENCHMARKING/Philympics_dataset/Escherichia_coli_O157-H7_EDL933.gb
Comparing real and predicted
Found:

Test set: Phage: 676 Not phage: 4832

Predictions: Phage: 789 Not phage: 4709

TP: 641 FP: 158 TN: 4674 FN: 35

Accuracy: 0.965 (this is the ratio of the correctly labeled phage genes to the whole pool of genes
Precision: 0.802 (This is the ratio of correctly labeled phage genes to all predictions)
Recall: 0.948 (This is the fraction of actual phage genes we got right)
Specificity: 0.967 (This is the fraction of non phage genes we got right)
f1_score: 0.869 (this is the harmonic mean of precision and recall, and is the best measure when, as in this case, there is a big difference between the number of phage and non-phage genes)
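For reference, the reported scores can be recomputed from the TP/FP/TN/FN counts above using the standard confusion-matrix definitions. This is only a minimal sketch: the function name `confusion_metrics` is made up for illustration, and it is not taken from compare_predictions_to_phages.py, whose internals may differ.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix scores (hypothetical helper, not the script's code)."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        # Harmonic mean of precision and recall.
        "f1_score": 2 * precision * recall / (precision + recall),
    }


# Counts copied from the output above.
metrics = confusion_metrics(tp=641, fp=158, tn=4674, fn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, these values agree with the five scores printed by the script. The counts also imply TP + FP = 799 predicted phage genes, while the Predictions line reports 789; the sketch does not explain that difference.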

It seems that there are some differences between the reference intervals listed in your supplementary table and the intervals listed on the PHASTER website.
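One way to pin down such differences would be to pair up the intervals from the two tables and report the boundary offsets. The sketch below is an assumption about how that check could look: `compare_intervals` is a hypothetical helper, the first list reuses three intervals from the log above, and the second list is invented purely for illustration (it is not the actual PHASTER or supplementary table data).

```python
def compare_intervals(table_a, table_b):
    """Pair each interval in table_a with the overlapping intervals in table_b.

    Returns a list of (interval_a, matches), where each match is
    (interval_b, start_offset, end_offset).
    """
    report = []
    for a_start, a_end in table_a:
        matches = [
            ((b_start, b_end), b_start - a_start, b_end - a_end)
            for b_start, b_end in table_b
            if a_start <= b_end and b_start <= a_end  # closed-interval overlap
        ]
        report.append(((a_start, a_end), matches))
    return report


# Three intervals taken from the "Getting from ... to ..." lines of the log.
from_log = [(300073, 310251), (310756, 323112), (892845, 930943)]

# HYPOTHETICAL second table: placeholder values, not real reference data.
other_table = [(300073, 310251), (310856, 323012)]

report = compare_intervals(from_log, other_table)
for interval, matches in report:
    if not matches:
        print(interval, "has no overlapping interval in the other table")
    for other, d_start, d_end in matches:
        if (d_start, d_end) == (0, 0):
            print(interval, "matches exactly")
        else:
            print(interval, "vs", other, f"offsets {d_start:+d}/{d_end:+d}")
```

Running the same comparison on the real tables would show which regions are missing entirely and which only differ in their boundaries.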

Do you have a list of the sources for the annotations that you are using? Thank you, LeAnn