Request to add tool and other thoughts

mtisza1 commented 3 years ago

Hi Rob,

I was really excited to see that someone is doing a comparison of different prophage tools. If you are still open to adding more tools for comparison, I hope you try my tool Cenote-Taker 2. If you do use it, please use the settings in the readme for bacterial genomes: -p True -db virion --minimum_length_linear 3000 --lin_minimum_hallmark_genes 2.

I'm sure it will be on the slower side compared to some other methods, but part of that is because Cenote-Taker 2 generates genome maps for each virus prediction. If you instead use Unlimited Breadsticks, which might be more of a direct comparison time-wise, it should go a bit faster (still not gonna be lightning fast).

I have some other thoughts: A really important part of prophage prediction is definition of the prophage/cellular chromosome boundary. Do you have any plans to analyze how close each tool gets to the manually curated boundary? If so, I would also recommend comparing other approaches to CheckV, which does a pretty good job (if a bit conservative).
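To make the boundary question concrete, here is a minimal sketch of one way to measure how far a predicted prophage boundary sits from the curated one. The `Region` type, the `boundary_offsets` function, and the example coordinates are all hypothetical and are not taken from any of the tools discussed; coordinates are assumed to be 1-based and inclusive, on the same contig.

```python
from typing import NamedTuple, Tuple


class Region(NamedTuple):
    """A prophage region on one contig (hypothetical type for illustration)."""
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def boundary_offsets(curated: Region, predicted: Region) -> Tuple[int, int]:
    """Return (left, right) boundary offsets in bp.

    A positive value means the prediction extends past the curated
    boundary into the host chromosome; a negative value means the
    prediction stops short, inside the curated prophage.
    """
    left = curated.start - predicted.start
    right = predicted.end - curated.end
    return left, right


# Made-up coordinates, for illustration only.
curated = Region(start=10_000, end=50_000)
predicted = Region(start=9_200, end=48_500)
print(boundary_offsets(curated, predicted))  # (800, -1500)
```

In this example the prediction overshoots the left boundary by 800 bp and stops 1,500 bp short of the right boundary. Keeping the sign separates tools that are conservative from tools that pull in flanking host sequence.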

Further, I have some suggestions about expanding the prophage dataset that you may or may not like.

1) There are some datasets in SRA that consist of reads from the DNA of induced prophages which also have corresponding bacterial reference genomes. I think these are good examples of prophage that don't rely on manual curation. I have several examples in my notes and I can send those along if you'd like.

2) I haven't tried this extensively, but I wonder about using a pangenome approach to carefully mine prophages. The idea is you would compare the genome content of several bacterial reference genomes for a species, then extract the "regions of plasticity" using panRGP from PPanGGOLiN. You could then predict which sequences represent prophage and get the coordinates from the original bacterial reference genome.
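The last step, getting coordinates back on the original reference genome, can be sketched as below. This only illustrates the coordinate arithmetic and does not call panRGP or any prediction tool. The function name and the example numbers are hypothetical, and it assumes the region was extracted from the forward strand with 1-based, inclusive coordinates.

```python
from typing import Tuple


def to_genome_coords(region_start: int, local_start: int, local_end: int) -> Tuple[int, int]:
    """Map a prediction made on an extracted region back to the genome.

    region_start: genome position (1-based) of the first base of the
                  extracted region of plasticity.
    local_start, local_end: prediction coordinates within the extracted
                  region (1-based, inclusive).
    """
    offset = region_start - 1
    return local_start + offset, local_end + offset


# Made-up example: a region starting at genome position 250,001, with a
# prophage predicted at positions 1,200 to 38,700 within that region.
print(to_genome_coords(250_001, 1_200, 38_700))  # (251200, 288700)
```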

I'd be happy to discuss further.

Best regards,

Mike

linsalrob commented 3 years ago

Hi Mike,

We would love to add Cenote-Taker 2/Unlimited Breadsticks, and other tools as well, and we are trying to develop this as an extensible platform for everyone. Would you please add a couple of snakemake recipes for the installation and execution of Cenote-Taker 2/Unlimited Breadsticks and open a PR so we can see if our platform is working?

We have recipes for converting gbk -> fasta if that helps, and we can parse your output to identify proteins in prophage regions.

We are adding more genomes (and hopefully others will contribute genomes) and so we want this to be seamless so we can run it time and again as the underlying datasets and models change.

Regarding your other points:

For these genomes we have manually curated the exact ends of the prophages, so we know where they are, but:
i. most tools don't look for attL and attR
ii. we don't have a consistent way of marking the prophage locations between tools
iii. it's not clear exactly how to score that versus just scoring proteins as TP/FP
We are excited to consider other possibilities (e.g. fraction of bp correct), but we are also trying to generate reliable metrics to compare across very different tools.
There are some really cool new tools coming along to parse prophages out of reads from the SRA, but it is a different (but overlapping) problem. At the moment the biggest issue with those tools seems to be accurately identifying the ends of the regions, and there also seem to be several FN regions of prophages that are not induced. Very keen to integrate that technology into our manually curated genomes.
Pangenomes are definitely the way to go, and have been for a while ....
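As a rough illustration of the "fraction of bp correct" idea mentioned above, the sketch below computes base-pair-level precision and recall between curated and predicted prophage intervals on a single contig. It is not how the comparison currently scores tools (that is protein-level TP/FP). All function names and coordinates are hypothetical, and intervals are assumed to be 1-based and inclusive.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based, inclusive


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_bp(intervals: List[Interval]) -> int:
    """Total number of bases covered by non-overlapping intervals."""
    return sum(end - start + 1 for start, end in intervals)


def overlap_bp(a: List[Interval], b: List[Interval]) -> int:
    """Bases shared between two sets of already-merged intervals."""
    shared = 0
    for s1, e1 in a:
        for s2, e2 in b:
            lo, hi = max(s1, s2), min(e1, e2)
            if lo <= hi:
                shared += hi - lo + 1
    return shared


def bp_precision_recall(curated: List[Interval],
                        predicted: List[Interval]) -> Tuple[float, float]:
    """Base-pair-level precision and recall of predicted vs. curated regions."""
    cur, pred = merge(curated), merge(predicted)
    tp = overlap_bp(cur, pred)  # bases that are both predicted and curated
    precision = tp / total_bp(pred) if pred else 0.0
    recall = tp / total_bp(cur) if cur else 0.0
    return precision, recall


# Made-up example: a 1,000 bp curated prophage and a 1,000 bp prediction
# shifted 500 bp to the right.
print(bp_precision_recall([(1001, 2000)], [(1501, 2500)]))  # (0.5, 0.5)
```

A score like this does not depend on how each tool calls genes, but it weights long prophages more heavily than short ones, so it would complement the protein-level TP/FP scoring and not replace it.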

Rob

mtisza1 commented 3 years ago

Rob,

Sounds great. I'll work on adding my tool via snakemake. I'm barely experienced with it, but I'm sure I'll get it up and running pretty quickly, then I'll make a pull request etc.

Thanks for your other thoughts too!

Best regards,

Mike

linsalrob / ProphagePredictionComparisons

Request to add tool and other thoughts #4