Open mtisza1 opened 3 years ago
Hi Mike,
We would love to add Cenote-Taker 2/Unlimited Breadsticks, and other tools as well, and we are trying to develop this as an extensible platform for everyone. Would you please add a couple of Snakemake recipes for the installation and execution of Cenote-Taker 2/Unlimited Breadsticks and open a PR so we can see if our platform is working?
We have recipes for converting gbk -> fasta if that helps, and we can parse your output to identify proteins in prophage regions.
We are adding more genomes (and hopefully others will contribute genomes) and so we want this to be seamless so we can run it time and again as the underlying datasets and models change.
Regarding your other points:
TP/FP: We are excited to consider other possibilities (e.g. fraction of bp correct, etc.) but also trying to generate reliable metrics to compare across very different tools.
FN: regions of prophages that are not induced. Very keen to integrate that technology into our manually curated genomes.
Rob
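The "fraction of bp correct" idea mentioned above could be sketched roughly as follows. This is only an illustration, not the benchmark's actual scoring code: the function names and the use of 1-based inclusive (start, end) coordinates are assumptions made for the example.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_bp(intervals):
    """Total number of bases covered by merged intervals."""
    return sum(end - start + 1 for start, end in intervals)


def overlap_bp(a, b):
    """Bases shared by two merged, sorted interval lists."""
    shared = 0
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            shared += hi - lo + 1
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def bp_confusion(predicted, curated):
    """Per-base TP/FP/FN for predicted vs. curated prophage regions."""
    pred = merge_intervals(predicted)
    true = merge_intervals(curated)
    tp = overlap_bp(pred, true)
    return {"TP": tp, "FP": total_bp(pred) - tp, "FN": total_bp(true) - tp}


# Hypothetical coordinates, for illustration only.
counts = bp_confusion(predicted=[(100, 200)], curated=[(150, 250)])
print(counts)
```

Counting per base rather than per region means a prediction that only partly overlaps a curated prophage gets partial credit, which may make very different tools easier to compare.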
Rob,
Sounds great. I'll work on adding my tool via snakemake. I'm barely experienced with it, but I'm sure I'll get it up and running pretty quickly, then I'll make a pull request etc.
Thanks for your other thoughts too!
Best regards,
Mike
Hi Rob,
I was really excited to see that someone is doing a comparison of different prophage tools. If you are still open to adding more tools for comparison, I hope you try my tool Cenote-Taker 2. If you do use it, please use the settings in the readme for bacterial genomes:
-p True -db virion --minimum_length_linear 3000 --lin_minimum_hallmark_genes 2
I'm sure it will be on the slower side compared to some other methods, but part of that is because Cenote-Taker 2 generates genome maps for each virus prediction. If you instead use Unlimited Breadsticks, which might be more of a direct comparison time-wise, it should go a bit faster (still not gonna be lightning fast).
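For anyone scripting this, here is a rough sketch of assembling the command with the settings above. Only the four bacterial-genome flags come from this message. The script name `run_cenote-taker2.py`, the `-c`, `-r`, `-m` and `-t` options, the file names, and the helper function itself are assumptions for illustration and should be checked against the readme.

```python
import shlex


def cenote_taker2_command(contigs, run_title,
                          script="run_cenote-taker2.py",  # assumed script name
                          mem_gb=32, cpus=8):
    """Build an argument list for a Cenote-Taker 2 run (hypothetical helper)."""
    # Settings quoted above for bacterial genomes.
    bacterial_settings = [
        "-p", "True",
        "-db", "virion",
        "--minimum_length_linear", "3000",
        "--lin_minimum_hallmark_genes", "2",
    ]
    # Input/output/resource options are assumptions; verify them in the readme.
    base = ["python", script, "-c", contigs, "-r", run_title,
            "-m", str(mem_gb), "-t", str(cpus)]
    return base + bacterial_settings


cmd = cenote_taker2_command("genome.fasta", "genome_ct2")
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run(cmd, check=True)` or to paste into a Snakemake `shell:` directive after joining.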
I have some other thoughts: A really important part of prophage prediction is defining the prophage/cellular chromosome boundary. Do you have any plans to analyze how close each tool gets to the manually curated boundary? If so, I would also recommend comparing other approaches to CheckV, which does a pretty good job (if a bit conservative).
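As a rough sketch of what "how close each tool gets to the manually curated boundary" could mean in practice, one option is a signed offset at each end. The function name, coordinate convention (1-based inclusive), and example coordinates below are assumptions for illustration.

```python
def boundary_offsets(predicted, curated):
    """Signed distance (bp) between predicted and curated prophage boundaries.

    start_offset > 0: prediction starts inside the curated prophage (left end trimmed).
    start_offset < 0: prediction includes host sequence upstream of the prophage.
    end_offset > 0:   prediction runs past the curated end into host sequence.
    end_offset < 0:   prediction stops short of the curated end.
    """
    pred_start, pred_end = predicted
    cur_start, cur_end = curated
    start_offset = pred_start - cur_start
    end_offset = pred_end - cur_end
    return {
        "start_offset": start_offset,
        "end_offset": end_offset,
        "total_abs_offset": abs(start_offset) + abs(end_offset),
    }


# Hypothetical coordinates, for illustration only.
offsets = boundary_offsets(predicted=(1200, 41000), curated=(1000, 40000))
print(offsets)
```

Keeping the sign would show whether a tool tends to be conservative (trimming prophage sequence) or generous (pulling in host flanks).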
Further, I have some suggestions about expanding the prophage dataset that you may or may not like.
1) There are some datasets in SRA that consist of reads from the DNA of induced prophages and that also have corresponding bacterial reference genomes. I think these are good examples of prophages that don't rely on manual curation. I have several examples in my notes and I can send those along if you'd like.
2) I haven't tried this extensively, but I wonder about using a pangenome approach to carefully mine prophages. The idea is you would compare the genome content of several bacterial reference genomes for a species, then extract the "regions of plasticity" using panRGP from PPanGGOLiN. You could then predict which sequences represent prophage and get the coordinates from the original bacterial reference genome.
I'd be happy to discuss further.
Best regards,
Mike