[FEATURE REQUEST] Genome redundancy estimations compatible with introns

tdelmont commented 2 months ago

The need

When genes contain introns (eukaryotes and some of their viruses), the estimated genomic redundancy can be highly inflated, wrongly suggesting that the genome is of poor quality.

The solution

In those situations, the same HMM within a single-copy core gene collection will tend to have multiple hits in the same contig. Our solution would be to consider contigs when estimating the redundancy of genomes, and only count 1 when the same HMM has multiple hits in the same contig. Simple.
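To make the counting rule concrete, here is a minimal sketch in plain Python. It is not anvi'o code: the function name, the `(hmm_name, contig_name)` input shape, and the toy gene and contig names are all hypothetical. It also assumes redundancy is the number of extra copies per HMM (hits beyond the first), summed and divided by the number of markers in the collection, which may differ from how anvi'o computes it internally.

```python
from collections import defaultdict


def estimate_redundancy(hits, num_markers, collapse_within_contig=False):
    """Return percent redundancy for a genome.

    hits: iterable of (hmm_name, contig_name) tuples, one per HMM hit
          (hypothetical input shape, not an anvi'o data structure).
    num_markers: number of HMMs in the single-copy core gene collection.
    collapse_within_contig: if True, several hits of one HMM in the same
          contig count as a single copy (the proposed correction).
    """
    contigs_per_hmm = defaultdict(list)
    for hmm_name, contig_name in hits:
        contigs_per_hmm[hmm_name].append(contig_name)

    extra_copies = 0
    for contigs in contigs_per_hmm.values():
        # Classic mode counts every hit; corrected mode counts distinct contigs.
        copies = len(set(contigs)) if collapse_within_contig else len(contigs)
        extra_copies += copies - 1

    return 100.0 * extra_copies / num_markers


# Toy data, made up for illustration only.
hits = [
    ("Gene_A", "contig_1"), ("Gene_A", "contig_1"),  # one gene split by an intron
    ("Gene_B", "contig_2"), ("Gene_B", "contig_3"),  # hits in two different contigs
    ("Gene_C", "contig_4"),
]

classic = estimate_redundancy(hits, num_markers=4)
corrected = estimate_redundancy(hits, num_markers=4, collapse_within_contig=True)
print(classic, corrected)
```

In this toy case only the Gene_A pair is collapsed, because both hits sit in the same contig. The Gene_B hits are in different contigs, so they still count as redundant in both modes.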

Programs that might provide a new flag are the following: anvi-run-hmms, anvi-summarize, anvi-estimate-genome-completeness.

I expect that a flag only for anvi-run-hmms might be the simplest solution. However, anvi'o dev wizards know best, and if they consider this request, I am looking forward to seeing how they elect to deal with it.

Or MAYBE, this strategy should become the norm instead of an alternative, as it would not at all impact quantifying true redundancy in intron-less genomes (redundant hits in multiple contigs) while improving the metric for intron-enriched genomes.

Beneficiaries

This would be beneficial to only a small number of people for now (those that dare to work on eukaryotes and their viruses with anvi'o). However, dealing better with introns is certainly a step in the right direction given the ambition of anvi'o :)

At the Genoscope, we are willing to test the new ability and provide feedback to the anvi'o devs, and possibly beyond if results look promising.

meren commented 2 months ago

Hey Tom, I think this is a great idea. Thank you for proposing it.

It would be immensely helpful if you used default anvi'o files to characterize a genome both the default way and the proposed way, and showed the difference. That way we can benchmark our code against those numbers while using the same genome for testing.

That would be a great help!

tdelmont commented 2 months ago

Perfect. We are currently testing this on a series of MAGs, and as soon as we identify a good example where the difference between the two approaches is significant, we will share here that one as a FASTA and our estimated redundancy results.

meren commented 2 months ago

we will share here that one as a FASTA and our estimated redundancy results

A contigs-db would be much better! :)

tdelmont commented 2 months ago

Here is the data for testing: one genome, as FASTA and as a contigs-db, as well as the directory for our HMM collection (19 markers) and protein sequences for the hits. In that genome, completion is ~90%, and our redundancy estimate is ~90% with the classic mode and 0% with the correction, as all redundant hits are in the same contig!

TEST_data.zip

Tom

meren commented 2 months ago

Holy crap. This is really a very compelling case! :)

Thank you, Tom.

tdelmont commented 2 months ago

Well, it was the most compelling one, so it does not reflect all situations! If need be, I can provide another genome example with more mixed signals. Thank you in advance for looking into the code base with this feature in mind.

meren commented 2 months ago

Yes, a mixed case would also help.

It will likely forever stay as a flag since we can't get used to the idea that we treat these SCGs in a very special way by default. It would ruin everything phylogenomics :(

tdelmont commented 2 months ago

Here is the second example.

Completion: 100%
Redundancy classic mode: 100%
Redundancy with correction: 53%

TESTING_DATA_example_2.zip

Tom
