merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Pangenomics using transcriptomes as "genomic units" needs a bit of adjustments... #839

Closed tdelmont closed 5 years ago

tdelmont commented 6 years ago

Hi there, this is a general comment on potential future developments not related to a specific anvi'o version, or bug.

I recently realized that the pangenomic workflow of anvi'o could be improved at 3 levels in order to include (meta)transcriptomes as "genomic units". This is particularly useful in cases where a reference genome is not available, but transcriptomic data is.

Here are the 3 bottlenecks I identified so far:

(1) each contig is in theory a "gene fragment" for transcriptomic data, so it would be very useful to offer a special flag during anvi-gen-contigs-database so that anvi'o knows each contig should be identified as a single gene covering the entire sequence.

(2) this leads to another problem: we do not know the direction of the gene... Would it be possible to use a letter "x" (or something else) instead of "r" or "f", so that anvi'o knows we do not know the direction or even the frame of the gene? I realize this might add a lot of complications, but it is key for the third bottleneck.

(3) could it be possible to allow a flag for a blastx (instead of blastp) when computing the gene clusters. This way, all frames of each gene would be computed to find best matches. I might have missed something obvious, but as far as I can see, this could allow the making of relevant gene clusters compatible with both genomic and transcriptomic data...

I cannot share too many details, but I have interesting research avenues that could be explored contingent upon a few improvements to increase the flexibility of the anvi'o pangenomic workflow.

Does that make sense, and is this of interest to some of the anvi'o developers?

Thanks for reflecting on this request,

Tom

meren commented 6 years ago

Hey Tom,

(1) each contig is in theory a "gene fragment" for transcriptomic data, so it would be very useful to offer a special flag during anvi-gen-contigs-database so that anvi'o knows each contig should be identified as a single gene covering the entire sequence.

This is possible, and I can see how it could be useful. I did run into similar situations and ended up generating external gene calls files with partial gene calls that covered the entire contig. So it doesn't need a change in anvi'o in theory, but I agree that it would make things much easier.
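As an illustration of the workaround described above, here is a minimal Python sketch that writes one partial gene call spanning each contig of a transcriptome FASTA. Several things here are assumptions and not taken from this thread: the column set (`gene_callers_id`, `contig`, `start`, `stop`, `direction`, `partial`, `call_type`, `source`, `version`) follows my reading of the anvi'o external gene calls format and may differ between anvi'o versions, so check the documentation for the version you run. The direction is arbitrarily set to `f` because the true strand is unknown, and the `source` and `version` values are hypothetical placeholders.

```python
import csv
import io

# Assumed column layout of an anvi'o external gene calls file;
# verify against the documentation of your anvi'o version.
COLUMNS = ["gene_callers_id", "contig", "start", "stop",
           "direction", "partial", "call_type", "source", "version"]


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first token of the header as the contig name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def whole_contig_gene_calls(fasta_handle, out_handle,
                            source="transcript", version="v1"):
    """Write one partial gene call covering each contig; return the count."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    count = 0
    for gene_id, (contig, seq) in enumerate(read_fasta(fasta_handle)):
        # start=0 and stop=len(seq) span the whole contig.
        # direction "f" is a placeholder: the real strand is unknown.
        # partial=1 marks the call as incomplete; call_type=1 is assumed
        # to mean "coding".
        writer.writerow([gene_id, contig, 0, len(seq), "f", 1, 1,
                         source, version])
        count += 1
    return count


if __name__ == "__main__":
    demo = io.StringIO(">t1 some description\nATGGCC\nAAA\n>t2\nGGG\n")
    out = io.StringIO()
    whole_contig_gene_calls(demo, out)
    print(out.getvalue())
```

The resulting file would then be passed to `anvi-gen-contigs-database` as external gene calls. Because every call is marked as forward, contigs that actually sit on the reverse strand will still be translated incorrectly, which is the problem point (2) raises.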

(2) Would it be possible to use a letter "x" (or something else) instead of "r" or "f", so that anvi'o knows we do not know the direction or even the frame of the gene? I realize this might add a lot of complications, but it is key for the third bottleneck.

This is doable, but it would take a lot of time and energy we currently can't afford :(

(3) could it be possible to allow a flag for a blastx (instead of blastp) when computing the gene clusters. This way, all frames of each gene would be computed to find best matches.

This is also doable, and in fact does not depend on the first one, although it would be very computationally demanding. We can't blastx only some of the data. The current strategy would only allow us to do reciprocal blastp or reciprocal blastx over the entire genomes storage. If you have contigs databases that are filled with partial gene calls with unknown directions that can't be turned into amino acid sequences reliably, you could elect to use blastx, and wait four years, but I don't see why things wouldn't run smoothly :)
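To make "all frames" concrete, the sketch below translates a nucleotide sequence in the six reading frames a translated search has to consider (three offsets on each strand). It is a plain-Python illustration that assumes the standard genetic code, not a description of how blastx or anvi'o is implemented; the function names and the `+1`..`-3` frame labels are my own.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons
# enumerated in T, C, A, G order for each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to peptides."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frame_translations("ATGGCC").items():
        print(frame, peptide)
```

Each query is compared in six translations instead of one, which is part of why running this reciprocally over an entire genomes storage would be so much more expensive than blastp.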

We should think about that.

In an ideal world anvi'o would have a comparative genomics person dedicated to improving these aspects of the platform. We'll see.