merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
426 stars 145 forks source link

genes_in_contigs table should have a "type" column, which states whether the gene is coding or non-coding #1358

Closed ekiefl closed 4 years ago

ekiefl commented 4 years ago

In the current design, it is very hacky/difficult to work with gene collections when you want codon sequences and values like 'codon_order_in_gene' and 'base_pos_in_codon', because noncoding genes do not have these attributes. This issue is magnified greatly when external gene calls are imported during anvi-gen-contigs-database

meren commented 4 years ago

I fully agree. One solution is to literally add a new column to the genes table.

For instance, we can add a variable in constants, something like this:

known_gene_types = set([1: 'CODING',
                        2: 'NONCODING',
                        3: 'UNKNOWN')

We would also require a new column in external gene calls file, type, that would have to have for each gene one of the known types. During import we can make sure everything checks out. If there is no column, we can set the type of each gene to 1 if we can get a translated DNA, and 2 if they have internal stop codons or len(gene_seq) % 3 != 0 (of course when we do this, noncoding gene sequences that happened to satisfy the % 3 != 0 criterion and lack 'internal stop codons' would erroneously be set as coding sequences, but it is still better than what's happening now).

Then we would change every function in anvi'o that reads gene calls from db to accept an optional set variable for gene type(s). So we can get all genes and work with only on those that match to a specific type (i.e., calculating SCVs only for coding genes) or request a specific subset of genes depending on the context (i.e., 1 for pangenomics). This way rRNA and tRNA HMMs could also add their genes with type 2 to avoid downstream confusions.

We have to do this for sure.

meren commented 4 years ago

So the plan is to keep the is_partial flag, and implement known_gene_types as a new column.

ekiefl commented 4 years ago

This is solved with the above-mentioned PR