Closed ekiefl closed 4 years ago
I fully agree. One solution is to literally add a new column to the genes table.
For instance, we can add a variable in constants, something like this:
known_gene_types = set([1: 'CODING',
2: 'NONCODING',
3: 'UNKNOWN')
We would also require a new column in external gene calls file, type
, that would have to have for each gene one of the known types. During import we can make sure everything checks out. If there is no column, we can set the type of each gene to 1
if we can get a translated DNA, and 2
if they have internal stop codons or len(gene_seq) % 3 != 0
(of course when we do this, noncoding gene sequences that happened to satisfy the % 3 != 0
criterion and lack 'internal stop codons' would erroneously be set as coding sequences, but it is still better than what's happening now).
Then we would change every function in anvi'o that reads gene calls from db to accept an optional set
variable for gene type(s). So we can get all genes and work with only on those that match to a specific type (i.e., calculating SCVs only for coding genes) or request a specific subset of genes depending on the context (i.e., 1
for pangenomics). This way rRNA and tRNA HMMs could also add their genes with type 2
to avoid downstream confusions.
We have to do this for sure.
So the plan is to keep the is_partial
flag, and implement known_gene_types
as a new column.
This is solved with the above-mentioned PR
In the current design, it is very hacky/difficult to work with gene collections when you want codon sequences and values like 'codon_order_in_gene' and 'base_pos_in_codon', because noncoding genes do not have these attributes. This issue is magnified greatly when external gene calls are imported during anvi-gen-contigs-database