Closed jolespin closed 5 years ago
If you add -1 to your start values, things will likely work out just fine :)
Please see the anvi'o string indexing convention here:
http://merenlab.org/2016/06/22/anvio-tutorial-v2/#external-gene-calls
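A small illustration of the "-1 on start" advice may help here. This is only a sketch: it assumes the gene caller reports 1-based, inclusive coordinates and that the target convention is Python-style slicing (0-based start, exclusive stop). The function name `prodigal_to_anvio` and the toy contig are made up for the example.

```python
def prodigal_to_anvio(start, stop):
    """Convert 1-based, inclusive coordinates to 0-based, end-exclusive ones.

    Only the start changes: the inclusive stop of a 1-based range has the
    same numeric value as the exclusive stop of the matching Python slice.
    """
    return start - 1, stop


# Toy contig (hypothetical); the gene occupies positions 3..11 when counted from 1.
contig = "CCATGAAATAGGG"
start, stop = prodigal_to_anvio(3, 11)

# With the converted values, plain slicing recovers the gene.
gene = contig[start:stop]
print(start, stop, gene)
```

Because the result is a half-open interval, `stop - start` equals the gene length, which is a quick sanity check to run on converted coordinates.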
Thanks, it looks like it's working now! I didn't realize it was python-style indexing but I should have known. Is it possible to give translated ORFs to ensure the correct codon table is being used in the translation or an option to specify which codon table is being used in this translation step?
Thanks, it looks like it's working now!
Great!
Is it possible to give translated ORFs to ensure the correct codon table is being used in the translation or an option to specify which codon table is being used in this translation step?
This is something we have thought about multiple times, but we have not addressed it in the codebase. Perhaps one solution would be for anvi-gen-contigs-database to accept an additional FASTA file of amino acid sequences, one for each gene call described in the external gene calls file. But I am not sure when we can realistically do it, as we are all swamped with time-sensitive coding tasks.
But you can always replace the amino acid sequences table in the contigs database. If you do it in Python, we can add it to the repo as a script (i.e., anvi-script-update-amino-acid-sequences; takes a contigs db as an argument, and a FASTA file of amino acids where each defline uniquely corresponds to a gene call id). Just an idea.
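To make the input requirement concrete, here is a minimal sketch of reading such a FASTA file and enforcing that every defline is a unique gene call id. It is not part of anvi'o: the function name `read_amino_acid_fasta` is hypothetical, and it assumes gene call ids are integers and that the first whitespace-separated token of the defline is the id.

```python
import io


def read_amino_acid_fasta(handle):
    """Return {gene_callers_id: sequence} from an open FASTA handle.

    Raises ValueError for empty, non-integer, or duplicate deflines.
    """
    sequences = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty defline in FASTA")
            current_id = int(fields[0])  # int() raises ValueError for non-integer ids
            if current_id in sequences:
                raise ValueError(f"duplicate gene call id in FASTA: {current_id}")
            sequences[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before the first defline")
        else:
            # Multi-line records are concatenated.
            sequences[current_id] += line
    return sequences


# Toy input; a real script would open the file given on the command line.
example = io.StringIO(">0\nMKT\nLLV\n>1\nMSE*\n")
aa_seqs = read_amino_acid_fasta(example)
print(aa_seqs)
```

Failing early on duplicate or malformed deflines keeps a bad FASTA file from silently overwriting sequences later.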
Best,
Yea, I can imagine there being a lot of moving pieces with a tool suite this expansive and such a large user-base.
If I have some extra time between projects, I can help by writing a script that does this and/or opening a pull request that adds --external-protein-sequences to anvi-gen-contigs-database.
Do you have any tutorials that show how to load the contig.db binary file into python (or is it just a pickled dict)? I can also go through the repo and find an example of where it is loaded and use that but thought I would ask first before I go down any rabbit holes.
Cheers
On May 21, 2019, at 7:15 AM, A. Murat Eren notifications@github.com wrote:
Closed #1165.
Hi Josh,
That would be excellent. If you were to be interested in writing a script, you could get some inspiration from previous scripts. Perhaps anvi-script-add-default-collection would be a good start.
In your script the main function would look something like this:
    import anvio.utils as utils
    import anvio.dbops as dbops
    import anvio.tables as t
    import anvio.filesnpaths as filesnpaths

    def main(args):
        # make sure things check out
        utils.is_contigs_db(args.contigs_db)
        filesnpaths.is_file_fasta_formatted(args.fasta)

        # get a contigs db instance
        contigs_db = dbops.ContigsDatabase(args.contigs_db)

        # learn gene caller ids in the contigs database
        gene_calls_in_db = contigs_db.db.get_single_column_from_table('genes_in_contigs', 'gene_callers_id')

        # learn gene caller ids in args.fasta
        (...)

        # make sure ids compare well (i.e., there are no ids that are in FASTA but not in contigs db)
        (...)

        # construct your sql query and update amino acid sequences in the amino acid sequences table
        # (there are many examples of _exec_many operations in the codebase)
        table_name = t.gene_amino_acid_sequences_table_name
        contigs_db.db._exec_many(SQL_UPDATE_QUERY_GOES_HERE)

        # disconnect
        contigs_db.disconnect()
If you decide to take a stab at it and if you need more input please let me know.
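As a rough illustration of the ID check and update steps in the outline above, the sketch below uses Python's built-in `sqlite3` module on an in-memory stand-in rather than a real contigs database. The table name `gene_amino_acid_sequences` and its columns `gene_callers_id` and `sequence` are assumptions made for the example, and the sequences are invented. In an actual anvi'o script, a query and value tuples of this kind would presumably go through `_exec_many` instead.

```python
import sqlite3

# In-memory stand-in for a contigs database (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_amino_acid_sequences (gene_callers_id INTEGER, sequence TEXT)")
conn.executemany(
    "INSERT INTO gene_amino_acid_sequences VALUES (?, ?)",
    [(0, "MKT*"), (1, "MSE*")],
)

# New translations keyed by gene call id, e.g. as parsed from a FASTA file.
new_sequences = {1: "MSEW"}

# Refuse to continue if the FASTA mentions ids the database does not know.
ids_in_db = {row[0] for row in conn.execute("SELECT gene_callers_id FROM gene_amino_acid_sequences")}
unknown_ids = set(new_sequences) - ids_in_db
if unknown_ids:
    raise ValueError(f"gene call ids not found in the database: {sorted(unknown_ids)}")

# One parameterized UPDATE per gene call; values are bound, not string-formatted.
conn.executemany(
    "UPDATE gene_amino_acid_sequences SET sequence = ? WHERE gene_callers_id = ?",
    [(sequence, gene_id) for gene_id, sequence in new_sequences.items()],
)
conn.commit()

updated = dict(conn.execute("SELECT gene_callers_id, sequence FROM gene_amino_acid_sequences"))
print(updated)
conn.close()
```

Gene calls that are absent from the FASTA file keep their existing sequences, so a partial file only overrides the entries it lists.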
I've parsed the ORF identifiers from prodigal v2.6.3.
Here's my anvi'o version:
Here's my output when trying to build contig database:
Here's my external-gene-calls:
Here's my prodigal ORF identifiers:
This is my function to parse the prodigal identifiers (yes, I should probably use regex instead but this was a quick test):
Also, how do I need to use --ignore-internal-stop-codons if the genetic code I'm using has recoded stop codons?
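The original parsing function did not survive in this copy of the thread. Purely as a sketch of the regex approach mentioned above, the following parses a prodigal-style FASTA header of the form `>ID # start # stop # strand # key=value;...` into the kind of fields an external gene calls file needs. The function name, the example header, and the field names `direction` (`f`/`r`) and `partial` (0/1) are assumptions for illustration; the start is shifted by -1 to follow the Python-style indexing discussed earlier in the thread.

```python
import re

# Header layout assumed: ">ID # start # stop # strand # key=value;key=value;..."
HEADER = re.compile(
    r"^>?(?P<orf_id>\S+) # (?P<start>\d+) # (?P<stop>\d+) # (?P<strand>-?1) # (?P<attrs>.*)$"
)


def parse_prodigal_header(header, gene_callers_id):
    """Turn one prodigal-style header into a dict describing a gene call."""
    match = HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {header!r}")
    attrs = dict(
        field.split("=", 1) for field in match["attrs"].split(";") if "=" in field
    )
    return {
        "gene_callers_id": gene_callers_id,
        # The ORF id is assumed to be the contig name plus "_<gene index>".
        "contig": match["orf_id"].rsplit("_", 1)[0],
        "start": int(match["start"]) - 1,  # 1-based inclusive -> 0-based
        "stop": int(match["stop"]),        # unchanged: exclusive end
        "direction": "f" if match["strand"] == "1" else "r",
        "partial": 0 if attrs.get("partial", "00") == "00" else 1,
    }


# Made-up header in the assumed layout.
header = ">contig_7_2 # 337 # 2799 # -1 # ID=7_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.531"
call = parse_prodigal_header(header, gene_callers_id=1)
print(call)
```

Splitting the contig name off with `rsplit("_", 1)` keeps contig names that themselves contain underscores intact.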