Closed smta11 closed 4 months ago
Hey @smta11 , thanks for the test dataset. I was able to reproduce your error in both anvi'o v8 and anvi'o-dev.
I did some debugging and got the following information about the gene call that is failing:
gene caller id: ...................................: 2,792
gene start: .................................: 1,057,049
gene stop: ..................................: 1,057,235
contig: .....................................: c_000000000002
contig length: ..............................: 50,020
Clearly, the start and stop positions of the gene (taken from the external gene calls file) are much larger than the length of the contig it is on, which is causing an IndexError
because this part of the code stores nucleotide information at the per-contig level. That is, our list of nucleotide info has a length of 50,020, but we are trying to place information at positions between 1,057,049 and 1,057,235 (which does not work).
Not all genes on contig c_000000000002
have start and stop positions greater than the contig's length. The first 48 genes on the contig have expected ranges, but the next 988 genes are all out of range:
$ sqlite3 Rhodo_def.db "select * from genes_in_contigs where contig='c_000000000002' and start<50020" | wc -l
48
$ sqlite3 Rhodo_def.db "select * from genes_in_contigs where contig='c_000000000002' and start>=50020" | wc -l
988
The next question is, is this issue coming from anvi-script-process-genbank
, or from the original GenBank file? I downloaded the Genbank file for accession GCF_005281615.1 from NCBI and ran the program like this:
anvi-script-process-genbank -i ncbi_dataset/data/GCF_005281615.1/genomic.gbff -O GCF_005281615
I verified that the only difference between the contig sequences provided in the test datapack and in the output FASTA file from this program had to do with the names of the contig headers (since the former is reformatted and the latter is not):
diff GCF_005281615-contigs.fa ../report/GCF_005281615.1-renamed.fasta
1c1
< >NZ_SZZM01000001
---
> >c_000000000001
3c3
< >NZ_SZZM01000010
---
(......)
So I am working with the same input contig sequences, at least.
I directly fed the output of anvi-script-process-genbank
into anvi-gen-contigs-database
like this:
anvi-gen-contigs-database -f GCF_005281615-contigs.fa -o CONTIGS.db --external-gene-calls GCF_005281615-external-gene-calls.txt
And that command worked perfectly on all 20 contigs. :/
This investigation makes me think that something went wrong during the reformatting process, @smta11. How did you update the external gene calls file with the new contig headers after you ran anvi-script-reformat-fasta
? Is it possible that two different contigs could have been mismatched? If so, then this error could be coming from a gene call that has been incorrectly assigned to a much shorter contig, leading to the IndexError we see above.
Regardless, I've added a sanity check for this case (in anvio-dev
) so that you get a nice output error message and not a code traceback. Now when I run anvi-gen-contigs-database -f GCF_005281615.1-renamed.fasta -o Rhodo_def.db --external-gene-calls GCF_005281615_genecall-external-gene-calls_modified.txt
, I get the following error:
Config Error: Something is wrong with your external gene calls file. It seems that the gene
with gene callers id 2792, on contig c_000000000002, has positions that go
beyond the length of the contig. Specifically, the length of the contig is
50020, but the gene starts at position 1057049 and goes to position 1057235.
We've removed the partially-created contigs database for you (but you can see if
if you re-run your command with the `--debug` flag).
Thank you so much @ivagljiva for your extra work on this problem. According to your comments, I have checked my outputs and now understand why I get this problem; I see that the order of contigs were changed after runing anvi-script-reformat-fasta
with -l 0 --simplify-names
options. For example, a contig originally labeled as NZ_SZZM01000002.1 (contig 2) was changed to c_000000000012 (contig 12). Can you please teach me how you match these labels after running runing anvi-script-reformat-fasta
with--simplify-names option
?
Great, I'm so glad that we figured out the problem. :)
Usually its best to run anvi-script-reformat-fasta --simplify-names
with the --report-file
flag, so that you also get a text output that matches the original contig names to their new labels. For example, when I run that on my GCF_005281615 FASTA file, it looks like this:
c_000000000001 NZ_SZZM01000001
c_000000000002 NZ_SZZM01000010
c_000000000003 NZ_SZZM01000011
(.....)
(if you didn't do that, it is probably okay, because this is a small enough genome that you could probably match up the 20 contigs manually).
Then, you can use the report file to replace each old contig name with its corresponding new name in the external gene calls file. I'd probably do it in Python (but you could use any strategy you feel comfortable with). For example, here is how I did it for GCF_005281615 using the Pandas package:
import pandas as pd
ext_genes = pd.read_csv("GCF_005281615-external-gene-calls.txt", sep="\t", index_col=0)
old_name_to_new_name = {}
with open('reformat_report.txt', 'r') as f:
for line in f.readlines():
columns = line.strip().split()
old_name_to_new_name[columns[1]] = columns[0]
ext_genes.contig.replace(old_name_to_new_name, inplace=True)
ext_genes.to_csv("reformatted_external_gene_calls.txt", sep="\t")
I hope that helps. I will close this issue for now, but feel free to reopen it if you feel all your questions weren't addressed :)
Hi @ivagljiva, it went perfectly after correcting the contig number. Thank you so much for your help! I still have genome files which have more than 200 contigs, so I will try anvi-script-reformat-fasta --simplify-names --report-file
to get the label file. I really appreciate your assistance!
Hi anvi'o team,
Thank you so much for developing such a fantastic program. I am trying to make contig databases using external gene call files created from .gbk using anvi-script-process-genbank. It worked perfectly for single contig, but I am getting the error when the genome sequence is composed of more than two contigs.
anvi'o version
Anvi'o .......................................: marie (v8) Python .......................................: 3.10.12
Profile database .............................: 38 Contigs database .............................: 21 Pan database .................................: 16 Genome data storage ..........................: 7 Auxiliary data storage .......................: 2 Structure database ...........................: 2 Metabolic modules database ...................: 4 tRNA-seq database ............................: 2
System info
Ubuntu 20.04.6 LTS anvi'o installed by conda
Detailed description of the issue
Here is the output and error I got:
Input FASTA file .............................: /home/XX/GCF_005281615.1-renamed.fasta
Anvi'o made things up for you
You are generating a new anvi'o contigs database, but you are not specifying a project name for it. FINE. Anvi'o, in desperation, will use the input file name to set the project name for this contigs database (i.e., 'GCF_005281615_1_renamed'). If you are not happy with that, feel free to kill and restart this process. If you are not happy with this name, but you don't like killing things either, maybe next time you should either name your FASTA files better, or use the
--project-name
parameter to set your desired name.Name .........................................: GCF_005281615_1_renamed Description ..................................: No description is given External gene calls ..........................: 5119 gene calls recovered and will be processed.
WARNING
Anvi'o found amino acid sequences in your external gene calls file that match to 5119 of 5119 gene in it and will use these amino acid sequences for everything.
EXTERNAL GENE CALLS PARSER REPORT
Num gene calls in file .......................: 5,119 Non-coding gene calls ........................: 0 Partial gene calls ...........................: 0 Num amino acid sequences provided ............: 5,119
CONTIGS DB CREATE REPORT
Split Length .................................: 20,000 K-mer size ...................................: 4 Skip gene calling? ...........................: False External gene calls provided? ................: True External gene calls file have AA sequences? ..: True Proper frames will be predicted? .............: True Ignoring internal stop codons? ...............: False Splitting pays attention to gene calls? ......: True [14 Feb 24 04:52:42 The South Loop] Contig "2" has 1036 genes, and (...) ETA: 25s ✖ anvi-gen-contigs-database encountered an error after 0:00:03.134130 Traceback (most recent call last): File "/home/XX/.conda/envs/anvio-8/lib/python3.10/site-packages/anvio/dbops.py", line 4674, in compress_nt_position_info nt_position_info_list[nt_position] = 4 IndexError: list assignment index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/XX/.conda/envs/anvio-8/bin/anvi-gen-contigs-database", line 63, in
main(args)
File "/home/XX/.conda/envs/anvio-8/lib/python3.10/site-packages/anvio/terminal.py", line 915, in wrapper
program_method(*args, **kwargs)
File "/home/XX/.conda/envs/anvio-8/bin/anvi-gen-contigs-database", line 32, in main
a.create(args)
File "/home/XX/.conda/envs/anvio-8/lib/python3.10/site-packages/anvio/dbops.py", line 4529, in create
nt_position_info_list = self.compress_nt_position_info(contig_name, contig_length, genes_in_contig, genes_in_contigs_dict)
File "/home/XX/.conda/envs/anvio-8/lib/python3.10/site-packages/anvio/dbops.py", line 4685, in compress_nt_position_info
nt_position_info_list[nt_position] = 8
IndexError: list assignment index out of range
Files / commands to reproduce the issue
Command: anvi-gen-contigs-database -f GCF_005281615.1-renamed.fasta -o Rhodo_def.db --external-gene-calls GCF_005281615_genecall-external-gene-calls_modified.txt
fasta and gene call file are available from the following link: https://www.dropbox.com/scl/fi/7ub2p8d6kj7w4qyk2fsv3/report.zip?rlkey=ar9aih265vi6tcdkoax5vvuij&dl=0