merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
413 stars 142 forks source link

external gene call file created from more than two contigs #2219

Closed smta11 closed 4 months ago

smta11 commented 4 months ago

Hi anvi'o team,

Thank you so much for developing such a fantastic program. I am trying to make contig databases using external gene call files created from .gbk using anvi-script-process-genbank. It worked perfectly for single contig, but I am getting the error when the genome sequence is composed of more than two contigs.

anvi'o version

Anvi'o .......................................: marie (v8) Python .......................................: 3.10.12

Profile database .............................: 38 Contigs database .............................: 21 Pan database .................................: 16 Genome data storage ..........................: 7 Auxiliary data storage .......................: 2 Structure database ...........................: 2 Metabolic modules database ...................: 4 tRNA-seq database ............................: 2

System info

Ubuntu 20.04.6 LTS anvi'o installed by conda

Detailed description of the issue

Here is the output and error I got:

Input FASTA file .............................: /home/XX/GCF_005281615.1-renamed.fasta

Anvi'o made things up for you

You are generating a new anvi'o contigs database, but you are not specifying a project name for it. FINE. Anvi'o, in desperation, will use the input file name to set the project name for this contigs database (i.e., 'GCF_005281615_1_renamed'). If you are not happy with that, feel free to kill and restart this process. If you are not happy with this name, but you don't like killing things either, maybe next time you should either name your FASTA files better, or use the --project-name parameter to set your desired name.

Name .........................................: GCF_005281615_1_renamed Description ..................................: No description is given External gene calls ..........................: 5119 gene calls recovered and will be processed.

WARNING

Anvi'o found amino acid sequences in your external gene calls file that match to 5119 of 5119 gene in it and will use these amino acid sequences for everything.

EXTERNAL GENE CALLS PARSER REPORT

Num gene calls in file .......................: 5,119 Non-coding gene calls ........................: 0 Partial gene calls ...........................: 0 Num amino acid sequences provided ............: 5,119

CONTIGS DB CREATE REPORT

Split Length .................................: 20,000 K-mer size ...................................: 4 Skip gene calling? ...........................: False External gene calls provided? ................: True External gene calls file have AA sequences? ..: True Proper frames will be predicted? .............: True Ignoring internal stop codons? ...............: False Splitting pays attention to gene calls? ......: True [14 Feb 24 04:52:42 The South Loop] Contig "2" has 1036 genes, and (...) ETA: 25s ✖ anvi-gen-contigs-database encountered an error after 0:00:03.134130 Traceback (most recent call last): File "/home/XX/.conda/envs/anvio-8/lib/python3.10/site-packages/anvio/dbops.py", line 4674, in compress_nt_position_info nt_position_info_list[nt_position] = 4 IndexError: list assignment index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/XX/.conda/envs/anvio-8/bin/anvi-gen-contigs-database", line 63, in main(args) File "/home/XX/.conda/envs/anvio-8/lib/python3.10/site-packages/anvio/terminal.py", line 915, in wrapper program_method(*args, **kwargs) File "/home/XX/.conda/envs/anvio-8/bin/anvi-gen-contigs-database", line 32, in main a.create(args) File "/home/XX/.conda/envs/anvio-8/lib/python3.10/site-packages/anvio/dbops.py", line 4529, in create nt_position_info_list = self.compress_nt_position_info(contig_name, contig_length, genes_in_contig, genes_in_contigs_dict) File "/home/XX/.conda/envs/anvio-8/lib/python3.10/site-packages/anvio/dbops.py", line 4685, in compress_nt_position_info nt_position_info_list[nt_position] = 8 IndexError: list assignment index out of range

Files / commands to reproduce the issue

Command: anvi-gen-contigs-database -f GCF_005281615.1-renamed.fasta -o Rhodo_def.db --external-gene-calls GCF_005281615_genecall-external-gene-calls_modified.txt

fasta and gene call file are available from the following link: https://www.dropbox.com/scl/fi/7ub2p8d6kj7w4qyk2fsv3/report.zip?rlkey=ar9aih265vi6tcdkoax5vvuij&dl=0

ivagljiva commented 4 months ago

Hey @smta11 , thanks for the test dataset. I was able to reproduce your error in both anvi'o v8 and anvi'o-dev.

I did some debugging and got the following information about the gene call that is failing:

gene caller id:  ...................................: 2,792
gene start:  .................................: 1,057,049
gene stop:  ..................................: 1,057,235
contig:  .....................................: c_000000000002
contig length:  ..............................: 50,020

Clearly, the start and stop positions of the gene (taken from the external gene calls file) are much larger than the length of the contig it is on, which is causing an IndexError because this part of the code stores nucleotide information at the per-contig level. That is, our list of nucleotide info has a length of 50,020, but we are trying to place information at positions between 1,057,049 and 1,057,235 (which does not work).

Not all genes on contig c_000000000002 have start and stop positions greater than the contig's length. The first 48 genes on the contig have expected ranges, but the next 988 genes are all out of range:

$ sqlite3 Rhodo_def.db "select * from genes_in_contigs where contig='c_000000000002' and start<50020" | wc -l
      48
$ sqlite3 Rhodo_def.db "select * from genes_in_contigs where contig='c_000000000002' and start>=50020" | wc -l
     988

The next question is, is this issue coming from anvi-script-process-genbank, or from the original GenBank file? I downloaded the Genbank file for accession GCF_005281615.1 from NCBI and ran the program like this:

anvi-script-process-genbank -i ncbi_dataset/data/GCF_005281615.1/genomic.gbff -O GCF_005281615

I verified that the only difference between the contig sequences provided in the test datapack and in the output FASTA file from this program had to do with the names of the contig headers (since the former is reformatted and the latter is not):

 diff GCF_005281615-contigs.fa ../report/GCF_005281615.1-renamed.fasta
1c1
< >NZ_SZZM01000001
---
> >c_000000000001
3c3
< >NZ_SZZM01000010
---
(......)

So I am working with the same input contig sequences, at least.

I directly fed the output of anvi-script-process-genbank into anvi-gen-contigs-database like this:

anvi-gen-contigs-database -f GCF_005281615-contigs.fa -o CONTIGS.db --external-gene-calls GCF_005281615-external-gene-calls.txt

And that command worked perfectly on all 20 contigs. :/

This investigation makes me think that something went wrong during the reformatting process, @smta11. How did you update the external gene calls file with the new contig headers after you ran anvi-script-reformat-fasta? Is it possible that two different contigs could have been mismatched? If so, then this error could be coming from a gene call that has been incorrectly assigned to a much shorter contig, leading to the IndexError we see above.

ivagljiva commented 4 months ago

Regardless, I've added a sanity check for this case (in anvio-dev) so that you get a nice output error message and not a code traceback. Now when I run anvi-gen-contigs-database -f GCF_005281615.1-renamed.fasta -o Rhodo_def.db --external-gene-calls GCF_005281615_genecall-external-gene-calls_modified.txt, I get the following error:

Config Error: Something is wrong with your external gene calls file. It seems that the gene
              with gene callers id 2792, on contig c_000000000002, has positions that go
              beyond the length of the contig. Specifically, the length of the contig is
              50020, but the gene starts at position 1057049 and goes to position 1057235.
              We've removed the partially-created contigs database for you (but you can see if
              if you re-run your command with the `--debug` flag).
smta11 commented 4 months ago

Thank you so much @ivagljiva for your extra work on this problem. According to your comments, I have checked my outputs and now understand why I get this problem; I see that the order of contigs were changed after runing anvi-script-reformat-fastawith -l 0 --simplify-names options. For example, a contig originally labeled as NZ_SZZM01000002.1 (contig 2) was changed to c_000000000012 (contig 12). Can you please teach me how you match these labels after running runing anvi-script-reformat-fasta with--simplify-names option?

ivagljiva commented 4 months ago

Great, I'm so glad that we figured out the problem. :)

Usually its best to run anvi-script-reformat-fasta --simplify-names with the --report-file flag, so that you also get a text output that matches the original contig names to their new labels. For example, when I run that on my GCF_005281615 FASTA file, it looks like this:

c_000000000001  NZ_SZZM01000001
c_000000000002  NZ_SZZM01000010
c_000000000003  NZ_SZZM01000011
(.....)

(if you didn't do that, it is probably okay, because this is a small enough genome that you could probably match up the 20 contigs manually).

Then, you can use the report file to replace each old contig name with its corresponding new name in the external gene calls file. I'd probably do it in Python (but you could use any strategy you feel comfortable with). For example, here is how I did it for GCF_005281615 using the Pandas package:

import pandas as pd
ext_genes = pd.read_csv("GCF_005281615-external-gene-calls.txt", sep="\t", index_col=0)
old_name_to_new_name = {}
with open('reformat_report.txt', 'r') as f:
  for line in f.readlines():
    columns = line.strip().split()
    old_name_to_new_name[columns[1]] = columns[0]
ext_genes.contig.replace(old_name_to_new_name, inplace=True)
ext_genes.to_csv("reformatted_external_gene_calls.txt", sep="\t")

I hope that helps. I will close this issue for now, but feel free to reopen it if you feel all your questions weren't addressed :)

smta11 commented 4 months ago

Hi @ivagljiva, it went perfectly after correcting the contig number. Thank you so much for your help! I still have genome files which have more than 200 contigs, so I will try anvi-script-reformat-fasta --simplify-names --report-file to get the label file. I really appreciate your assistance!