merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
426 stars 145 forks source link

[BUG] contigs.db has issues when inputing aa_sequence in the external-gene-call file #2310

Closed pcampiteli closed 1 month ago

pcampiteli commented 1 month ago

Short description of the problem

Greetings everyone, Meren asked to use this channel to discuss this topic from the discord server "https://discord.com/channels/1002537821212512296/1263173335479488534" Hello everyone,

I successfully conducted the Anvi'o pangenomics pipeline using a set of NCBI annotated genomes, alongside custom Funannotate genome annotations for assemblies without annotations. In my first attempt, I used the external gene calls without the aa_sequence column. Although I completed the analysis, it presented some issues. So I re-made the analysis, now resolving some issue related to specific files and also including the aa_sequence in the external-gene-call file. In this step I'm facing issues to create the contigs db

anvi'o version

Marie V8

System

high performance python server

Detailed description of the issue

To improve the analysis, I am redoing it, this time including the aa_sequence column in the gene-contigs-database. Initially, I extracted the protein sequences from a Funannotate output, but encountered errors in the files, likely due to the Galaxy platform scripts. To avoid these errors, I used gffread on the .gff3 files to extract the protein sequences and a script to include the aa_sequence in the external-gene-call.

However, when running the codes, I encountered the following error for almost 90% of the contigs databases I'm using in the pangenome analysis: Config Error: Bad news 😦 There seems to be at least one gene call in your external gene calls file that has an aminio acid sequence that is longer than the expected length of it given the start/stop positions of the gene call. This is certainly true for gene call number 4185 but anvi'o doesn't know if there are more of these in your file or not :/

This is puzzling since the protein sequences were directly extracted from the .gff3 files used to construct the external gene calls. I filtered mRNA and tRNA features and constructed the external gene call with this information. Has anyone experienced a similar issue or have any insights on how to resolve this problem? My goal is to make the pangenome analysis with the aa_sequence, in my mind it would improove the analysis and also I will be able to use the results in other softwares I'm planning to use.

command anvi-gen-contigs-database -f "/storage4/h.paulocampiteli/pangenome/genome_repo/GCA_000331835.2/GCA_000331835.2_ThamGD12_genome.fasta" --project-name ThamGD12 -o ThamGD12.db --external-gene-calls "/storage4/h.paulocampiteli/pangenome/anvio/genomes-db.2/full_external/GCA_000331835.2_ThamGD12_external_gene_call.txt" --ignore-internal-stop-codons --skip-predict-frame

Files in this drive folder: https://drive.google.com/drive/folders/1FVY4umkep34-mgtNmh2lH5Z5Bf9ib2xT?usp=sharing

meren commented 1 month ago

Hi @pcampiteli, thanks for sending these files.

Now I'm realizing that this is a eukaryotic genome. I'm not sure if the gene calls are truly representative of the genes of the organism or not, but I assume that they are since you have a GFF file. But providing a eukaryotic gene call to anvi'o through a single start and stop position, which may include multiple introns and exons, will result in nonsensical amino acid sequences when the nucleotides between the start and stop position are translated without any consideration of the introns.

That's why most of the gene calls in your external-genes-txt file are not multiplications of three. When you use --ignore-internal-stop-codons, anvi'o omits all the problems it runs into while calling genes, but that is not the solution :)

Unfortunately anvi'o currently does not know how to work with eukaryotic gene calls. One appropriate solution to push things forward is to ALSO provide amino acid sequences for predicted genes by adding them to your external-gene-calls-txt file as demonstrated here:

https://anvio.org/help/main/artifacts/external-gene-calls/

Apologies for not being able to come up with a better solution at this time.

pcampiteli commented 1 month ago

Hi meren, how are you?

yes these are eukaryotic genomes annotated trough funannotate. And the external gene call is exactly were's my problem. I tried include the aa sequence column but there is an incongruence between the mRNA start/stop and the aa sequence size which stops the program in the contigs-gen step. For 6 assemblies it worked and i'm inspecting what are their difference. For remaining 30 assemblies I can't complete this step. My guess is that the software used to produce the gene structure may have a different patterns to identify the feature with 0-1 index. Or the size of the aa_sequence is considering only the cds, when introns are present the start-stop would be bigger than the actual protein size which would be a problem to the software apparently. Either way I can't think of a solution on my own.

meren commented 1 month ago

Would you consider sending the external gene calls with AA sequences? Because it should work and if there's a bug I can try to fix it.

pcampiteli commented 1 month ago

Hello I should had included such file in the first post. my bad!

I've included a external_gene_calls_aa_sequence.txt and I also included on another folder the files of the contig_db made with the same funannotate annotation pipeline and had no issue. It may helps us understand heres the drive folder

Let me know if you need anything else or have any issue accessing the files in the folder Thanks for your time

meren commented 1 month ago

Hi @pcampiteli, I found the issue that leads to the AA sequences in external gene calls to not be utilized. This is due to a control that assumes every AA sequence in that file should have a length that agrees with the length of the gene call it belongs. This of course is not always the case for eukaryotic genomes, so I disengaged this piece of code in the commit above.

Now everything works well with the external gene calls file that contains the AA sequences:

anvi-gen-contigs-database -f GCA_000331835.2_ThamGD12_genome.fasta \
                          --external-gene-calls GCA_000331835.2_ThamGD12_external_gene_call_aa_seq.txt \
                          -o GCA_000331835.2_ThamGD12.db
Name .........................................: GCA_000331835_2_ThamGD12_genome
Description ..................................: No description is given
External gene calls ..........................: 11352 gene calls recovered and will be processed.

WARNING
===============================================
Anvi'o found amino acid sequences in your external gene calls file that match to
11352 of 11352 gene in it and will use these amino acid sequences for
everything.

EXTERNAL GENE CALLS PARSER REPORT
===============================================
Num gene calls in file .......................: 11,352
Non-coding gene calls ........................: 217
Partial gene calls ...........................: 0
Num amino acid sequences provided ............: 11,135
  - For complete gene calls ..................: 11,135
  - For partial gene calls ...................: 0
Frames predicted .............................: 0
  - For complete gene calls ..................: 0
  - For partial gene calls ...................: 0
Gene calls marked as NONCODING ...............: 0
  - For complete gene calls ..................: 0
  - For partial gene calls ...................: 0
Gene calls with internal stops ...............: 0
  - For complete gene calls ..................: 0
  - For partial gene calls ...................: 0

CONTIGS DB CREATE REPORT
===============================================
Split Length .................................: 20,000
K-mer size ...................................: 4
Skip gene calling? ...........................: False
External gene calls provided? ................: True
External gene calls file have AA sequences? ..: True
Proper frames will be predicted? .............: True
Ignoring internal stop codons? ...............: False
Splitting pays attention to gene calls? ......: True
Contigs with at least one gene call ..........: 598 of 745 (80.3%)
Contigs database .............................: A new database,  GCA_000331835.2_ThamGD12.db, has been created.
Number of contigs ............................: 745
Number of splits .............................: 2,084
Total number of nucleotides ..................: 38,431,721
Gene calling step skipped ....................: False
Splits broke genes (non-mindful mode) ........: False
Desired split length (what the user wanted) ..: 20,000
Average split length (what anvi'o gave back) .: 21,320

✓ anvi-gen-contigs-database took 0:00:56.176448

Files to revisit this issue in the future if necessary: GCA_000331835.2_ThamGD12_external_gene_call_aa_seq.txt.gz GCA_000331835.2_ThamGD12_genome.fasta.gz

pcampiteli commented 1 month ago

hey meren thanks for your help! Now my pangenome will be even better. But let just ask, what should I do to include this changes in my anvio version? Do i have to manually change the genecalls.py script? or just update anvio that this change will be applied?

meren commented 1 month ago

You will need to install anvio-dev, @pcampiteli, and the change will be already there for you to use the same files.

pcampiteli commented 1 month ago

Hey just a quick update, the changes worked fine and I could create the contigs db using aa sequences in the external gene call file!

Thanks for your help M.eren :)

meren commented 1 month ago

Excellent!