merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] anvi-gen-contigs-database: errors with importing AUGUSTUS v.3.4.0 external gene calls #1661

Closed ISonets closed 3 years ago

ISonets commented 3 years ago

Short description of the problem

One major problem when dealing with AUGUSTUS gene calls (more details below).

anvi'o version

Anvi'o .......................................: hope (v7)

Profile database .............................: 35
Contigs database .............................: 20
Pan database .................................: 14
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 2
tRNA-seq database ............................: 1

System info

The system is Ubuntu 16.04 LTS running on server.

Detailed description of the issue

Greetings! I am trying to use anvi'o for my Dekkera bruxellensis pangenomics project. I need to create a contigs-db, and to do so, I made gene predictions using AUGUSTUS v.3.4.0 (as I understand it, Prodigal isn't suitable because it's a prokaryotic gene finder). I successfully converted the GFF3 output into a tab-delimited TXT file using anvi-script-augustus-output-to-external-gene-calls. When I tried to create the contigs-db, I got this error message:

Config Error: Bad news :( There seems to be at least one gene call in your external gene calls
              file that has an aminio acid sequence that is longer than the expected length of
              it given the start/stop positions of the gene call. This is certainly true for  
              gene call number 3177 but anvi'o doesn't know if there are more of these in your
              file or not :/ 

I tried to fix it by simply removing one amino acid from this gene call (I just deleted the last aa in the call). I know this is the wrong way to do it, and I tried to figure out what was wrong with my file (maybe it was just one mistake?), but there were more! In total there were 29 similar messages. After removing 29 aa from the specific lines in my file, anvi'o started doing its job, but then... [Screenshot from 2021-02-01 22-06-18]

Something is definitely wrong, either with my files, or with anvi'o.
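
For anyone hitting the same error: anvi'o reports only the first offending gene call, so it can help to list all of them in one pass. The following is a minimal sketch, not anvi'o's actual check. It assumes the external gene calls file is tab-delimited with `gene_callers_id`, `start`, `stop`, and `aa_sequence` columns, and that `stop - start` gives the nucleotide length of the call. The function name `find_overlong_aa_sequences` is hypothetical.

```python
import csv
import io


def find_overlong_aa_sequences(handle):
    """Return (gene_callers_id, aa_length, max_codons) for every row whose
    amino acid sequence is longer than its start/stop span allows.

    Assumption: the span in nucleotides is stop - start.
    """
    offenders = []
    # QUOTE_NONE keeps stray quote characters in the data from confusing the parser.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        aa = (row.get("aa_sequence") or "").strip()
        if not aa:
            continue  # nothing to compare for rows without a sequence
        max_codons = (int(row["stop"]) - int(row["start"])) // 3
        if len(aa) > max_codons:
            offenders.append((row["gene_callers_id"], len(aa), max_codons))
    return offenders


# Tiny in-memory example with made-up rows (not taken from issues.zip).
example = (
    "gene_callers_id\tcontig\tstart\tstop\tdirection\tpartial\tcall_type\tsource\tversion\taa_sequence\n"
    "0\tc1\t0\t12\tf\t0\t1\tAUGUSTUS\tv3.4.0\tMKT\n"
    "1\tc1\t20\t29\tf\t0\t1\tAUGUSTUS\tv3.4.0\tMKTAY\n"
)
offenders = find_overlong_aa_sequences(io.StringIO(example))
print(offenders)
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` object. A long list of offenders suggests a systematic problem with the file rather than a handful of isolated mistakes.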

Files to reproduce

I have files for you to play with. In the archive you can find the FASTA file, the GFF3 file, and the TXT file after conversion. I hope a fix will come soon. issues.zip

meren commented 3 years ago

I successfully converted the GFF3 output into a tab-delimited TXT file using anvi-script-augustus-output-to-external-gene-calls

Are you sure you used anvi-script-augustus-output-to-external-gene-calls?

ISonets commented 3 years ago

Yes, I used this script to obtain the TXT file from the GFF3 output.

meren commented 3 years ago

Yes, this was due to a legitimate bug in the dbops module. It is now fixed in the main branch.

If you are following the active repository, you can run git pull and everything should work for your data.

Thanks for the test case.

meren commented 3 years ago

(if you want to try this solution, you will also need to re-run anvi-script-augustus-output-to-external-gene-calls before re-running anvi-gen-contigs-database).

ISonets commented 3 years ago

Thanks a lot! I will try to rerun my analysis ASAP.

ISonets commented 3 years ago

So, I confirm: the scripts and contigs-db generation work as they should! BUT there is one important note about anvi-script-augustus-output-to-external-gene-calls (at least in v.3.4.0). After it runs, the script leaves some descriptive text in the aa_sequence column of each gene call (example in the screenshot): [Screenshot from 2021-02-03 20-26-36]

If you try to use this file to generate your contigs-db, errors will occur because this descriptive text is extra and adds more characters. Anvi'o treats this text as amino acid sequence, and this "sequence" is way longer than it should be. You have to remove these characters using sed (or any other editor):

sed -i 's/].*//g' [replace this block with your txt file name]
# The extra text starts at "]Evidence of..." (as in the screenshot). This command removes everything from the "]" to the end of each line.
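
The sed command edits whole lines, so it would also truncate any other column that happens to contain a "]". A minimal sketch of a column-aware alternative is below. It assumes a tab-delimited file with a header row that names an `aa_sequence` column. The function name `strip_trailing_text` and the sample rows are hypothetical, and the header is shortened to three columns for brevity.

```python
def strip_trailing_text(lines, column="aa_sequence"):
    """Yield the lines of a tab-delimited file with everything from the
    first ']' onward removed from one column only."""
    lines = iter(lines)
    header = next(lines).rstrip("\n")
    idx = header.split("\t").index(column)
    yield header
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > idx:
            # Keep only the part before the first ']' in the target column.
            fields[idx] = fields[idx].split("]", 1)[0]
        yield "\t".join(fields)


# Made-up rows that mimic the problem described above.
raw_lines = [
    "gene_callers_id\tcontig\taa_sequence\n",
    "0\tc1\tMKT]Evidence of (hypothetical trailing text)\n",
    "1\tc1\tMSA\n",
]
cleaned = list(strip_trailing_text(raw_lines))
print("\n".join(cleaned))
```

Unlike `sed -i`, this does not modify the input in place. Write the yielded lines to a new file and keep the original for comparison.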

The result: [Screenshot from 2021-02-03 20-34-13]

After removing it, everything works just fine! Many thanks for such a quick fix!

P.S. @meren, could you please add this info to the help page for the script? I think it could be handy.

meren commented 3 years ago

@ISonets, I thought I fixed it in anvi-script-augustus-output-to-external-gene-calls via 82c672081565741d6cb6013c80255ee04bbdbb06 yesterday. Now I'm wondering if the fix wasn't comprehensive enough. Are you using the same file you sent me in issues.zip when you observe that additional information added to the AA sequences?

meren commented 3 years ago

Running this command on the files you sent me yesterday,

anvi-script-augustus-output-to-external-gene-calls -i CL01_copy.gff -o ext.txt

results in this file, which doesn't have that excess text after the AA sequences.
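
One way to confirm that a converted file is free of such excess text is to look for characters that cannot belong to a protein sequence. The sketch below is only a rough check under stated assumptions: the file is tab-delimited with `gene_callers_id` and `aa_sequence` columns, and valid sequences use only the 20 standard amino acid letters plus `X` and `*`. The function name and the sample rows are hypothetical.

```python
import re

# Assumption: 20 standard amino acids, plus X (unknown) and * (stop).
VALID_AA = re.compile(r"[ACDEFGHIKLMNPQRSTVWYX*]*")


def rows_with_unexpected_characters(lines, column="aa_sequence"):
    """Return the gene_callers_id of every row whose sequence column
    contains anything other than the allowed amino acid letters."""
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    seq_idx = header.index(column)
    id_idx = header.index("gene_callers_id")
    suspicious = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= seq_idx:
            continue  # row has no sequence column to inspect
        if not VALID_AA.fullmatch(fields[seq_idx].upper()):
            suspicious.append(fields[id_idx])
    return suspicious


# Made-up rows: the second one carries trailing descriptive text.
example_lines = [
    "gene_callers_id\tcontig\taa_sequence\n",
    "0\tc1\tMKTAYX\n",
    "1\tc1\tMKT]Evidence of (hypothetical trailing text)\n",
]
suspicious = rows_with_unexpected_characters(example_lines)
print(suspicious)
```

An empty list means no row contains characters outside the assumed alphabet. It does not prove that the sequences themselves are correct.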

ISonets commented 3 years ago

Oops, a misunderstanding. I sent you edited files (but didn't mention it, I am very sorry for this). The process was:

  1. run AUGUSTUS => .gff file
  2. .gff to .txt using the augustus-output-to-external-gene-calls script => file with this additional info
  3. remove this info using the sed command I wrote => .txt file suitable for contigs-db generation
  4. try to generate contigs-db => lots of errors, just like in the issue
  5. try to manually correct these errors => final error (see screenshot in issue)

Results now:

  1. run AUGUSTUS => .gff file
  2. .gff to .txt using the augustus-output-to-external-gene-calls script => file with this additional info
  3. remove this info using the sed command I wrote => .txt file suitable for contigs-db generation
  4. try to generate contigs-db => works perfectly.

meren commented 3 years ago

@ISonets, can you please send me the GFF file you get from the very first step?

ISonets commented 3 years ago

Sure, I will send you all source files ASAP.

ISonets commented 3 years ago

Hmm, I think this is not your fault. I checked your commit about augustus (82c7620), and I don't have these changes. I have an explanation: I use anvi'o on a server over SSH, and I don't have root on this server. So I'm unable to clone the git repository because libcurl3 isn't installed (and I'm unable to install it). Instead, I manually edited dbops.py and the augustus script as described in your 2 commits addressing this issue. It's a quick-and-dirty solution, I know, but I can't find any other way to do my task.

meren commented 3 years ago

I see. Then you should be able to copy the program anvi-script-augustus-output-to-external-gene-calls from the main repository into your home directory, like this,

wget https://raw.githubusercontent.com/merenlab/anvio/master/sandbox/anvi-script-augustus-output-to-external-gene-calls

and run it the following way:

python anvi-script-augustus-output-to-external-gene-calls -i XXX -o XXX