nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
317 stars 83 forks source link

Bug in how antiSMASH results are handled? #764

Closed IanDMedeiros closed 2 years ago

IanDMedeiros commented 2 years ago

funannotate v1.8.12

Funannotate annotate failed when I included the argument --antismash /path/to/annotate_misc/antiSMASH.results.gbk because the file did not exist. I dug into it and discovered that annotate was removing the file from annotate_misc. Is this a bug?

As far as I can tell, funannotate remote creates a file antiSMASH.results.gbk in annotate_misc (from remote.py):

298    # now grab the GBK files from folder as you will need just that for annotation, place in annotate_misc folder for auto-detection
299    anti_GBK = os.path.join(outputdir, jobid, os.path.basename(genbank))
300    final = os.path.join(outputdir, 'annotate_misc',
301                          'antiSMASH.results.gbk')
302    shutil.copyfile(anti_GBK, final)

Then,funannotate annotate deletes this file without using it (from annotate.py):

1079   # check if antiSMASH data is given, if so parse and reformat for annotations and cluster textual output
1080   antismash_input = os.path.join(
1081       outputdir, 'annotate_misc', 'antiSMASH.results.gbk')
1082   if args.antismash:
1083       if os.path.isfile(antismash_input):
1084           os.remove(antismash_input)
1085       shutil.copyfile(args.antismash, antismash_input)
nextgenusfs commented 2 years ago

args.antismash is from the command line option, it would be False if nothing was passed. So you don't have to pass the GBK file via the --antismash option, if the file is there it will be parsed (ie file will be put there from remote). otherwise it will use the file that exists, ie so if you pass an file there perhaps you re-ran it and it should not use existing file in that location.

IanDMedeiros commented 2 years ago

Ok—don't use --antismash if I just ran antismash with remote, got it.

With that I am almost to the end of the pipeline ("writing genome annotation table!"), but there is one more error I have no idea what to do with... Seems to suggest that there is an "E" in the nucleotide CDS sequence, but I checked and there is not. Is there something else that could cause this?

[Aug 11 11:21 PM]: OS: CentOS Stream 8, 46 cores, ~ 230 GB RAM. Python: 3.8.12 [Aug 11 11:21 PM]: Running 1.8.12 [Aug 11 11:21 PM]: Found existing output directory annotated_2022-08-10. Warning, will re-use any intermediate files found. [Aug 11 11:21 PM]: Parsing input files [Aug 11 11:21 PM]: Existing tbl found: annotated_2022-08-10/predict_results/Neophaeomoniellasp11164_11164.tbl [Aug 11 11:21 PM]: Adding Functional Annotation to Neophaeomoniellasp11164, NCBI accession: None [Aug 11 11:21 PM]: Annotation consists of: 10,251 gene models [Aug 11 11:21 PM]: 10,210 protein records loaded [Aug 11 11:21 PM]: Existing Pfam-A results found: annotated_2022-08-10/annotate_misc/annotations.pfam.txt [Aug 11 11:21 PM]: 12,247 annotations added [Aug 11 11:21 PM]: Running Diamond blastp search of UniProt DB version 2022_02 [Aug 11 11:21 PM]: 640 valid gene/product annotations from 986 total [Aug 11 11:21 PM]: Existing Eggnog-mapper results found: annotated_2022-08-10/annotate_misc/eggnog.emapper.annotations [Aug 11 11:21 PM]: Parsing EggNog Annotations [Aug 11 11:21 PM]: EggNog version parsed as 2.1.9 [Aug 11 11:21 PM]: 19,792 COG and EggNog annotations added [Aug 11 11:21 PM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.79 [Aug 11 11:21 PM]: 2,670 gene name and product description annotations added [Aug 11 11:21 PM]: Existing MEROPS results found: annotated_2022-08-10/annotate_misc/annotations.merops.txt [Aug 11 11:21 PM]: 368 annotations added [Aug 11 11:21 PM]: Existing CAZYme results found: annotated_2022-08-10/annotate_misc/annotations.dbCAN.txt [Aug 11 11:21 PM]: 456 annotations added [Aug 11 11:21 PM]: Existing BUSCO2 results found: annotated_2022-08-10/annotate_misc/annotations.busco.txt [Aug 11 11:21 PM]: 1,274 annotations added [Aug 11 11:21 PM]: Existing Phobius results found: annotated_2022-08-10/annotate_misc/phobius.results.txt [Aug 11 11:21 PM]: Existing SignalP results found: annotated_2022-08-10/annotate_misc/signalp.results.txt [Aug 11 11:21 PM]: 711 secretome and 2,373 transmembane annotations added [Aug 11 11:21 PM]: Parsing InterProScan5 XML file [Aug 11 11:24 PM]: Now parsing antiSMASH v6 results, finding SM clusters [Aug 11 11:24 PM]: Found 32 clusters, 76 biosynthetic enyzmes, and 124 smCOGs predicted by antiSMASH [Aug 11 11:24 PM]: Found 0 duplicated annotations, adding 85,689 valid annotations [Aug 11 11:24 PM]: Converting to final Genbank format, good luck! [Aug 11 11:25 PM]: Creating AGP file and corresponding contigs file [Aug 11 11:25 PM]: Cross referencing SM cluster hits with MIBiG database version 1.4 [Aug 11 11:26 PM]: Creating tab-delimited SM cluster output [Aug 11 11:26 PM]: Writing genome annotation table.

Traceback (most recent call last): File "/hpc/home/idm7/miniconda3/envs/funannotate/bin/funannotate", line 8, in sys.exit(main()) File "/hpc/home/idm7/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/funannotate.py", line 716, in main mod.main(arguments) File "/hpc/home/idm7/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/annotate.py", line 1637, in main lib.annotationtable(final_gbk, FUNDB, NoteHeaders, INTERPRO, File "/hpc/home/idm7/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/library.py", line 7672, in annotationtable CDSTranscript = RevComp(CDSTranscript) File "/hpc/home/idm7/miniconda3/envs/funannotate/lib/python3.8/site-packages/funannotate/library.py", line 1887, in RevComp cseq += rev_comp_lib[c] KeyError: 'E'

nextgenusfs commented 2 years ago

I have not seen this before, and yes it does suggest there is an E in the CDS transcript somehow. It should be writing that file line by line so if you look at the last record it wrote it should be the gene immediately after that causing the problem. Unfortunately you can't likely see the dictionary value it's trying to reverse complement. The data are loaded/parsed from the tbl file I believe and are actually all written again in annotate_results should also be a cds transcript file. So doesn't make a lot of sense if it wrote it correctly there but then failed later. You'd have to put a try/except to catch that error and then output the offending record.

IanDMedeiros commented 2 years ago

Took a look at the cds transcript file in annotate_results as you suggested, and... bizarrely, it's doing the thing from #763 again — printing the same sequence to the output over and over. When I ran annotate again without --rename, everything completed without errors.

nextgenusfs commented 2 years ago

Weird. Perhaps try to install the latest release?

IanDMedeiros commented 2 years ago

This turned out to be an error in how I was providing the locus tag prefix to funannotate; because of a poorly constructed table file, there was a hidden newline character in my variable $locus_tag that was messing everything else up. Closing since fixing the table file solved the problem.