nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

KeyError: u'locus_tag' on funannotate compare #362

Closed atiweb closed 4 years ago

atiweb commented 4 years ago

Are you using the latest release? Using version 1.7.2

Describe the bug

[02:14 PM]: Found 7 clusters, 14 biosynthetic enyzmes, and 19 smCOGs predicted by antiSMASH
[02:14 PM]: Found 0 duplicated annotations, adding 31,745 valid annotations
[02:14 PM]: Converting to final Genbank format, good luck!
[02:17 PM]: Creating AGP file and corresponding contigs file
[02:17 PM]: Cross referencing SM cluster hits with MIBiG database version 1.4
[02:17 PM]: Creating tab-delimited SM cluster output
[02:17 PM]: Writing genome annotation table.
Traceback (most recent call last):
  File "/usr/local/bin/funannotate", line 4, in <module>
    __import__('pkg_resources').run_script('funannotate==1.7.2', 'funannotate')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 658, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1438, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/funannotate-1.7.2-py2.7.egg/EGG-INFO/scripts/funannotate", line 657, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/funannotate-1.7.2-py2.7.egg/EGG-INFO/scripts/funannotate", line 647, in main
    mod.main(arguments)
  File "/usr/local/lib/python2.7/dist-packages/funannotate-1.7.2-py2.7.egg/funannotate/annotate.py", line 1385, in main
    lib.annotationtable(final_gbk, FUNDB, final_annotation)
  File "/usr/local/lib/python2.7/dist-packages/funannotate-1.7.2-py2.7.egg/funannotate/library.py", line 6367, in annotationtable
    ID = f.qualifiers['locus_tag'][0]
KeyError: u'locus_tag'

What command did you issue?

/usr/local/bin/funannotate annotate --input /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out --out /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/annotation_out --species Globisporangium_ultimum --strain DAOM_BR144 --force --antismash /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/antismash_OUT/Globisporangium_ultimum_DAOM_BR144.gbk --iprscan /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/annotate_misc/iprscan.xml --cpus 15

This was run after funannotate update. [funannotate-annotate.32309.log]

Logfiles (https://github.com/nextgenusfs/funannotate/files/4022165/funannotate-annotate.32309.log) augustus-parallel.log augustus_training.log funannotate-annotate.log funannotate-EVM.log funannotate-p2g.log funannotate-predict.log funannotate-train.log funannotate-trinity.log funannotate-update.log phobius.log

OS/Install Information

Checking dependencies for 1.7.2

You are running Python v 2.7.17. Now checking python packages...
biopython: 1.73
goatools: 0.9.5
matplotlib: 2.2.4
natsort: 6.0.0
numpy: 1.13.3
pandas: 0.24.2
psutil: 5.6.2
requests: 2.22.0
scikit-learn: 0.20.3
scipy: 1.2.2
seaborn: 0.9.0
All 11 python packages installed

You are running Perl v 5.026001. Now checking perl modules...
Bio::Perl: 1.007002
Carp: 1.42
Clone: 0.39
DBD::SQLite: 1.62
DBD::mysql: 4.046
DBI: 1.64
DB_File: 1.852
Data::Dumper: 2.167
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.49
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.31
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 2.62
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.28
threads: 2.15
threads::shared: 1.56
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/mnt/sdb/funannotate/DB_funannotate
$PASAHOME=/mnt/sdb/funannotate/PASApipeline
$TRINITYHOME=/mnt/sdb/funannotate/trinityrnaseq-v2.9.0
$EVM_HOME=/mnt/sdb/funannotate/EVidenceModeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/mnt/sdb/funannotate/Augustus/config
$GENEMARK_PATH=/mnt/sdb/funannotate/gm_et_linux_64/gmes_petap
All 6 environmental variables are set

Checking external dependencies...
PASA: 2.3.3
CodingQuarry: 2.0
Trinity: 2.9.0
augustus: 3.3.2
bamtools: bamtools 2.5.1
bedtools: bedtools v2.26.0
blat: BLAT v36x2
diamond: 0.9.24
emapper.py: 2.0.1
ete3: 3.1.1
exonerate: exonerate 2.4.0
fasta: no way to determine
glimmerhmm: 3.0.4
gmap: 2017-11-15
gmes_petap.pl: 4.38
hisat2: 2.1.0
hmmscan: HMMER 3.2.1 (June 2018)
hmmsearch: HMMER 3.2.1 (June 2018)
java: 11.0.5
kallisto: 0.46.0
mafft: v7.310 (2017/Mar/17)
makeblastdb: makeblastdb 2.9.0+
minimap2: 2.17-r943-dirty
proteinortho: 6.0.10
pslCDnaFilter: no way to determine
salmon: salmon 0.14.0
samtools: samtools 1.9-66-gc15e884
signalp: 4.1
snap: 2006-07-28
stringtie: 1.3.6
tRNAscan-SE: 1.3.1 (January 2012)
tantan: tantan 13
tbl2asn: no way to determine, likely 25.X
tblastn: tblastn 2.9.0+
trimal: trimAl v1.4.rev22 build[2015-05-21]
trimmomatic: 0.39
All 36 external dependencies are installed

To keep working, I edited library.py at line 6366, changing: if f.type == 'CDS': to: if f.type == 'CDS' and 'locus_tag' in f.qualifiers:

That works around the issue but does not address its cause, so I am sharing it here to help improve the software. Thanks.
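A minimal sketch of the guard described above, written in Python 3 with a hypothetical stand-in `Feature` class instead of the real feature objects that library.py iterates over. The names `Feature` and `cds_locus_tags` are illustrative assumptions, not funannotate code. It shows a CDS without a `locus_tag` qualifier being set aside for reporting rather than raising KeyError:

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for a parsed GenBank feature."""
    type: str
    qualifiers: dict = field(default_factory=dict)


def cds_locus_tags(features):
    """Return (tags, skipped) for the CDS features in `features`.

    CDS features lacking a 'locus_tag' qualifier are collected in
    `skipped` instead of raising KeyError.
    """
    tags, skipped = [], []
    for f in features:
        if f.type != 'CDS':
            continue
        if 'locus_tag' in f.qualifiers:
            tags.append(f.qualifiers['locus_tag'][0])
        else:
            skipped.append(f)
    return tags, skipped


features = [
    Feature('CDS', {'locus_tag': ['DAOMBR144_012982']}),
    Feature('CDS', {'protein_id': ['ncbi:DAOMBR144_012983-T1']}),  # no locus_tag
]
tags, skipped = cds_locus_tags(features)
```

Collecting the skipped features, rather than silently dropping them, keeps the underlying problem visible while the run continues.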

nextgenusfs commented 4 years ago

Thanks for posting on here and including the logfiles -- A+ for bug report!

So I see this error in the annotate log file:

[01/03/20 14:37:41]: Existing tbl found: /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/update_results/Globisporangium_ultimum_DAOM_BR144.tbl
[01/03/20 14:37:43]: putative transcript from ncbi:DAOMBR144_012983-T1 has no ID
(ncbi:DAOMBR144_012983-T1 None ncbi:DAOMBR144_012983-T1)

Would you be able to find this record in the NCBI tbl format? Need to trace down how this model was created without an ID, that file is: /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/update_results/Globisporangium_ultimum_DAOM_BR144.tbl

Most likely this is the model missing 'locus_tag' -- so if we can fix this at its source that would be best.

There is another error with IPRscan parsing, see:

[01/03/20 14:37:58]: Parsing InterProScan5 XML file
[01/03/20 14:37:58]: /usr/bin/python /usr/local/lib/python2.7/dist-packages/funannotate-1.7.2-py2.7.egg/funannotate/aux_scripts/iprscan2annotations.py /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/annotate_misc/iprscan.xml /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/annotate_misc/annotations.iprscan.txt
[01/03/20 14:37:58]: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/funannotate-1.7.2-py2.7.egg/funannotate/aux_scripts/iprscan2annotations.py", line 32, in <module>
    for _, elem in tree:
  File "<string>", line 91, in next
cElementTree.ParseError: mismatched tag: line 46, column 4

How did you run IPRscan? There is a chance this is because of the locus_tag issue above.

atiweb commented 4 years ago

The file /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/update_results/Globisporangium_ultimum_DAOM_BR144.tbl was generated by running funannotate update after predict, which was run after train. I have already shared all of the logs. Here are the gbk and tbl files generated by funannotate update:

Globisporangium_ultimum_DAOM_BR144.gbk.tar.gz Globisporangium_ultimum_DAOM_BR144.tbl.tar.gz

I'm running iprscan locally, with this command:

funannotate iprscan --input "/fun_out" \
    --method local \
    --num 500 \
    --iprscan_path /mnt/sdb1/funannotate/interproscan/interproscan-5.38-76.0/interproscan.sh \
    --cpus 4

The path shows the version used. Here is the output: iprscan.xml.tar.gz

I also think the error is related. Glad to help...

atiweb commented 4 years ago

Ahh, funannotate train was fed the genome and SRA reads downloaded from NCBI. This organism has no proteins published in NCBI, so I used SRA RNA data (also from NCBI) to generate them. I don't know if this is a valid method, but it seems to work so far: without SRA data and using funannotate's own databases, only a few dozen HiQ proteins are generated, whereas with SRA data there are thousands. I have developed a method to do all of this automatically for multiple species at once, and I will publish it here on GitHub. It is a bash script that uses funannotate in the pipeline.

nextgenusfs commented 4 years ago

Okay, here is that record in the tbl file:

1031497 1030572 gene
            locus_tag   DAOMBR144_012983
1031497 1030572 mRNA
            product hypothetical protein
            transcript_id   gnl|ncbi|DAOMBR144_012983-T1_mrna
            protein_id  gnl|ncbi|DAOMBR144_012983-T1
1031476 1030571 CDS
            codon_start 1
            product hypothetical protein
            transcript_id   gnl|ncbi|DAOMBR144_012983-T1_mrna
            protein_id  gnl|ncbi|DAOMBR144_012983-T1

And that tbl file turned into GenBank looks like this:

     assembly_gap    1007728..1018212
                     /estimated_length=10485
                     /gap_type="within scaffold"
                     /linkage_evidence="paired-ends"
     gene            1029059..1030398
                     /locus_tag="DAOMBR144_012982"
     mRNA            join(1029059..1029250,1029334..1029525,1029604..1030398)
                     /locus_tag="DAOMBR144_012982"
                     /product="hypothetical protein"
     CDS             join(1029059..1029250,1029334..1029525,1029604..1030398)
                     /locus_tag="DAOMBR144_012982"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="ncbi:DAOMBR144_012982-T1"
                     /translation="MDSSQQPRGHLGVAQFDMTELDQIHEELRRMETGRDDDNNTDSV
                     GPMAPTAGSTQKKRTYEQRKEKLDTLTKEIKYLEAKLEYLKHSAGIPDTQTVEQQRIN
                     NALLREILRNQQYLTAGFRSALSADTSEHRPSPVTAQLHLGIDLRKRYEQLNDLRQEQ
                     LAGAKQFIDARTQFTNMTLRMSESSRFQSANGDTFAVKLDVIPLPQVKNVKQVYDAIV
                     YYMFNLEICLAEMLGDHVLREGDDDTGNARVSQHRLVTTNPDGLQVELNTVVFSDYNA
                     DAGRLDDDEAGGEGLITTDFVDSDELYPYRPHERLRKDITSIWMVKWYSTSQEQDQSI
                     RSPSAAQKKMVVLTRWVQSKLHRSAFDIPEDTLVELSESTNRATDAVLKAVRASLQFA
                     "
     CDS             complement(1030571..1031476)
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="ncbi:DAOMBR144_012983-T1"
                     /translation="MAQFRSTKFRNAKIYSIFEITESSTLQTVRSQQSMPPRGSGTTI
                     LGGFFIDVKIFACVLTALFAIFVLRKIFKFGESLRNSNHKARMPKLQTRTPVSYSAGV
                     LWPVGSMCVLWTSDYFCVKSRYASSRYSTKASSKKVESEDGESEQELKKESETDSHDS
                     KAERTMSSVGDRSLSRRSFSSISSKIHDVIEVLPSVLMTSREFRSLQHQMESLHNRGD
                     EVEATIAFMNLVTMSDPIVYFCFILGGGGGKRLGYYQSLQDPEKYFLLPQDAIGRKDL
                     RARELKLIYSVNSPSLRLKDLIHCG"
     gene            complement(1030572..1031497)
                     /locus_tag="DAOMBR144_012983"
     mRNA            complement(1030572..1031497)
                     /locus_tag="DAOMBR144_012983"
                     /product="hypothetical protein"
     assembly_gap    1031497..1031668
                     /estimated_length=172
                     /gap_type="within scaffold"
                     /linkage_evidence="paired-ends"

Since this is a negative (Crick) strand feature, the CDS (1030571..1031476) actually extends one base outside the boundaries of the gene and mRNA (1030572..1031497). As a result, tbl2asn, which writes the GBK files, apparently writes these features in the wrong order, and the GenBank parsing function apparently does not parse them properly.

The proper tbl annotation for this gene should be:

1031497 1030571 gene
            locus_tag   DAOMBR144_012983
1031497 1030571 mRNA
            product hypothetical protein
            transcript_id   gnl|ncbi|DAOMBR144_012983-T1_mrna
            protein_id  gnl|ncbi|DAOMBR144_012983-T1
1031476 1030571 CDS
            codon_start 1
            product hypothetical protein
            transcript_id   gnl|ncbi|DAOMBR144_012983-T1_mrna
            protein_id  gnl|ncbi|DAOMBR144_012983-T1

Would you be able to share this same genome result (tbl and gbk files) from predict? That way we can determine if the problem was caused initially during predict or if this happened in the update command.

atiweb commented 4 years ago

Yes, here are the files from predict: Globisporangium_ultimum_DAOM_BR144.gbk.tar.gz Globisporangium_ultimum_DAOM_BR144.tbl.tar.gz

Whatever you need, I'm at your service.

nextgenusfs commented 4 years ago

Thanks -- okay so it started out fine -- means this is a problem from update, more specifically from PASA trying to add UTRs

1031476 1030571 gene
            locus_tag   DAOMBR144_012983
1031476 1030571 mRNA
            product hypothetical protein
            transcript_id   gnl|ncbi|DAOMBR144_012983-T1_mrna
            protein_id  gnl|ncbi|DAOMBR144_012983-T1
1031476 1030571 CDS
            codon_start 1
            product hypothetical protein
            transcript_id   gnl|ncbi|DAOMBR144_012983-T1_mrna
            protein_id  gnl|ncbi|DAOMBR144_012983-T1

Okay, so now how about these intermediate files, so I can determine at which step the error occurred?
/mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/update_misc/pasa_final.gff3 /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/update_misc/bestmodels.gff3

Likely I'll have to write a function to verify the CDS --> mRNA --> gene coordinates make sense.
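As an illustration of the kind of check mentioned above, here is a minimal Python 3 sketch that verifies CDS, mRNA, and gene coordinates nest correctly. The function names are hypothetical and this is not the function that was added to funannotate. Coordinates are taken as (start, end) pairs in tbl order, so minus-strand features are listed high-to-low and are normalized first:

```python
def span(coords):
    """Normalize a (start, end) pair so that start <= end."""
    a, b = coords
    return (min(a, b), max(a, b))


def contains(outer, inner):
    """True if `inner` lies entirely within `outer` (inclusive)."""
    o_start, o_end = span(outer)
    i_start, i_end = span(inner)
    return o_start <= i_start and i_end <= o_end


def check_model(gene, mrna, cds):
    """Return a list of nesting problems; an empty list means consistent."""
    problems = []
    if not contains(gene, mrna):
        problems.append('mRNA extends outside gene')
    if not contains(mrna, cds):
        problems.append('CDS extends outside mRNA')
    return problems


# Coordinates from the update tbl record (broken) and the proposed fix.
bad = check_model((1031497, 1030572), (1031497, 1030572), (1031476, 1030571))
good = check_model((1031497, 1030571), (1031497, 1030571), (1031476, 1030571))
```

With the records from this thread, the broken model reports that the CDS extends outside the mRNA, and the corrected model reports no problems.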

atiweb commented 4 years ago

Here are those files: update_misc_required_files.tar.gz

nextgenusfs commented 4 years ago

Bizarre... so it is correct in both of these files... but that helps narrow down the problem.

PASA_final.gff3

# PASA_UPDATE: DAOMBR144_012983-T1, single gene model update, valid-1, status:[pasa:asmbl_15416,status:12], valid-1
scaffold_6  .   gene    1030571 1031497 .   -   .   ID=DAOMBR144_012983;Name=DAOMBR144_012983-T1
scaffold_6  .   mRNA    1030571 1031497 .   -   .   ID=DAOMBR144_012983-T1;Parent=DAOMBR144_012983;Name=DAOMBR144_012983-T1
scaffold_6  .   five_prime_UTR  1031477 1031497 .   -   .   ID=DAOMBR144_012983-T1.utr5p1;Parent=DAOMBR144_012983-T1
scaffold_6  .   exon    1030571 1031497 .   -   .   ID=DAOMBR144_012983-T1.exon1;Parent=DAOMBR144_012983-T1
scaffold_6  .   CDS 1030571 1031476 .   -   0   ID=cds.DAOMBR144_012983-T1;Parent=DAOMBR144_012983-T1

#PROT DAOMBR144_012983-T1 DAOMBR144_012983  MAQFRSTKFRNAKIYSIFEITESSTLQTVRSQQSMPPRGSGTTILGGFFIDVKIFACVLTALFAIFVLRKIFKFGESLRNSNHKARMPKLQTRTPVSYSAGVLWPVGSMCVLWTSDYFCVKSRYASSRYSTKASSKKVESEDGESEQELKKESETDSHDSKAERTMSSVGDRSLSRRSFSSISSKIHDVIEVLPSVLMTSREFRSLQHQMESLHNRGDEVEATIAFMNLVTMSDPIVYFCFILGGGGGKRLGYYQSLQDPEKYFLLPQDAIGRKDLRARELKLIYSVNSPSLRLKDLIHCG*

bestmodels.gff3

scaffold_6  PASA    gene    1030571 1031497 .   -   .   ID=DAOMBR144_012983;
scaffold_6  PASA    mRNA    1030571 1031497 .   -   .   ID=DAOMBR144_012983-T1;Parent=DAOMBR144_012983;Note=TPM:0.86;
scaffold_6  PASA    five_prime_UTR  1031477 1031497 .   -   .   ID=DAOMBR144_012983-T1.utr5p1;Parent=DAOMBR144_012983-T1;
scaffold_6  PASA    exon    1030571 1031497 .   -   .   ID=DAOMBR144_012983-T1.exon1;Parent=DAOMBR144_012983-T1;
scaffold_6  PASA    CDS 1030571 1031476 .   -   0   ID=cds.DAOMBR144_012983-T1;Parent=DAOMBR144_012983-T1;

nextgenusfs commented 4 years ago

Okay - getting somewhere now.... It seems that this model ends in a gap region of the assembly, so a helper function was trimming back that sequence. Under NCBI rules, genes cannot start or end in gaps, so the script is supposed to trim back to a non-gap base pair and then label the gene model partial. So I think the bug is actually in this trimming function -- it seems to have trimmed the wrong side of the gene model (likely a strand issue). Working on a fix now. A hard bug to find considering it would only show up in this edge case!
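The trimming idea can be sketched as follows. This is a hedged Python 3 illustration under stated assumptions, not funannotate's helper: `trim_to_gaps` and `partial_ends` are hypothetical names, coordinates are 1-based and inclusive with start <= end, and the trimming itself works purely on coordinates. Strand only decides which biological end (5' or 3') gets flagged as partial, which is where a strand mix-up could creep in:

```python
def trim_to_gaps(start, end, gaps):
    """Trim a feature so that neither end lies inside an assembly gap.

    Returns (new_start, new_end, partial_left, partial_right), where
    left/right refer to the low/high coordinate, independent of strand.
    """
    partial_left = partial_right = False
    for g_start, g_end in sorted(gaps):
        if g_start <= start <= g_end:      # low end sits in a gap
            start = g_end + 1
            partial_left = True
    for g_start, g_end in sorted(gaps, reverse=True):
        if g_start <= end <= g_end:        # high end sits in a gap
            end = g_start - 1
            partial_right = True
    if start > end:
        raise ValueError('feature lies entirely within gaps')
    return start, end, partial_left, partial_right


def partial_ends(strand, partial_left, partial_right):
    """Map coordinate-side partial flags to 5'/3' ends for a strand."""
    if strand == '+':
        return {'5prime': partial_left, '3prime': partial_right}
    return {'5prime': partial_right, '3prime': partial_left}


# Gene model and gaps from this thread (minus strand on scaffold_6).
gaps = [(1007728, 1018212), (1031497, 1031668)]
trimmed = trim_to_gaps(1030571, 1031497, gaps)
flags = partial_ends('-', trimmed[2], trimmed[3])
```

For the model in this issue, the high-coordinate end (the 5' end on the minus strand) touches the gap starting at 1031497, so only that side is trimmed.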

atiweb commented 4 years ago

Yes Jon, you are right: of the 20 genomes processed, this error showed up only in this last one. Quite odd indeed.

nextgenusfs commented 4 years ago

Okay, just pushed a fix (I hope). If you installed via GitHub, you should be able to do a git pull and then re-run the funannotate update command for that genome. It should overwrite the existing data, I think -- you can see if it worked by looking at the gene model in the GenBank file; the coordinates of this one should be (1030571, 1031496).

atiweb commented 4 years ago

I'm on it, will share results as soon as the data are ready.

nextgenusfs commented 4 years ago

Oh -- and I forgot to say (I think you know), but you then need to re-install with pip after pulling the changes.

atiweb commented 4 years ago

Hi Jon, the following error came up:

funannotate update -i /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/ --cpus 15
-------------------------------------------------------
[02:51 PM]: OS: linux2, 16 cores, ~ 33 GB RAM. Python: 2.7.17
[02:51 PM]: Running 1.7.3
[02:51 PM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt'
[02:51 PM]: Found relevant files in /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/training, will re-use them:
    Forward normalized reads: /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/training/normalize/left.norm.fq
    Reverse normalized reads: /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/training/normalize/right.norm.fq
    Trinity results: /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/training/funannotate_train.trinity-GG.fasta
    PASA config file: /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/training/pasa/alignAssembly.txt
    BAM alignments: /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/training/funannotate_train.coordSorted.bam
    StringTie GTF: /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/training/funannotate_train.stringtie.gtf
[02:51 PM]: Reannotating Globisporangium_ultimum, NCBI accession: None
[02:51 PM]: Previous annotation consists of: 14,624 protein coding gene models and 594 non-coding gene models
[02:51 PM]: Trimmomatic will be skipped
[02:51 PM]: Existing BAM alignments found: /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/update_misc/trinity.alignments.bam, /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/update_misc/transcript.alignments.bam
[02:51 PM]: Skipping PASA, found existing output: /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/update_misc/pasa_final.gff3
[02:51 PM]: Existing Kallisto output found: /mnt/sdb/Globisporangium_ultimum/Globisporangium_ultimum/DAOM_BR144/fun_out/update_misc/kallisto.tsv
[02:51 PM]: Parsing Kallisto results. Keeping alt-splicing transcripts if expressed at least 10.0% of highest transcript per locus.
[02:51 PM]: Wrote 15,034 transcripts derived from 14,637 protein coding loci.
[02:51 PM]: Converting to Genbank format
[02:54 PM]: Collecting final annotation files
[02:54 PM]: Parsing GenBank files...comparing annotation
[02:54 PM]: putative transcript from ncbi:DAOMBR144_012983-T1 has no ID
(ncbi:DAOMBR144_012983-T1 None ncbi:DAOMBR144_012983-T1)
Traceback (most recent call last):
  File "/usr/local/bin/funannotate", line 657, in <module>
    main()
  File "/usr/local/bin/funannotate", line 647, in main
    mod.main(arguments)
  File "/usr/local/lib/python2.7/dist-packages/funannotate/update.py", line 2306, in main
    compareAnnotations2(GBK, final_gbk, Changes, args=args)
  File "/usr/local/lib/python2.7/dist-packages/funannotate/update.py", line 1304, in compareAnnotations2
    newGenes[gene[2]]['CDS'], hitInfo['CDS'])
  File "/usr/local/lib/python2.7/dist-packages/funannotate/update.py", line 1439, in pairwiseAED
    splitAED = [pAED[i:i+len(query)] for i in range(0, len(pAED), len(query))]
ValueError: range() step argument must not be zero

nextgenusfs commented 4 years ago

Okay - so not fixed. What does the record look like for DAOMBR144_012983 in the tbl file?
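For context on the ValueError in the traceback above: the list comprehension in pairwiseAED uses len(query) as the step of range(), so an empty query gives a step of zero. Whether the empty query came from the model with no ID is an assumption here. A minimal Python 3 sketch of the failure and a defensive variant, using the hypothetical name `split_chunks`:

```python
def split_chunks(values, size):
    """Split `values` into consecutive chunks of length `size`.

    Returns an empty list when `size` is zero or negative instead of
    letting range() raise ValueError.
    """
    if size <= 0:
        return []
    return [values[i:i + size] for i in range(0, len(values), size)]


# Reproduce the original failure mode: a zero step is rejected by range().
try:
    range(0, 4, 0)
    zero_step_rejected = False
except ValueError:
    zero_step_rejected = True

chunks = split_chunks([0.1, 0.2, 0.3, 0.4], 2)
empty = split_chunks([0.1, 0.2], 0)
```

A guard like this only hides the symptom; the real fix in this thread was to regenerate the incorrect tbl file.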

atiweb commented 4 years ago

1031497 1030572 gene
            locus_tag   DAOMBR144_012983
1031497 1030572 mRNA
            product hypothetical protein
            transcript_id   gnl|ncbi|DAOMBR144_012983-T1_mrna
            protein_id  gnl|ncbi|DAOMBR144_012983-T1
1031476 1030571 CDS
            codon_start 1
            product hypothetical protein
            transcript_id   gnl|ncbi|DAOMBR144_012983-T1_mrna
            protein_id  gnl|ncbi|DAOMBR144_012983-T1

Here is the full file if needed: Globisporangium_ultimum_DAOM_BR144.tbl.tar.gz

nextgenusfs commented 4 years ago

Okay, I'll have to look at it when I get home from work.

nextgenusfs commented 4 years ago

Oh, I think you just need to delete a temporary folder as it re-used the existing (incorrect) tbl file, so delete the update_misc/tbl2asn directory and then re-run.
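The cleanup step above can also be scripted. This is a small Python 3 sketch, assuming the directory layout named in this thread (update_misc/tbl2asn inside the funannotate output folder); `remove_stale_tbl2asn` is a hypothetical helper name, and the demo runs against a temporary directory rather than real data:

```python
import shutil
import tempfile
from pathlib import Path


def remove_stale_tbl2asn(fun_out):
    """Delete <fun_out>/update_misc/tbl2asn if present; return True if removed."""
    target = Path(fun_out) / 'update_misc' / 'tbl2asn'
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Demo on a throwaway directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'update_misc' / 'tbl2asn').mkdir(parents=True)
    first = remove_stale_tbl2asn(tmp)   # directory existed and was removed
    second = remove_stale_tbl2asn(tmp)  # nothing left to remove
```

After removing the directory, funannotate update would be re-run as described.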

nextgenusfs commented 4 years ago

I will push an update to re-run that function even if an existing file is present, as that would make more sense; i.e., as it stands, any other changes to the pipeline won't take effect if that file exists.

atiweb commented 4 years ago

Thank you Jon. I'm on it. Will share results.

atiweb commented 4 years ago

Hi Jon, I have re-run the funannotate iprscan and funannotate annotate commands. When running funannotate annotate, the following error came up:

Parsing InterProScan5 XML file
[01/13/20 11:23:57]: /usr/bin/python /usr/local/lib/python2.7/dist-packages/funannotate/aux_scripts/iprscan2annotations.py /mnt/sdb/auto_pyt/Globisporangium_ultimum/DAOM_BR144/fun_out/annotate_misc/iprscan.xml /mnt/sdb/auto_pyt/Globisporangium_ultimum/DAOM_BR144/fun_out/annotate_misc/annotations.iprscan.txt
[01/13/20 11:23:57]: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/funannotate/aux_scripts/iprscan2annotations.py", line 32, in <module>
    for _, elem in tree:
  File "<string>", line 91, in next
cElementTree.ParseError: mismatched tag: line 241, column 4

Here is the full log if needed and the iprscan xml file: iprscan.xml.tar.gz funannotate-annotate.log.tar.gz

Before all this, I deleted update_misc/tbl2asn and ran funannotate update again. Thank you.

nextgenusfs commented 4 years ago

Hi @atiweb. I think this IPRscan issue is related to a change in the XML format in most recent interproscan output. This caused the split output files to not be combined properly. I think I pushed a fix to this yesterday, but I haven't been able to test it. Unfortunately the fix would require re-running IPRscan with the updated code.

nextgenusfs commented 4 years ago

And I just pushed an option for --debug to funannotate iprscan, this will keep the intermediate files so if that previous fix doesn't work, you shouldn't need to run it again.

atiweb commented 4 years ago

You are right Jon, it is a problem with the format of the XML output of interproscan. The problem is that some <protein> tags are missing. Thanks for the tips.
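A quick way to locate this kind of problem before running annotate is to stream-parse the XML and report the first error. The sketch below is Python 3 using only the standard library; `first_xml_error` is a hypothetical helper, and the tiny documents are made-up examples rather than real InterProScan output:

```python
import io
import xml.etree.ElementTree as ET


def first_xml_error(source):
    """Return None if `source` is well-formed, else (line, column, message).

    `source` may be a filename or a binary file object.
    """
    try:
        for _event, elem in ET.iterparse(source, events=('end',)):
            elem.clear()  # free memory; only well-formedness matters here
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# Toy documents: the second one closes a <protein> that was never opened.
good_doc = io.BytesIO(b'<root><protein><matches/></protein></root>')
bad_doc = io.BytesIO(b'<root>\n<matches/></protein>\n</root>')

good_result = first_xml_error(good_doc)
bad_result = first_xml_error(bad_doc)
```

On a real run, the filename of iprscan.xml would be passed instead of the in-memory examples, and the reported line number points at the first mismatched tag.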

hyphaltip commented 4 years ago

Yeah, I was just helping debug on my end. I think the fix that was checked in solved it. Will do some more local testing.


nextgenusfs commented 4 years ago

Let me know if this is now working again with newest IPRscan.

atiweb commented 4 years ago

On it!!! Thanks for the great support. Will share results. By the way, I was trying to use the latest signalp, version 5, but had to use version 4 because the funannotate code was not aware of the new signalp 5 format. Did you already fix that?

nextgenusfs commented 4 years ago

Others have reported that signalP 5 works in the annotation step https://github.com/nextgenusfs/funannotate/issues/352 -- it just "fails" in funannotate check. I can't run it locally (the macOS binary has library dependency issue on my system and we don't have license at work so can't check on linux). I just need to know how you get the version out of signalP 5, ie. signalp --version? And specifically need to know if it prints to stderr or stdout in order to fix funannotate check.

hyphaltip commented 4 years ago

$ ./bin/signalp -version
SignalP version 5.0b Linux x86_64

nextgenusfs commented 4 years ago

Thanks @hyphaltip. And if you redirect stdout to a file as below, does the version still show in the terminal, or is it written to the file (i.e., printed to stdout)?

./bin/signalp -version > signalp.stdout

atiweb commented 4 years ago

Here is the output of both versions on my system, printed to stdout:

$ signalp -V
signalp 4.1

$ signalp-5.0/bin/signalp -version
SignalP version 5.0 Linux x86_64

Running

$ signalp-5.0/bin/signalp -version > signalp.stdout

creates a file named signalp.stdout in the current directory.

Greetings

hyphaltip commented 4 years ago

Yes it is writing to stdout as @atiweb mentions too.
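The stdout-versus-stderr question can also be answered programmatically by capturing both streams separately. This is a Python 3 sketch, not the code used in funannotate check; `probe_version` is a hypothetical name, and the demo uses the Python interpreter as a stand-in command so that it runs without signalp installed:

```python
import subprocess
import sys


def probe_version(cmd):
    """Run `cmd` and return (stdout, stderr) as stripped strings."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE,
                          universal_newlines=True)
    return proc.stdout.strip(), proc.stderr.strip()


# Stand-in for: probe_version(['signalp', '-version'])
out, err = probe_version(
    [sys.executable, '-c', "print('SignalP version 5.0b Linux x86_64')"]
)
```

Whichever of the two returned strings is non-empty tells you which stream the tool writes its version to.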

nextgenusfs commented 4 years ago

Great thanks.

atiweb commented 4 years ago

Let me know if this is now working again with newest IPRscan.

I just tried InterProScan-5.39-77.0, and the problem of the missing <protein> tags persists in the XML output in standalone mode.

atiweb commented 4 years ago

I have coded a workaround for this problem of the missing <protein> tag in the iprscan.xml file. I also reported it on the interproscan GitHub page; see https://github.com/ebi-pf-team/interproscan/issues/135. This is the code:


fix_iprscan_issue.sh.tar.gz


It is written in bash for Ubuntu 18.04; maybe you can improve it and port it to Python. It is harmless if the iprscan.xml file is OK. Maybe the interproscan team will fix the issue, but meanwhile it is handy. Also, not everybody can easily update interproscan to the latest version if needed. Thanks.

nextgenusfs commented 4 years ago

Thanks -- so is the IPR XML output missing the tag when you run it manually or only if you run it with funannotate iprscan?

atiweb commented 4 years ago

Thanks -- so is the IPR XML output missing the tag when you run it manually or only if you run it with funannotate iprscan?

So far only when running funannotate iprscan, but I will try manually soon. I think running it manually triggers the same bug, but I am not 100% sure. I will report back then.

nextgenusfs commented 4 years ago

Okay - so the error you posted on IPR GitHub is most likely a result of how funannotate is trying to recombine the runs from multiple IPRscan jobs. I just need to see the XML output from a small number of test proteins when its run manually.

atiweb commented 4 years ago

Okay - so the error you posted on IPR GitHub is most likely a result of how funannotate is trying to recombine the runs from multiple IPRscan jobs. I just need to see the XML output from a small number of test proteins when its run manually.

OK @nextgenusfs, I will upload the required files when ready. I did not know that funannotate recombines the runs of IPRscan; that changes the scenario.

Thank you.

nextgenusfs commented 4 years ago

So in my speed tests, it was much, much faster to split the input into chunks, launch X number of IPR jobs, and then recombine the results than to run one large job. It seems that the XML header has changed somewhat. For speed, I was simply skipping the first line of the output to combine the XML files, which worked for several IPR-5 versions; this now appears to have changed. So either we need to read the XML file to get the tags correct and then combine, or figure out what their new format is so we can adjust the parsing/combining to work as intended.
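The first option, reading each XML file properly and then combining, can be sketched in Python 3 as below. This is an illustration under stated assumptions and not the fix that was committed: `merge_xml` is a hypothetical name, the root tag `results` is made up for the toy input, and real InterProScan output carries a namespace and a header that this toy input omits:

```python
import xml.etree.ElementTree as ET


def merge_xml(chunks):
    """Merge XML documents that share the same root tag.

    `chunks` is an iterable of XML strings. The children of every later
    root are appended to the first root, which is returned.
    """
    merged = None
    for text in chunks:
        root = ET.fromstring(text)
        if merged is None:
            merged = root
            continue
        if root.tag != merged.tag:
            raise ValueError('root tags differ: %s vs %s' % (merged.tag, root.tag))
        merged.extend(list(root))
    return merged


# Toy chunk outputs standing in for per-job XML files.
chunk_a = '<results><protein id="p1"/><protein id="p2"/></results>'
chunk_b = '<results><protein id="p3"/></results>'

merged = merge_xml([chunk_a, chunk_b])
ids = [p.get('id') for p in merged.findall('protein')]
```

Parsing costs more than skipping a fixed number of header lines, but it does not depend on how many lines the header occupies in a given InterProScan release.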

atiweb commented 4 years ago

Great @nextgenusfs, I'm on it then. Will upload files when ready. Do the same protein files need to be processed both manually by IPRscan and by funannotate iprscan? Which file from the funannotate output folder do I need to feed into IPRscan manually?

nextgenusfs commented 4 years ago

It just uses the *.proteins.fasta file, but you can just grab the first like 500 proteins (doesn't matter what they are) and then could run the following:

1) run funannotate iprscan

funannotate iprscan -i proteins.fasta -n 100 -o test.funannotate.xml --debug

2) run IPRscan manually

interproscan.sh -i proteins.fasta -d . -f XML -goterms -pa

So what I want to see is the XML files in the temporary directory of the funannotate iprscan outputs; with the --debug flag, that directory should not be deleted.
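Grabbing the first few hundred proteins for such a test can be done with a few lines of Python 3. This is a sketch with hypothetical names (`head_fasta`, and the file names in the comment), working on any iterable of lines so that it can be tried without a real FASTA file:

```python
def head_fasta(lines, n):
    """Yield the lines belonging to the first `n` FASTA records."""
    seen = 0
    for line in lines:
        if line.startswith('>'):
            seen += 1
            if seen > n:
                break
        if seen:  # ignore anything before the first header
            yield line


# Toy input; real use would iterate over an open file, for example:
#   with open('genome.proteins.fasta') as src, open('proteins.fasta', 'w') as dst:
#       dst.writelines(head_fasta(src, 500))
records = ['>p1\n', 'MAQ\n', 'FRS\n', '>p2\n', 'MDS\n', '>p3\n', 'MKK\n']
subset = list(head_fasta(records, 2))
```

The resulting proteins.fasta can then be used as the input for both commands in the two steps above.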

atiweb commented 4 years ago

Thanks @nextgenusfs. Will share results when ready.

atiweb commented 4 years ago

Fixed in https://github.com/nextgenusfs/funannotate/commit/f75299342417fd8636cf30b2832a26a6a4389c11 Thanks...