nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Antismash v5.0 isn't being parsed #292

Closed PlantDr430 closed 5 years ago

PlantDr430 commented 5 years ago

I am currently using the latest version, which I pulled from GitHub today (v1.5.3-21ad095).

I am also using the newest version of antiSMASH (v5). However, I noticed that the qualifiers in the .gbk output differ from those in the .gbk files I had from antiSMASH v4. Perhaps this is why no clusters or smCOGs are being parsed out.

I have attached my log file and a version of the .gbk results showing a portion of the output of antiSMASH v5. funannotate-annotate.log antiSMASH.results.txt

nextgenusfs commented 5 years ago

Yippeee! Always fun when formats change..... is there a tag saying which version of antismash the result is from?

Edit: looks like the version is in the comment section, so an updated parser will need to be added to the code.

PlantDr430 commented 5 years ago

yea, in the .gbk they have this:

antiSMASH-Data-START

        Version      :: 5.0.0rc1
        Run date     :: 2019-05-10 16:52:23

antiSMASH-Data-END

but there isn't a tag such as this in the v4 .gbk's
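As a rough illustration of how a parser could branch on this tag, here is a minimal sketch in plain Python. It assumes the block looks like the excerpt above and that files without the block are pre-v5; the function names `antismash_version` and `major_version` are hypothetical and are not taken from funannotate.

```python
import re
from typing import Optional

# Matches a line such as "Version      :: 5.0.0rc1" inside the antiSMASH-Data block.
_VERSION_RE = re.compile(r"Version\s*::\s*(\S+)")


def antismash_version(gbk_text: str) -> Optional[str]:
    """Return the antiSMASH version string found in GenBank text, or None."""
    start = gbk_text.find("antiSMASH-Data-START")
    if start == -1:
        return None  # no tag at all, as reported for the v4 files
    end = gbk_text.find("antiSMASH-Data-END", start)
    block = gbk_text[start:end] if end != -1 else gbk_text[start:]
    match = _VERSION_RE.search(block)
    return match.group(1) if match else None


def major_version(gbk_text: str, default: int = 4) -> int:
    """Major version as an int; `default` is an assumption for untagged files."""
    version = antismash_version(gbk_text)
    if version is None:
        return default
    return int(version.split(".")[0])
```

A caller could then pick the v4 or v5 parsing path from `major_version(text)`, with the caveat that treating every untagged file as v4 is only a guess.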

nextgenusfs commented 5 years ago

In the example you posted above, it seems that the annotation is not incrementing numerically as expected, i.e. there are two 'protocluster' features, but both say they are from the same "number". Is this the case throughout the gbk file output? Here are the two "protocluster" features:

     protocluster    31439..78329
                     /aStool="rule-based-clusters"
                     /contig_edge="False"
                     /core_location="join{[51438:51715](+), [51814:52199](+),
                     [52265:52794](+), [52859:57416](+), [57480:58329](+)}"
                     /cutoff="20000"
                     /detection_rule="cds(PKS_AT and (PKS_KS or ene_KS or mod_KS
                     or hyb_KS or itr_KS or tra_KS))"
                     /neighbourhood="20000"
                     /product="T1PKS"
                     /protocluster_number="1"
                     /tool="antismash"
     proto_core      join(51439..51715,51815..52199,52266..52794,52860..57416,
                     57481..58329)
                     /aStool="rule-based-clusters"
                     /cutoff="20000"
                     /detection_rule="cds(PKS_AT and (PKS_KS or ene_KS or mod_KS
                     or hyb_KS or itr_KS or tra_KS))"
                     /neighbourhood="20000"
                     /product="T1PKS"
                     /protocluster_number="1"

And then there is the other one, which looks like this:

     protocluster    64344..107816
                     /aStool="rule-based-clusters"
                     /contig_edge="True"
                     /core_location="join{[91647:92574](-), [91554:91580](-),
                     [91368:91464](-), [91070:91264](-), [85323:90989](-),
                     [85064:85241](-), [84343:84982](-)}"
                     /cutoff="20000"
                     /detection_rule="cds(PKS_AT and (PKS_KS or ene_KS or mod_KS
                     or hyb_KS or itr_KS or tra_KS))"
                     /neighbourhood="20000"
                     /product="T1PKS"
                     /protocluster_number="1"
                     /tool="antismash"
     proto_core      complement(join(84344..84982,85065..85241,85324..90989,
                     91071..91264,91369..91464,91555..91580,91648..92574))
                     /aStool="rule-based-clusters"
                     /cutoff="20000"
                     /detection_rule="cds(PKS_AT and (PKS_KS or ene_KS or mod_KS
                     or hyb_KS or itr_KS or tra_KS))"
                     /neighbourhood="20000"
                     /product="T1PKS"
                     /protocluster_number="1"

So I'm wondering if this is correct. Are these two "protocluster" features part of the same cluster, or is this a mistake? Does the HTML output match this? They appear to overlap, so perhaps the underlying code is correct. Does that mean that all "clusters" have this protocluster annotation, or is this a subset of the cluster annotation?
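The overlap mentioned here can be checked directly from the two protocluster locations quoted above. This is only a small sketch, assuming 1-based inclusive GenBank coordinates; `overlap_length` is a hypothetical helper, not part of funannotate or antiSMASH.

```python
from typing import Tuple

# 1-based, inclusive (start, end), as written in GenBank feature locations.
Interval = Tuple[int, int]


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of bases shared by two inclusive intervals (0 if they are disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


first = (31439, 78329)    # first protocluster from the excerpt
second = (64344, 107816)  # second protocluster from the excerpt
print(overlap_length(first, second))
```

For these two features the shared span is 64344..78329, so the neighbourhoods do overlap even though the proto_core regions do not.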

nextgenusfs commented 5 years ago

Update: I think I figured out what is happening. It seems that the numbering is contig-specific, i.e. it restarts counting from 1 for each GenBank record (contig). It also looks like they are now using contig.num for naming in the HTML output.
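Because the number restarts on every record, a parser that wants genome-wide cluster names has to key on the contig as well. The sketch below shows one way to do that, assuming the per-contig numbers have already been collected; `label_clusters` and the `Cluster_N` naming are hypothetical and are not funannotate's actual scheme.

```python
from typing import Dict, Iterable, List, Tuple


def label_clusters(
    records: Iterable[Tuple[str, List[int]]]
) -> Dict[Tuple[str, int], str]:
    """Map (contig, per-contig number) to a genome-wide unique label."""
    labels: Dict[Tuple[str, int], str] = {}
    counter = 0
    for contig, numbers in records:
        for number in numbers:
            counter += 1  # running counter across all contigs
            labels[(contig, number)] = f"Cluster_{counter}"
    return labels


# Two contigs that both contain a feature numbered 1.
labels = label_clusters([("contig_1", [1]), ("contig_2", [1, 2])])
for (contig, number), label in labels.items():
    print(f"{contig}.{number} -> {label}")
```

Keying on the (contig, number) pair mirrors the contig.num naming seen in the HTML output and avoids collisions between records.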

PlantDr430 commented 5 years ago

That would make sense, as my results usually have only one cluster per contig. It is interesting, though, that your run appears to show more clusters than mine.

I also noticed that in some other contigs /protocluster_number="1" appears with different products, such as NRPS-like or terpene, which would indicate that the numbering isn't product-related and does appear to be contig-related.

     protocluster    9399..53295
                     /aStool="rule-based-clusters"
                     /contig_edge="False"
                     /core_location="[29398:33295](-)"
                     /cutoff="0"
                     /detection_rule="cds((PP-binding or NAD_binding_4) and
                     (AMP-binding or A-OX))"
                     /neighbourhood="20000"
                     /product="NRPS-like"
                     /protocluster_number="1"
                     /tool="antismash"
     proto_core      complement(29399..33295)
                     /aStool="rule-based-clusters"
                     /cutoff="0"
                     /detection_rule="cds((PP-binding or NAD_binding_4) and
                     (AMP-binding or A-OX))"
                     /neighbourhood="20000"
                     /product="NRPS-like"
                     /protocluster_number="1"

nextgenusfs commented 5 years ago

I didn't use the same genome ;)

nextgenusfs commented 5 years ago

The goal is to get this updated today; I'll post here when it's working.

nextgenusfs commented 5 years ago

Okay, I think I have it fixed. If you wouldn't mind testing the latest commit, that would be helpful. The version should be:

$ funannotate version
funannotate v1.6.0-046e957

PlantDr430 commented 5 years ago

The parser picked up the clusters and smCOGs, but reported that I don't have any backbone biosynthetic enzymes. I believe I do have some, as antiSMASH is flagging some genes as "core biosynthetic genes".

[03:48 PM]: Now parsing antiSMASH v5 results, finding SM clusters
[03:48 PM]: Found 32 clusters, 0 backbone biosynthetic enyzmes, and 77 smCOGs predicted by antiSMASH
[03:48 PM]: Found 0 duplicated annotations, adding 52,327 valid annotations
[03:48 PM]: Converting to final Genbank format, good luck!
[03:50 PM]: Creating AGP file and corresponding contigs file
[03:50 PM]: Cross referencing SM cluster hits with MIBiG database version 1.3
[03:50 PM]: Creating tab-delimited SM cluster output
[03:50 PM]: Writing genome annotation table.
[03:50 PM]: Funannotate annotate has completed successfully!

    We need YOUR help to improve gene names/product descriptions:
       0 gene/products names MUST be fixed, see LM461_fun_output/annotate_results/Gene2Products.must-fix.txt
       1 gene/product names need to be curated, see LM461_fun_output/annotate_results/Gene2Products.need-curating.txt
       60 gene/product names passed but are not in Database, see LM461_fun_output/annotate_results/Gene2Products.new-names-passed.txt

    Please consider contributing a PR at https://github.com/nextgenusfs/gene2product

-------------------------------------------------------
stephenwyka@bspmgenomics:/data/wyka$

nextgenusfs commented 5 years ago

OK, thanks. It's not really a big deal, I don't think, as it is simply a counter. Do the results in annotate_result/*.cluster.txt make sense?

I wonder if this is a difference between 5.0.0 [which I ran on the web server] and 5.0.0rc1 [which seems to be what you have].

PlantDr430 commented 5 years ago

Yes, the results in annotate_result/*.cluster.txt make sense

nextgenusfs commented 5 years ago

Thanks, I'll see if I can fix the counter.

nextgenusfs commented 5 years ago

Okay, it should now be counting the biosynthetic enzymes based on 'gene_kind' = 'biosynthetic' in the CDS metadata.
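To make the counting rule concrete, here is a minimal stdlib-only sketch of the idea. It assumes the qualifier is written as /gene_kind="biosynthetic" on its own line under a CDS feature, in the usual GenBank feature-table layout; `count_biosynthetic_cds` is a hypothetical name and this is not the code that was committed to funannotate.

```python
def count_biosynthetic_cds(gbk_text: str) -> int:
    """Count CDS features that carry /gene_kind="biosynthetic"."""
    count = 0
    in_cds = False
    for line in gbk_text.splitlines():
        if not line.startswith(" "):
            # Top-level keyword (LOCUS, FEATURES, ORIGIN, //) or blank line.
            in_cds = False
        elif line.startswith("     ") and len(line) > 5 and not line[5].isspace():
            # Feature keys begin in column 6; remember whether this one is a CDS.
            in_cds = line.split()[0] == "CDS"
        elif in_cds and line.strip() == '/gene_kind="biosynthetic"':
            count += 1
    return count
```

A real parser would more likely read the qualifiers through a GenBank library rather than scanning lines, but the condition being tested is the same.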

PlantDr430 commented 5 years ago

Thank you

PlantDr430 commented 5 years ago

So this fix worked on all my genomes except one, where I got this error:

[03:12 PM]: Now parsing antiSMASH v5 results, finding SM clusters
Traceback (most recent call last):
  File "/data/wyka/funannotate-master/bin/funannotate-functional.py", line 878, in <module>
    lib.ParseAntiSmash(antismash_input, AntiSmashFolder, AntiSmashBed, AntiSmash_annotations) #results in several global dictionaries
  File "/data/wyka/funannotate-master/lib/library.py", line 5320, in ParseAntiSmash
    numericalContig = int(''.join(filter(str.isdigit, chr)))
UnboundLocalError: local variable 'chr' referenced before assignment
stephenwyka@bspmgenomics:/data/wyka/final_funannotate/Cpur20_1$

nextgenusfs commented 5 years ago

Thanks, that one was a typo: https://github.com/nextgenusfs/funannotate/commit/0c6732d0f408e66822cc3eea1c159aa6d74ceb9c. A git pull should fix it.
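For readers who hit the same traceback, the general Python behaviour can be reproduced in a few lines. This sketch is not the content of the commit linked above, which is only described as a typo. It illustrates one common way the error arises: a name assigned anywhere in a function is local to that function, so reading it on a path where it was never assigned raises UnboundLocalError. Both function names are hypothetical.

```python
def numerical_contig_buggy(contig_name: str, rename: bool = False) -> int:
    if rename:
        # Assigning to `chr` anywhere in the function makes it a local name,
        # hiding the builtin chr() for the whole function body.
        chr = contig_name
    # When rename is False, the local `chr` was never assigned.
    return int("".join(filter(str.isdigit, chr)))


def numerical_contig(contig_name: str) -> int:
    """Concatenate the digits in a contig name, e.g. 'scaffold_12' -> 12."""
    digits = "".join(filter(str.isdigit, contig_name))
    if not digits:
        raise ValueError(f"no digits in contig name: {contig_name!r}")
    return int(digits)


try:
    numerical_contig_buggy("scaffold_12")
except UnboundLocalError as err:
    print("buggy version failed:", err)

print(numerical_contig("scaffold_12"))
```

Using a distinct variable name that does not shadow a builtin, and assigning it on every path, avoids this class of error.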

PlantDr430 commented 5 years ago

Got it to work