nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

BRAKER2 with InterProScan #308

Closed mictadlo closed 4 years ago

mictadlo commented 5 years ago

Hi. Unfortunately, I discovered your project too late and had already run BRAKER2. I also ran InterProScan, which produced GFF3, TSV and XML files. It appears that InterProScan's GFF3 file does not contain any chromosome names:

##gff-version 3
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269
##interproscan-version 5.36-75.0
##sequence-region g109343.t1 1 1358
g109343.t1      .       polypeptide     1       1358    .       +       .       ID=g109343.t1;md5=3e908dc966fefe367e64dc9d98b0d3ab
g109343.t1      ProSiteProfiles protein_match   628     724     18.261  +       .       date=20-07-2019;Target=g109343.t1 628 724;Ontology_term="GO:0015074";ID=match$1_628_724;signature_desc=Integrase catalytic domain profile.;Name=PS50994;status=T;Dbxref="InterPro:IPR001584"
g109343.t1      SUPERFAMILY     protein_match   586     624     4.19E-5 +       .       date=20-07-2019;Target=g109343.t1 586 624;Ontology_term="GO:0003676","GO:0008270";ID=match$2_586_624;Name=SSF57756;status=T;Dbxref="InterPro:IPR036875"
g109343.t1      SUPERFAMILY     protein_match   622     725     4.93E-29        +       .       date=20-07-2019;Target=g109343.t1 622 725;ID=match$3_622_725;Name=SSF53098;status=T;Dbxref="InterPro:IPR012337"
g109343.t1      ProSiteProfiles protein_match   278     294     9.636   +       .       date=20-07-2019;Target=g109343.t1 278 294;Ontology_term="GO:0003676","GO:0008270";ID=match$4_278_294;signature_desc=Zinc finger CCHC-type profile.;Name=PS50158;status=T;Dbxref="InterPro:IPR001878"
g109343.t1      SMART   protein_match   600     616     0.36    +       .       date=20-07-2019;Target=g109343.t1 600 616;Ontology_term="GO:0003676","GO:0008270";ID=match$5_600_616;Name=SM00343;status=T;Dbxref="InterPro:IPR001878"
g109343.t1      SMART   protein_match   278     294     6.3E-4  +       .       date=20-07-2019;Target=g109343.t1 278 294;Ontology_term="GO:0003676","GO:0008270";ID=match$5_278_294;Name=SM00343;status=T;Dbxref="InterPro:IPR001878"
g109343.t1      Pfam    protein_match   82      216     1.2E-24 +       .       date=20-07-2019;Target=g109343.t1 82 216;ID=match$6_82_216;signature_desc=gag-polypeptide of LTR copia-type;Name=PF14223;status=T

Is there a way to load the InterProScan results as a track into JBrowse, or is it still possible to combine the InterProScan and BRAKER2 results with your scripts?

Thank you in advance,

Michal

hyphaltip commented 5 years ago

InterProScan results will not be in chromosomal coordinates but in protein coordinates, since it annotates domains in proteins. If you want to map protein domains to genomic coordinates, you need to go through a coordinate transformation. Here's an example of how I did this with Pfam domains many years ago with BioPerl. There may be alternative ways to do this, but I would reach out to the JBrowse developers to ask how they recommend adding protein domain tracks for genes in chromosome space.

https://github.com/hyphaltip/genome-scripts/blob/master/gbrowse_tools/map_hmmertab2genome.pl
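For illustration only, the transformation described above can be sketched in Python. This is not the linked BioPerl script; `protein_to_genome` is a hypothetical helper, and it assumes a complete CDS that starts in phase 0, with CDS segments given as 1-based inclusive genomic coordinates (as in GFF3). The coordinates in the usage lines are made up.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def protein_to_genome(cds: List[Interval], strand: str,
                      aa_start: int, aa_end: int) -> List[Interval]:
    """Map a protein interval (1-based, inclusive) onto genomic CDS segments.

    Assumption: the CDS is complete and the first codon starts at the
    first transcribed base of the first CDS segment (phase 0).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Offsets into the spliced coding sequence, 0-based and half-open.
    nt_start = (aa_start - 1) * 3
    nt_end = aa_end * 3
    # Walk the CDS segments in transcription order.
    ordered = sorted(cds, reverse=(strand == "-"))
    pieces: List[Interval] = []
    consumed = 0  # coding bases covered by earlier segments
    for seg_start, seg_end in ordered:
        seg_len = seg_end - seg_start + 1
        lo = max(nt_start, consumed)
        hi = min(nt_end, consumed + seg_len)
        if lo < hi:  # the domain overlaps this segment
            if strand == "+":
                pieces.append((seg_start + (lo - consumed),
                               seg_start + (hi - consumed) - 1))
            else:
                pieces.append((seg_end - (hi - consumed) + 1,
                               seg_end - (lo - consumed)))
        consumed += seg_len
    return sorted(pieces)


# Hypothetical two-exon gene: a domain at residues 8-15 spans the intron.
example_cds = [(101, 130), (201, 260)]
print(protein_to_genome(example_cds, "+", 8, 15))  # [(122, 130), (201, 215)]
print(protein_to_genome(example_cds, "-", 8, 15))  # [(216, 239)]
```

Each returned interval could then be written out as one GFF3 feature on the chromosome, which is the form a genome browser track expects. Partial gene models or a non-zero starting phase would need extra handling that this sketch leaves out.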

nextgenusfs commented 5 years ago

If you want to add functional annotation to your coding gene model predictions, then you can pass your FASTA genome + GFF annotation + InterPro XML annotation file to funannotate annotate:

funannotate annotate --fasta genome.fa --gff braker.gff3 --iprscan protIPR.xml \
    --out output_folder --species "Genus species" 

This will extract protein models and then assign functional annotation to those predictions.

To re-predict gene models with funannotate, I would suggest starting from the beginning of the workflow if you have RNA-seq data. That would be mask --> train --> predict --> update --> annotate.

mictadlo commented 5 years ago

Hi, I ran into this problem:

docker run -it --rm -v $PWD:/home/linuxbrew/data nextgenusfs/funannotate
~/data$ funannotate annotate --fasta NbV1ChF.fasta --gff braker-NbAllMerged-BAM-soft_utr.gff3 --iprscan augustus.hints_utr.aa.xml --out out --species "NBenth"

What did I miss? I also tried funannotate setup, but which arguments should I use?

Thank you in advance.

Michal

nextgenusfs commented 5 years ago

What’s the error?

mictadlo commented 5 years ago

$ funannotate annotate --fasta NbV1ChF.fasta --gff braker-NbAllMerged-BAM-soft_utr.gff3 --iprscan augustus.hints_utr.aa.xml --out out --species "NBenth" 
-------------------------------------------------------
[02:58 PM]: OS: linux2, 4 cores, ~ 5 GB RAM. Python: 2.7.15
[02:58 PM]: Running funannotate v1.5.3
[02:58 PM]: Database files not found in /home/linuxbrew/DB, run funannotate database and/or funannotate setup

nextgenusfs commented 5 years ago

Okay -- likely due to the MiBIG database moving location. So I assume you got an error during the docker image build? I'll have to tag a new release to fix this -- which we are planning shortly.

mictadlo commented 5 years ago

I used docker run -it --rm -v $PWD:/home/linuxbrew/data nextgenusfs/funannotate, so I do not think I got any errors during the docker build.

nextgenusfs commented 5 years ago

You need to follow the directions here: https://funannotate.readthedocs.io/en/latest/docker.html#docker. The base image nextgenusfs/funannotate requires a few more steps by the user due to licensing issues. Those steps also set up the databases in the docker image. I'm pushing v1.6.0 to the docker cloud shortly, which should fix several issues, one being a broken link to a database.

mictadlo commented 5 years ago

It appears that RepBase is no longer free for academics. Will your pipeline also work without RepBase?

nextgenusfs commented 5 years ago

Repeat masking is currently done with the funannotate mask command, which uses RepeatMasker. You can mask with any other software. Funannotate predict will warn you if your assembly is not masked, but you can bypass that warning.
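If the assembly was masked with another tool, a quick way to see whether it is soft-masked at all is to count lowercase bases. The snippet below is only a rough sketch: `softmasked_fraction` is a hypothetical helper and is not how funannotate performs its own check.

```python
import io
from typing import TextIO


def softmasked_fraction(handle: TextIO) -> float:
    """Return the fraction of lowercase (soft-masked) bases in a FASTA stream."""
    total = 0
    lower = 0
    for line in handle:
        if line.startswith(">"):  # skip FASTA header lines
            continue
        seq = line.strip()
        total += len(seq)
        lower += sum(1 for base in seq if base.islower())
    return lower / total if total else 0.0


# Tiny in-memory example; for a real assembly, pass an open file handle.
demo = io.StringIO(">chr1\nACGTacgtAC\nNNNNacgtac\n")
print(f"{softmasked_fraction(demo):.2%}")  # 50.00%
```

A value near zero suggests the FASTA file is unmasked or hard-masked (repeats replaced by N) rather than soft-masked.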