Closed gskrasnov closed 6 years ago
Ok, thanks. You may have better luck troubleshooting if you run Braker outside of funannotate, as sometimes the error messages get swallowed up by the logging in funannotate. Did it run with the added environmental variables?
I'm really frustrated with BRAKER/AUGUSTUS as the installation is a serious pain and every version has different options/dependencies. I've thought about packaging an "augustus-lite" version that contains all the parts that I use for funannotate, but I haven't found the time yet. I've asked the developers for help numerous times and never have gotten a response.
And I think I need to update the code for you to be able to run an assembly with that many contigs -> I think that NCBI's tbl2asn will crash. This was a previous issue I hadn't addressed yet https://github.com/nextgenusfs/funannotate/issues/66.
No, this did not helped. The same error. Next, I copied BRAKER perl script to augustus location, and this worked:
cp /home/linuxbrew/.linuxbrew/Cellar/braker/1.11/libexec/*.pl /home/linuxbrew/.linuxbrew/Cellar/augustus/3.2.2_4/libexec/scripts/
cp /home/linuxbrew/.linuxbrew/Cellar/braker/1.11/libexec/*.pm /home/linuxbrew/.linuxbrew/Cellar/augustus/3.2.2_4/libexec/scripts/
However another error has encountered:
NEXT STEP: check files and settings
WARNING: The command line option --AUGUSTUS_BIN_PATH was not used. This is ok if binaries of augustus reside in ../bin relative to AUGUSTUS_CONFIG_PATH. Otherwise, please specify --AUGUSTUS_BIN_PATH
WARNING: The command line option --AUGUSTUS_SCRIPTS_PATH was not used. This is ok if scripts of augustus reside in ../scripts relative to AUGUSTUS_CONFIG_PATH or AUGUSTUS_BIN_PATH (will first look for them in AUGUSTUS_BIN_PATH). Otherwise, please specify --AUGUSTUS_SCRIPTS_PATH
NEXT STEP: check options
ERROR:/home/linuxbrew/.linuxbrew/opt/augustus/libexec/config/species/arabidopsis already exists. Choose another species name, delete this directory or use the existing species with the option --useexisting.
... options check complete.
I'm analyzing Linum usitatissimum genome and specified arabidopsis as augustus pre-trained species (--augustus_species arabidopsis). Should I do this? Arabidopsis is not very close phylogenetically...
Since you are going to train Augustus, you can leave this blank. So try without the --augustus_species
flag. Let me know if it runs that way, I will make a check to disable that if so. Or you can specify a unique value for that flag, basically it is going to create a folder with that name. If you don't enter anything, then it generates a name from the --species
flag. i.e. you could pass --augustus_species linum_usitatissimum
which would be equivlement to just passing --species "Linum usitatissimum"
.
And I think I need to update the code for you to be able to run an assembly with that many contigs -> I think that NCBI's tbl2asn will crash. This was a previous issue I hadn't addressed yet #66.
Don't worry Jon, I will crop my assembly to min.contig.len=5000
I'm really frustrated with BRAKER/AUGUSTUS as the installation is a serious pain and every version has different options/dependencies
Yes, Augustus may be problematic to install. For example, one should have a specific version of boost libraries installed.... I followed the funannotate instructions and alsmost everything wenk ok
Now BRAKER is running. I will report run status later
funannotate predict has completed:
george@CCUserver:~/Flax-2017.Annotation$ funannotate predict -i 901.masurca.scf.fasta \
-o 901.fun_out --species "Linum usitatissimum" \
--pasa_gff FLax_901_GG_pasa.assemblies.fasta.transdecoder.genome.gff3 \
--rna_bam STAR.901.27+29+30.Aligned.out.sorted.bam \
--transcript_evidence Flax_27+29+30.901.GENOME.GUIDED.fasta --cpus 60 \
--repeatmasker_species arabidopsis --busco_seed_species arabidopsis \
--busco_db embryophyta --organism other --ploidy 2 --max_intronlen 20000
-------------------------------------------------------
[11:09:03 PM]: OS: linux2, 64 cores, ~ 1057 GB RAM. Python: 2.7.12
[11:09:03 PM]: Running funannotate v0.7.2
[11:09:06 PM]: AUGUSTUS (3.2.2) detected, version seems to be compatible with BRAKER1 and BUSCO
[11:11:16 PM]: Masked genome: 76,180 scaffolds; 357,782,925 bp; 0.56% repeats masked
[11:11:19 PM]: Using existing transcript evidence alignments
[11:11:19 PM]: 103,368 transcripts aligned with GMAP
[11:11:35 PM]: Using existing protein evidence alignments
[11:11:35 PM]: Now launching BRAKER to train GeneMark and Augustus
[03:43:13 AM]: Pulling out high quality Augustus predictions
[03:43:16 AM]: Found 0 high quality predictions from Augustus (>90% exon evidence)
[03:43:45 AM]: 274,218 total gene models from all sources
[03:43:45 AM]: Setting up EVM partitions
[04:03:33 AM]: Generating EVM command list
[04:03:34 AM]: Running EVM commands with 59 CPUs
[05:38:39 AM]: Combining partitioned EVM outputs
[05:38:49 AM]: Converting EVM output to GFF3
[07:18:52 AM]: Collecting all EVM results
[07:19:03 AM]: 77,011 total gene models from EVM
[07:19:03 AM]: Predicting tRNAs
[07:58:20 AM]: Found 1,187 tRNA gene models
[07:58:20 AM]: Merging EVM output with tRNAscan output
[07:58:20 AM]: Reformatting GFF file using GAG
[08:17:49 AM]: 78,198 total gene models
[08:17:49 AM]: Filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[08:33:16 AM]: 66,580 gene models remaining
[08:33:16 AM]: Converting to preliminary Genbank format
[09:41:20 AM]: Cleaning models flagged by tbl2asn
[09:43:01 AM]: 66,405 gene models remaining
[09:43:01 AM]: Re-naming gene models
[10:00:48 AM]: Converting to final Genbank format
[10:53:33 AM]: Collecting final annotation files
[10:53:34 AM]: 66,405 gene models
[10:54:25 AM]: Funannotate predict is finished, output files are in the 901.fun_out/predict_results folder
[10:54:25 AM]: Note, you should fix any tbl2asn errors now before running functional annotation.
tbl2asn run status seems ok....
Here is the output of funannotate predict (901.fun_out directory). You can find there all the log files there.
I'm confused a bit with this:
[03:43:16 AM]: Found 0 high quality predictions from Augustus (>90% exon evidence)
Thanks in advance!
Great, seems like it worked. Thanks for the logfiles, that helps. One thing I would be concerned about is why did tblastn find so few protein hits? It looks like you didn't pass --protein_evidence
which means that it would automatically use UniProt/SwissProt. Out of 500,000 proteins, tblastn pre-filter only found 349 potential alignments, that doesn't seem right. This might indicate a problem with the assembly? Or perhaps the tetraploidy is throwing it off? I don't think this is a huge problem, given the RNAseq data that you have.
The 0 high quality Augustus predictions is part of the script that tries to pull out gene models that are highly consistent with the evidence passed and then attempts to "weight" these gene models 5X higher than the rest of the ab initial predictions in the EvidenceModeler step. I see from your logfiles, that Braker in this case only uses introns as hints when running Augustus, thus you aren't getting any "high quality" hits. I don't remember this being the case on my machine, but I will look over the results again. I might need to change the sensitivity when Braker is used. However, for this particular annotation I wouldn't be concerned about it because you have PASA gene models, which will be "weighted" 10X the ab initio gene models -> so I don't think it would change the annotation much if at all.
You'll also note that the script didn't use BUSCO at all, this is because you passed the BAM file, which then uses the RNA-seq data to train. BUSCO is used when you don't have any RNAseq data.
Thank you very much for your detailed explanation.
Yes, I did not pass --protein_evidence
parameter.
It's very strange that tblastn found only 349 alignments... I tested my assembly quality with BUSCO and got almost nice results:
C:91.6%[S:32.1%,D:59.5%],F:2.4%,M:6.0%,n:1440
1319 Complete BUSCOs (C)
462 Complete and single-copy BUSCOs (S)
857 Complete and duplicated BUSCOs (D)
35 Fragmented BUSCOs (F)
86 Missing BUSCOs (M)
1440 Total BUSCO groups searched
I think duplicated BUSCOs are coming from tetraploidy of flax.
Out of 65.000 proteins 14.500 has been annotated using Blastp search of UniProt DB and 10,728 - using HMMer search of PFAM domains. I don't know if this is ok? The most of annotations came from eggnogg mapper. It worked greatly (strNOG, Streptophyta HMM database).
Here is funannotate annotate output:
funannotate annotate -i 901.fun_out --antismash ./plantismash.secondarymetabolites.org.results/471ac0f1-bb0f-4b49-a8c7-6a5ae7f5102f/jcf7180002550916.final.gbk --cpus 60 --eggnog 901.EGGNOGG.mapper.results.emapper.annotations --sbt template.sbt
-------------------------------------------------------
[01:07:47 AM]: OS: linux2, 64 cores, ~ 1057 GB RAM. Python: 2.7.12
[01:07:47 AM]: Running funannotate v0.7.2
[01:07:48 AM]: Parsing input files
[01:09:26 AM]: 65,288 protein records loaded
[01:09:26 AM]: Running HMMer search of PFAM domains
[02:15:58 AM]: 10,728 annotations added
[02:15:58 AM]: Running Blastp search of UniProt DB
[03:30:25 AM]: 14,497 annotations added
[03:30:25 AM]: Running Blastp search of MEROPS protease DB
[03:34:50 AM]: 1,541 annotations added
[03:34:50 AM]: Annotating CAZYmes using dbCAN
[04:36:35 AM]: 2,462 annotations added
[04:36:35 AM]: Parsing EggNog Annotations
[04:36:35 AM]: 94,912 annotations added
[04:36:35 AM]: Annotating proteins with BUSCO dikarya models
[04:37:43 AM]: 2,087 annotations added
[04:37:43 AM]: Skipping phobius predictions, try funannotate remote -m phobius
[04:37:43 AM]: Skipping secretome: neither SignalP nor Phobius installed
[04:37:43 AM]: 0 secretome and 0 transmembane annotations added
[04:37:43 AM]: InterProScan error, 901.fun_out/annotate_misc/iprscan.xml is empty, or no XML file passed via --iprscan. Annotation will be lacking.
[04:37:44 AM]: Now parsing antiSMASH results, finding SM clusters
[04:38:01 AM]: Found 88 clusters, 409 biosynthetic enyzmes, and 0 smCOGs predicted by antiSMASH
[04:38:04 AM]: Found 399 duplicated annotations, adding 127,206 valid annotations
[04:38:04 AM]: Adding annotations to GFF using GAG
[06:04:18 AM]: Fixing tRNA annotations in GenBank tbl file
[06:04:20 AM]: Converting to final Genbank format, good luck!.....
[06:57:01 AM]: Creating AGP file and corresponding contigs file
[06:57:15 AM]: Cross referencing SM cluster hits with MIBiG database
[06:57:26 AM]: Creating tab-delimited SM cluster output
[06:58:18 AM]: Writing genome annotation table.
[06:59:08 AM]: Funannotate annotate has completed successfully!
I updated funannotate results at the server, and here you can find annotate_misc/annotations.*.txt files
I also found that CDS annotation is not passed to mRNA and gene. For example:
gene complement(38646..40234)
/locus_tag="FUN_001186"
mRNA complement(join(38646..38801,38963..39025,39113..39279,
39372..39412,39503..39588,39685..39809,40018..40234))
/locus_tag="FUN_001186"
/product="hypothetical protein"
CDS complement(join(38646..38801,38963..39025,39113..39279,
39372..39412,39503..39588,39685..39809,40018..40234))
/locus_tag="FUN_001186"
/note="Rhomboid-like protein 11, chloroplastic;
MEROPS:MER003940;
EggNog:ENOG411E0AT;
COG:T"
/codon_start=1
/product="hypothetical protein"
/protein_id="ncbi:FUN_001186-T1"
/translation="MMMHLQLQLHHQYPPLPNPSYPRLFFPSTARSFFPKHLAFTAPR
LAQPLFSHLLPFSSHAPLSRFICQSNDTDIPRQLELGKPEGRRKPERKVNGIFWIILV
NLGLFVADHFFQVRAIKSLYLYHNWPAWYQFVTATFCHASWEHLSSNLFFLYIFGKLV
EEEEGTFGLWFSYILTGVGANLVSWLVLPRNSVSVGASGAVFGLFAISVLVKLTWDWR
KIIEVLILGQFVVEKVMEAAQSSAAMSGSLGGAGQGVNHIAHLSGALIGVILVWLVSK
IPAAPTDD"
CDS is annotated as "Rhomboid-like protein 11, chloroplastic;"
but gene and mRNA have only /locus_tag="FUN_001186"
In 'classic' GenBank one can find an entry for GAPDH:
gene 1..3971
/gene="GAPDH"
/gene_synonym="G3PD; GAPD; HEL-S-162eP"
/note="glyceraldehyde-3-phosphate dehydrogenase; Derived
by automated computational analysis using gene prediction
method: BestRefSeq."
/db_xref="GeneID:2597"
/db_xref="HGNC:HGNC:4141"
/db_xref="HPRD:00713"
/db_xref="MIM:138400"
mRNA join(1..257,406..457,2090..2189,2280..2386,2516..2606,
2697..2812,2905..2986,3180..3592,3697..3971)
/gene="GAPDH"
/gene_synonym="G3PD; GAPD; HEL-S-162eP"
/product="glyceraldehyde-3-phosphate dehydrogenase,
transcript variant 3"
/note="Derived by automated computational analysis using
gene prediction method: BestRefSeq."
/transcript_id="NM_001289745.1"
/db_xref="GeneID:2597"
/db_xref="HGNC:HGNC:4141"
/db_xref="HPRD:00713"
/db_xref="MIM:138400"
CDS join(429..457,2090..2189,2280..2386,2516..2606,2697..2812,
2905..2986,3180..3592,3697..3766)
/gene="GAPDH"
/gene_synonym="G3PD; GAPD; HEL-S-162eP"
/note="isoform 1 is encoded by transcript variant 3;
Derived by automated computational analysis using gene
prediction method: BestRefSeq."
/codon_start=1
/product="glyceraldehyde-3-phosphate dehydrogenase isoform
1"
/protein_id="NP_001276674.1"
/db_xref="CCDS:CCDS8549.1"
/db_xref="GeneID:2597"
/db_xref="HGNC:HGNC:4141"
/db_xref="HPRD:00713"
/db_xref="MIM:138400"
/translation="MGKVKVGVNGFGRIGRLVTRAAFNSGKVDIVAINDPFIDLNYMV
YMFQYDSTHGKFHGTVKAENGKLVINGNPITIFQERDPSKIKWGDAGAEYVVESTGVF
TTMEKAGAHLQGGAKRVIISAPSADAPMFVMGVNHEKYDNSLKIISNASCTTNCLAPL
AKVIHDNFGIVEGLMTTVHAITATQKTVDGPSGKLWRDGRGALQNIIPASTGAAKAVG
KVIPELNGKLTGMAFRVPTANVSVVDLTCRLEKPAKYDDIKKVVKQASEGPLKGILGY
TEHQVVSSDFNSDTHSSTFDAGAGIALNDHFVKLISWYDNEFGYSNRVVDLMAHMASK
E"
Here, both gene and mRNA have the same annotation as CDS. I think, adding this feature would enhance funnannotate...
Hi George, Looks great! Glad everything is working. Per the GenBank format, the gen bank files are produced by tbl2asn and I think they are somewhat of a preliminary format compared to what NCBI will publish after it is processed internally. The whole impetus for writing funannotate was to make it easier to annotate a genome and submit it to NCBI. Therefore, as long as it is passing the tbl2asn genome checks, the final annotation will be done by NCBI. Having said that, the GBK files are fully functional. In most of the fungal genomes I've worked with, the CDS annotations are not duplicated in the mRNA field (since the annotations are actually based off of the protein translation and not the mRNA - which has made sense to me). Perhaps this is an example of how the format has "evolved" over the years. My goal is to make it as automated as possible to annotate and submit a genome, as prior to funannotate, it would take me weeks/months to manually annotate problematic gene models, etc.
But I appreciate the feedback, if there are some more tools, searches, etc that you feel would benefit the annotation, I'll do my best to try to add more functionality. I'm in the process of working out an automated Trinity/PASA upgrade of the gene models to correctly capture UTRs, as the current implementation doesn't do that at all. I'm also working on a curated database of gene product descriptions, currently all gene models are listed as 'hypothetical protein' because of the strict rules with naming genes and products in the annotation, its difficult to do this automatically.
Jon
Thank you, Jon
This is a wonderful pipeline greatly simplifies the process of genome annotation.
Is this OK that some EggNOG otholog annotations are not passed even to CDS? For example:
CDS join(5086..6132,6937..7557)
/locus_tag="FUN_000001"
/note="EggNog:ENOG411DQGU;
COG:Q"
/codon_start=1
/product="hypothetical protein"
/protein_id="ncbi:FUN_000001-T1"
In the EggNOG emapper output I see:
FUN_000001 3694.POPTR_0008s21950.1 4.6e-165 542.3 GO:0005575,GO:0016020 map00363,map00624,map00627,map00903,map00945,map01100,map01110,map01120 virNOG[6] 1BFD4@strNOG,1DQGU@virNOG,COG2124@NOG,KOG0156@euNOG 1BFD4|4.20823705792e-201|669.263427734 Q cytochrome P450
e.g. annotation "cytochrome P450" is not passed to CDS. However this is very "high-level" annotation: cytochrome P450 family consists of dozens of members.
How can I cite funannotate in a paper?
George
Right, that's what I said I was working on -> a manually curated reference DB to extract gene names and product definitions. If you try to capture all of those, you will end up with hundreds of models that don't pass NCBI tbl2asn specs. My current plan is to create a tsv file of valid gene names with product definitions, so that when you use eggnog mapper it will pull a pre-screened product definition that will pass tbl2asn. Hopefully this will be available in next major release.
Working on a manuscript, but nothing is published yet. Right now you can cite the GitHub repo, I'll update it when citation is available.
Thank you very much, Jon!
Dear Jon,
I have launched funannotate predict with the following parameters, and it has failed:
There is no log file ./901.fun_out/logfiles/augustus-parallel.log However I found log file in ./901.fun_out/logfiles/braker.log:
I have installed linuxbrew in /home/linuxbrew/.linuxbrew. My user's homedir is /mnt/raid/illumina/geo
AUGUSTUS_BIN_PATH , AUGUSTUS_SCRIPTS_PATH , AUGUSTUS_CONFIG_PATH variables are empty
I found filterGenemark.pl script in the BRAKER's location (/home/linuxbrew/.linuxbrew/Cellar/braker/1.11/libexec/filterGenemark.pl), not augustus (/home/linuxbrew/.linuxbrew/Cellar/augustus/3.2.2_4/libexec/scripts)
Now I added these string to my .bashrc:
and run funannotate predict once again.....