nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
300 stars 82 forks source link

Augustus/BRAKER failed #109

Closed gskrasnov closed 6 years ago

gskrasnov commented 6 years ago

Dear Jon,

I have launched funannotate predict with the following parameters, and it has failed:

funannotate predict -i 901.masurca.scf.fasta -o 901.fun_out --species "Linum usitatissimum" --pasa_gff FLax_901_GG_pasa.assemblies.fasta.transdecoder.genome.gff3 --rna_bam STAR.901.27+29+30.Aligned.out.sorted.bam  --transcript_evidence Flax_27+29+30.901.GENOME.GUIDED.fasta --cpus 60 --augustus_species arabidopsis --repeatmasker_species arabidopsis  --busco_seed_species arabidopsis --busco_db embryophyta --organism other --ploidy 2 --max_intronlen 20000
-------------------------------------------------------
[08:00:35 PM]: OS: linux2, 64 cores, ~ 1057 GB RAM. Python: 2.7.12
[08:00:36 PM]: Running funannotate v0.7.2
[08:00:36 PM]: Downloading embryophyta busco models
[08:00:47 PM]: Augustus training set for arabidopsis already exists, thus funannotate will use those parameters.
                If you want to re-train, provide a unique name for the --augustus_species argument
[08:00:49 PM]: AUGUSTUS (3.2.2) detected, version seems to be compatible with BRAKER1 and BUSCO
[08:02:23 PM]: Soft-masking: running RepeatMasker using arabidopsis species
[08:40:33 PM]: Masked genome: 76,180 scaffolds; 357,782,925 bp; 0.56% repeats masked
[08:40:33 PM]: Aligning transcript evidence to genome with GMAP
[08:53:12 PM]: 103,368 transcripts aligned with GMAP
[08:53:12 PM]: Aligning transcript evidence to genome with BLAT
[10:12:58 PM]: 118,239 filtered BLAT alignments
[10:13:10 PM]: Mapping proteins to genome using tBlastn/Exonerate
[10:13:11 PM]: Using 543,411 proteins as queries
[10:13:11 PM]: Running pre-filter tBLASTN step
[10:14:14 PM]: Found 349 preliminary alignments
[10:14:37 PM]: Exonerate finished: found 19 alignments
[10:14:39 PM]: arabidopsis as already been trained, using existing parameters
[10:14:39 PM]: Now launching BRAKER to train GeneMark and Augustus
[10:14:39 PM]: GeneMark predictions failed, proceeding with only Augustus
[10:14:39 PM]: Augustus prediction failed, check `logfiles/augustus-parallel.log`

There is no log file ./901.fun_out/logfiles/augustus-parallel.log However I found log file in ./901.fun_out/logfiles/braker.log:

NEXT STEP: check files and settings
WARNING: The command line option --AUGUSTUS_BIN_PATH was not used. This is ok if binaries of augustus reside in ../bin relative to AUGUSTUS_CONFIG_PATH. Otherwise, please specify --AUGUSTUS_BIN_PATH
WARNING: The command line option --AUGUSTUS_SCRIPTS_PATH was not used. This is ok if scripts of augustus reside in ../scripts relative to AUGUSTUS_CONFIG_PATH or AUGUSTUS_BIN_PATH (will first look for them in AUGUSTUS_BIN_PATH). Otherwise, please specify --AUGUSTUS_SCRIPTS_PATH
Error: found neither /home/linuxbrew/.linuxbrew/Cellar/augustus/3.2.2_4/libexec/scripts/filterGenemark.pl nor /home/linuxbrew/.linuxbrew/Cellar/augustus/3.2.2_4/libexec/scripts/filterGenemark.pl nor /home/linuxbrew/.linuxbrew/Cellar/augustus/3.2.2_4/libexec/scripts/filterGenemark.pl nor /home/linuxbrew/.linuxbrew/bin/filterGenemark.pl!
Please Check the environment variables AUGUSTUS_CONFIG_PATH and command line options AUGUSTUS_BIN_PATH and AUGUSTUS_SCRIPTS_PATH or install AUGUSTUS, again!

I have installed linuxbrew in /home/linuxbrew/.linuxbrew. My user's homedir is /mnt/raid/illumina/geo

AUGUSTUS_BIN_PATH , AUGUSTUS_SCRIPTS_PATH , AUGUSTUS_CONFIG_PATH variables are empty

I found filterGenemark.pl script in the BRAKER's location (/home/linuxbrew/.linuxbrew/Cellar/braker/1.11/libexec/filterGenemark.pl), not augustus (/home/linuxbrew/.linuxbrew/Cellar/augustus/3.2.2_4/libexec/scripts)

Now I added these string to my .bashrc:

export AUGUSTUS_SCRIPTS_PATH=/home/linuxbrew/.linuxbrew/Cellar/augustus/3.2.2_4/libexec/scripts:/home/linuxbrew/.linuxbrew/Cellar/braker/1.11/libexec
export AUGUSTUS_BIN_PATH=/home/linuxbrew/.linuxbrew/Cellar/augustus/3.2.2_4/libexec/bin:/home/linuxbrew/.linuxbrew/Cellar/augustus/3.2.2_4/bin:/home/linuxbrew/.linuxbrew/Cellar/braker/1.11/bin

and run funannotate predict once again.....

nextgenusfs commented 6 years ago

Ok, thanks. You may have better luck troubleshooting if you run Braker outside of funannotate, as sometimes the error messages get swallowed up by the logging in funannotate. Did it run with the added environmental variables?

I'm really frustrated with BRAKER/AUGUSTUS as the installation is a serious pain and every version has different options/dependencies. I've thought about packaging an "augustus-lite" version that contains all the parts that I use for funannotate, but I haven't found the time yet. I've asked the developers for help numerous times and never have gotten a response.

And I think I need to update the code for you to be able to run an assembly with that many contigs -> I think that NCBI's tbl2asn will crash. This was a previous issue I hadn't addressed yet https://github.com/nextgenusfs/funannotate/issues/66.

gskrasnov commented 6 years ago

No, this did not helped. The same error. Next, I copied BRAKER perl script to augustus location, and this worked:

cp /home/linuxbrew/.linuxbrew/Cellar/braker/1.11/libexec/*.pl /home/linuxbrew/.linuxbrew/Cellar/augustus/3.2.2_4/libexec/scripts/
cp /home/linuxbrew/.linuxbrew/Cellar/braker/1.11/libexec/*.pm /home/linuxbrew/.linuxbrew/Cellar/augustus/3.2.2_4/libexec/scripts/

However another error has encountered:

NEXT STEP: check files and settings
WARNING: The command line option --AUGUSTUS_BIN_PATH was not used. This is ok if binaries of augustus reside in ../bin relative to AUGUSTUS_CONFIG_PATH. Otherwise, please specify --AUGUSTUS_BIN_PATH
WARNING: The command line option --AUGUSTUS_SCRIPTS_PATH was not used. This is ok if scripts of augustus reside in ../scripts relative to AUGUSTUS_CONFIG_PATH or AUGUSTUS_BIN_PATH (will first look for them in AUGUSTUS_BIN_PATH). Otherwise, please specify --AUGUSTUS_SCRIPTS_PATH
NEXT STEP: check options
ERROR:/home/linuxbrew/.linuxbrew/opt/augustus/libexec/config/species/arabidopsis already exists. Choose another species name, delete this directory or use the existing species with the option --useexisting.
... options check complete.

I'm analyzing Linum usitatissimum genome and specified arabidopsis as augustus pre-trained species (--augustus_species arabidopsis). Should I do this? Arabidopsis is not very close phylogenetically...

nextgenusfs commented 6 years ago

Since you are going to train Augustus, you can leave this blank. So try without the --augustus_species flag. Let me know if it runs that way, I will make a check to disable that if so. Or you can specify a unique value for that flag, basically it is going to create a folder with that name. If you don't enter anything, then it generates a name from the --species flag. i.e. you could pass --augustus_species linum_usitatissimum which would be equivlement to just passing --species "Linum usitatissimum".

gskrasnov commented 6 years ago

And I think I need to update the code for you to be able to run an assembly with that many contigs -> I think that NCBI's tbl2asn will crash. This was a previous issue I hadn't addressed yet #66.

Don't worry Jon, I will crop my assembly to min.contig.len=5000

I'm really frustrated with BRAKER/AUGUSTUS as the installation is a serious pain and every version has different options/dependencies

Yes, Augustus may be problematic to install. For example, one should have a specific version of boost libraries installed.... I followed the funannotate instructions and alsmost everything wenk ok

Now BRAKER is running. I will report run status later

gskrasnov commented 6 years ago

funannotate predict has completed:

george@CCUserver:~/Flax-2017.Annotation$ funannotate predict -i 901.masurca.scf.fasta \
   -o 901.fun_out --species "Linum usitatissimum" \
   --pasa_gff FLax_901_GG_pasa.assemblies.fasta.transdecoder.genome.gff3 \
   --rna_bam STAR.901.27+29+30.Aligned.out.sorted.bam  \
   --transcript_evidence Flax_27+29+30.901.GENOME.GUIDED.fasta --cpus 60 \
   --repeatmasker_species arabidopsis  --busco_seed_species arabidopsis \
   --busco_db embryophyta --organism other --ploidy 2 --max_intronlen 20000
-------------------------------------------------------
[11:09:03 PM]: OS: linux2, 64 cores, ~ 1057 GB RAM. Python: 2.7.12
[11:09:03 PM]: Running funannotate v0.7.2
[11:09:06 PM]: AUGUSTUS (3.2.2) detected, version seems to be compatible with BRAKER1 and BUSCO
[11:11:16 PM]: Masked genome: 76,180 scaffolds; 357,782,925 bp; 0.56% repeats masked
[11:11:19 PM]: Using existing transcript evidence alignments
[11:11:19 PM]: 103,368 transcripts aligned with GMAP
[11:11:35 PM]: Using existing protein evidence alignments
[11:11:35 PM]: Now launching BRAKER to train GeneMark and Augustus
[03:43:13 AM]: Pulling out high quality Augustus predictions
[03:43:16 AM]: Found 0 high quality predictions from Augustus (>90% exon evidence)
[03:43:45 AM]: 274,218 total gene models from all sources
[03:43:45 AM]: Setting up EVM partitions
[04:03:33 AM]: Generating EVM command list
[04:03:34 AM]: Running EVM commands with 59 CPUs
[05:38:39 AM]: Combining partitioned EVM outputs
[05:38:49 AM]: Converting EVM output to GFF3
[07:18:52 AM]: Collecting all EVM results
[07:19:03 AM]: 77,011 total gene models from EVM
[07:19:03 AM]: Predicting tRNAs
[07:58:20 AM]: Found 1,187 tRNA gene models
[07:58:20 AM]: Merging EVM output with tRNAscan output
[07:58:20 AM]: Reformatting GFF file using GAG
[08:17:49 AM]: 78,198 total gene models
[08:17:49 AM]: Filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[08:33:16 AM]: 66,580 gene models remaining
[08:33:16 AM]: Converting to preliminary Genbank format
[09:41:20 AM]: Cleaning models flagged by tbl2asn
[09:43:01 AM]: 66,405 gene models remaining
[09:43:01 AM]: Re-naming gene models
[10:00:48 AM]: Converting to final Genbank format
[10:53:33 AM]: Collecting final annotation files
[10:53:34 AM]: 66,405 gene models
[10:54:25 AM]: Funannotate predict is finished, output files are in the 901.fun_out/predict_results folder
[10:54:25 AM]: Note, you should fix any tbl2asn errors now before running functional annotation.

tbl2asn run status seems ok....

Here is the output of funannotate predict (901.fun_out directory). You can find there all the log files there.

I'm confused a bit with this:

[03:43:16 AM]: Found 0 high quality predictions from Augustus (>90% exon evidence)

Thanks in advance!

nextgenusfs commented 6 years ago

Great, seems like it worked. Thanks for the logfiles, that helps. One thing I would be concerned about is why did tblastn find so few protein hits? It looks like you didn't pass --protein_evidence which means that it would automatically use UniProt/SwissProt. Out of 500,000 proteins, tblastn pre-filter only found 349 potential alignments, that doesn't seem right. This might indicate a problem with the assembly? Or perhaps the tetraploidy is throwing it off? I don't think this is a huge problem, given the RNAseq data that you have.

The 0 high quality Augustus predictions is part of the script that tries to pull out gene models that are highly consistent with the evidence passed and then attempts to "weight" these gene models 5X higher than the rest of the ab initial predictions in the EvidenceModeler step. I see from your logfiles, that Braker in this case only uses introns as hints when running Augustus, thus you aren't getting any "high quality" hits. I don't remember this being the case on my machine, but I will look over the results again. I might need to change the sensitivity when Braker is used. However, for this particular annotation I wouldn't be concerned about it because you have PASA gene models, which will be "weighted" 10X the ab initio gene models -> so I don't think it would change the annotation much if at all.

You'll also note that the script didn't use BUSCO at all, this is because you passed the BAM file, which then uses the RNA-seq data to train. BUSCO is used when you don't have any RNAseq data.

gskrasnov commented 6 years ago

Thank you very much for your detailed explanation.

Yes, I did not pass --protein_evidence parameter. It's very strange that tblastn found only 349 alignments... I tested my assembly quality with BUSCO and got almost nice results:

    C:91.6%[S:32.1%,D:59.5%],F:2.4%,M:6.0%,n:1440

    1319    Complete BUSCOs (C)
    462 Complete and single-copy BUSCOs (S)
    857 Complete and duplicated BUSCOs (D)
    35  Fragmented BUSCOs (F)
    86  Missing BUSCOs (M)
    1440    Total BUSCO groups searched

I think duplicated BUSCOs are coming from tetraploidy of flax.

Out of 65.000 proteins 14.500 has been annotated using Blastp search of UniProt DB and 10,728 - using HMMer search of PFAM domains. I don't know if this is ok? The most of annotations came from eggnogg mapper. It worked greatly (strNOG, Streptophyta HMM database).

Here is funannotate annotate output:

funannotate annotate -i 901.fun_out --antismash ./plantismash.secondarymetabolites.org.results/471ac0f1-bb0f-4b49-a8c7-6a5ae7f5102f/jcf7180002550916.final.gbk --cpus 60 --eggnog 901.EGGNOGG.mapper.results.emapper.annotations  --sbt template.sbt
-------------------------------------------------------
[01:07:47 AM]: OS: linux2, 64 cores, ~ 1057 GB RAM. Python: 2.7.12
[01:07:47 AM]: Running funannotate v0.7.2
[01:07:48 AM]: Parsing input files
[01:09:26 AM]: 65,288 protein records loaded
[01:09:26 AM]: Running HMMer search of PFAM domains
[02:15:58 AM]: 10,728 annotations added
[02:15:58 AM]: Running Blastp search of UniProt DB
[03:30:25 AM]: 14,497 annotations added
[03:30:25 AM]: Running Blastp search of MEROPS protease DB
[03:34:50 AM]: 1,541 annotations added
[03:34:50 AM]: Annotating CAZYmes using dbCAN
[04:36:35 AM]: 2,462 annotations added
[04:36:35 AM]: Parsing EggNog Annotations
[04:36:35 AM]: 94,912 annotations added
[04:36:35 AM]: Annotating proteins with BUSCO dikarya models
[04:37:43 AM]: 2,087 annotations added
[04:37:43 AM]: Skipping phobius predictions, try funannotate remote -m phobius
[04:37:43 AM]: Skipping secretome: neither SignalP nor Phobius installed
[04:37:43 AM]: 0 secretome and 0 transmembane annotations added
[04:37:43 AM]: InterProScan error, 901.fun_out/annotate_misc/iprscan.xml is empty, or no XML file passed via --iprscan. Annotation will be lacking.
[04:37:44 AM]: Now parsing antiSMASH results, finding SM clusters
[04:38:01 AM]: Found 88 clusters, 409 biosynthetic enyzmes, and 0 smCOGs predicted by antiSMASH
[04:38:04 AM]: Found 399 duplicated annotations, adding 127,206 valid annotations
[04:38:04 AM]: Adding annotations to GFF using GAG
[06:04:18 AM]: Fixing tRNA annotations in GenBank tbl file
[06:04:20 AM]: Converting to final Genbank format, good luck!.....
[06:57:01 AM]: Creating AGP file and corresponding contigs file
[06:57:15 AM]: Cross referencing SM cluster hits with MIBiG database
[06:57:26 AM]: Creating tab-delimited SM cluster output
[06:58:18 AM]: Writing genome annotation table.
[06:59:08 AM]: Funannotate annotate has completed successfully!

I updated funannotate results at the server, and here you can find annotate_misc/annotations.*.txt files

gskrasnov commented 6 years ago

I also found that CDS annotation is not passed to mRNA and gene. For example:

     gene            complement(38646..40234)
                     /locus_tag="FUN_001186"
     mRNA            complement(join(38646..38801,38963..39025,39113..39279,
                     39372..39412,39503..39588,39685..39809,40018..40234))
                     /locus_tag="FUN_001186"
                     /product="hypothetical protein"
     CDS             complement(join(38646..38801,38963..39025,39113..39279,
                     39372..39412,39503..39588,39685..39809,40018..40234))
                     /locus_tag="FUN_001186"
                     /note="Rhomboid-like protein 11, chloroplastic;
                     MEROPS:MER003940;
                     EggNog:ENOG411E0AT;
                     COG:T"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="ncbi:FUN_001186-T1"
                     /translation="MMMHLQLQLHHQYPPLPNPSYPRLFFPSTARSFFPKHLAFTAPR
                     LAQPLFSHLLPFSSHAPLSRFICQSNDTDIPRQLELGKPEGRRKPERKVNGIFWIILV
                     NLGLFVADHFFQVRAIKSLYLYHNWPAWYQFVTATFCHASWEHLSSNLFFLYIFGKLV
                     EEEEGTFGLWFSYILTGVGANLVSWLVLPRNSVSVGASGAVFGLFAISVLVKLTWDWR
                     KIIEVLILGQFVVEKVMEAAQSSAAMSGSLGGAGQGVNHIAHLSGALIGVILVWLVSK
                     IPAAPTDD"

CDS is annotated as "Rhomboid-like protein 11, chloroplastic;" but gene and mRNA have only /locus_tag="FUN_001186"

In 'classic' GenBank one can find an entry for GAPDH:

gene            1..3971
                     /gene="GAPDH"
                     /gene_synonym="G3PD; GAPD; HEL-S-162eP"
                     /note="glyceraldehyde-3-phosphate dehydrogenase; Derived
                     by automated computational analysis using gene prediction
                     method: BestRefSeq."
                     /db_xref="GeneID:2597"
                     /db_xref="HGNC:HGNC:4141"
                     /db_xref="HPRD:00713"
                     /db_xref="MIM:138400"
     mRNA            join(1..257,406..457,2090..2189,2280..2386,2516..2606,
                     2697..2812,2905..2986,3180..3592,3697..3971)
                     /gene="GAPDH"
                     /gene_synonym="G3PD; GAPD; HEL-S-162eP"
                     /product="glyceraldehyde-3-phosphate dehydrogenase,
                     transcript variant 3"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefSeq."
                     /transcript_id="NM_001289745.1"
                     /db_xref="GeneID:2597"
                     /db_xref="HGNC:HGNC:4141"
                     /db_xref="HPRD:00713"
                     /db_xref="MIM:138400"
     CDS             join(429..457,2090..2189,2280..2386,2516..2606,2697..2812,
                     2905..2986,3180..3592,3697..3766)
                     /gene="GAPDH"
                     /gene_synonym="G3PD; GAPD; HEL-S-162eP"
                     /note="isoform 1 is encoded by transcript variant 3;
                     Derived by automated computational analysis using gene
                     prediction method: BestRefSeq."
                     /codon_start=1
                     /product="glyceraldehyde-3-phosphate dehydrogenase isoform
                     1"
                     /protein_id="NP_001276674.1"
                     /db_xref="CCDS:CCDS8549.1"
                     /db_xref="GeneID:2597"
                     /db_xref="HGNC:HGNC:4141"
                     /db_xref="HPRD:00713"
                     /db_xref="MIM:138400"
                     /translation="MGKVKVGVNGFGRIGRLVTRAAFNSGKVDIVAINDPFIDLNYMV
                     YMFQYDSTHGKFHGTVKAENGKLVINGNPITIFQERDPSKIKWGDAGAEYVVESTGVF
                     TTMEKAGAHLQGGAKRVIISAPSADAPMFVMGVNHEKYDNSLKIISNASCTTNCLAPL
                     AKVIHDNFGIVEGLMTTVHAITATQKTVDGPSGKLWRDGRGALQNIIPASTGAAKAVG
                     KVIPELNGKLTGMAFRVPTANVSVVDLTCRLEKPAKYDDIKKVVKQASEGPLKGILGY
                     TEHQVVSSDFNSDTHSSTFDAGAGIALNDHFVKLISWYDNEFGYSNRVVDLMAHMASK
                     E"

Here, both gene and mRNA have the same annotation as CDS. I think, adding this feature would enhance funnannotate...

nextgenusfs commented 6 years ago

Hi George, Looks great! Glad everything is working. Per the GenBank format, the gen bank files are produced by tbl2asn and I think they are somewhat of a preliminary format compared to what NCBI will publish after it is processed internally. The whole impetus for writing funannotate was to make it easier to annotate a genome and submit it to NCBI. Therefore, as long as it is passing the tbl2asn genome checks, the final annotation will be done by NCBI. Having said that, the GBK files are fully functional. In most of the fungal genomes I've worked with, the CDS annotations are not duplicated in the mRNA field (since the annotations are actually based off of the protein translation and not the mRNA - which has made sense to me). Perhaps this is an example of how the format has "evolved" over the years. My goal is to make it as automated as possible to annotate and submit a genome, as prior to funannotate, it would take me weeks/months to manually annotate problematic gene models, etc.

But I appreciate the feedback, if there are some more tools, searches, etc that you feel would benefit the annotation, I'll do my best to try to add more functionality. I'm in the process of working out an automated Trinity/PASA upgrade of the gene models to correctly capture UTRs, as the current implementation doesn't do that at all. I'm also working on a curated database of gene product descriptions, currently all gene models are listed as 'hypothetical protein' because of the strict rules with naming genes and products in the annotation, its difficult to do this automatically.

Jon

gskrasnov commented 6 years ago

Thank you, Jon

This is a wonderful pipeline greatly simplifies the process of genome annotation.

Is this OK that some EggNOG otholog annotations are not passed even to CDS? For example:

     CDS             join(5086..6132,6937..7557)
                     /locus_tag="FUN_000001"
                     /note="EggNog:ENOG411DQGU;
                     COG:Q"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="ncbi:FUN_000001-T1"

In the EggNOG emapper output I see:

FUN_000001  3694.POPTR_0008s21950.1 4.6e-165    542.3       GO:0005575,GO:0016020   map00363,map00624,map00627,map00903,map00945,map01100,map01110,map01120 virNOG[6]   1BFD4@strNOG,1DQGU@virNOG,COG2124@NOG,KOG0156@euNOG 1BFD4|4.20823705792e-201|669.263427734  Q   cytochrome P450

e.g. annotation "cytochrome P450" is not passed to CDS. However this is very "high-level" annotation: cytochrome P450 family consists of dozens of members.

How can I cite funannotate in a paper?

George

nextgenusfs commented 6 years ago

Right, that's what I said I was working on -> a manually curated reference DB to extract gene names and product definitions. If you try to capture all of those, you will end up with hundreds of models that don't pass NCBI tbl2asn specs. My current plan is to create a tsv file of valid gene names with product definitions, so that when you use eggnog mapper it will pull a pre-screened product definition that will pass tbl2asn. Hopefully this will be available in next major release.

Working on a manuscript, but nothing is published yet. Right now you can cite the GitHub repo, I'll update it when citation is available.

gskrasnov commented 6 years ago

Thank you very much, Jon!