nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

fail to run funannotate #173

Closed MinjieHu closed 6 years ago

MinjieHu commented 6 years ago

Dear Jon, after successfully running the sample data, I still fail to run funannotate on my own assembly. Following the tutorial, I ran

funannotate train -i xenia.contigs.fasta -o fun -s ../../../transcriptome_based_first3_genome/coral_RNA.fastq --species "Xenia" --cpus 18

It always stalls at

[01:50:15 PM]: Assembling 120,054 Trinity clusters using 17 CPUs, Progress: 10.20%

So I tried providing a BAM file instead, made by aligning my RNA-seq data to my assembly with STAR, and ran funannotate predict. It still failed, with this log:

Exonerate finished: found 3,474 alignments
[09:49:45 AM]: Failed exonerate alignments found, see files in p2g_17578/failed
[09:49:46 AM]: Now launching BRAKER to train GeneMark and Augustus
[03:12:25 PM]: GeneMark predictions failed. If you can run GeneMark outside of funannotate, then pass the results to --genemark_gtf, proceeding with only Augustus predictions.
[03:12:25 PM]: Augustus prediction failed, check `logfiles/augustus-parallel.log`

There is no augustus-parallel.log in the logfiles folder. When I looked into braker.log, I didn't find an obvious error message either; the last few lines are:

May251104492018/round-4/blastdbcmd.log
# Sun May 27 15:12:22 2018: deleting job lst files (if existing)
rm /mnt/sequence/mhu2/git/mhu2-xelongata/nanopore/final/polished/fun_1.3/predict_misc/augustus.hints.tmp.gtf
rm /mnt/sequence/mhu2/git/mhu2-xelongata/nanopore/final/polished/fun_1.3/predict_misc/hints.job.lst
rm /mnt/sequence/mhu2/git/mhu2-xelongata/nanopore/final/polished/fun_1.3/predict_misc/aug_hints.lst

The same goes for the gmes.log file. Its last several lines are:

soft/gm_et_linux_64/gmes_petap/hmm_to_gtf.pl  --in /mnt/sequence/mhu2/git/mhu2-xelongata/nanopore/final/polished/fun_1.3/predict_misc/GeneMark-ET/output/gmhmm/dna.fa_3646.out  --app  --out genemark.gtf  --min 300 
/mnt/sequence/mhu2/soft/gm_et_linux_64/gmes_petap/gmes_petap.pl : [Sun May 27 13:00:07 2018] /mnt/sequence/mhu2/soft/gm_et_linux_64/gmes_petap/hmm_to_gtf.pl  --in /mnt/sequence/mhu2/git/mhu2-xelongata/nanopore/final/polished/fun_1.3/predict_misc/GeneMark-ET/output/gmhmm/dna.fa_273.out  --app  --out genemark.gtf  --min 300 
/mnt/sequence/mhu2/soft/gm_et_linux_64/gmes_petap/gmes_petap.pl : [Sun May 27 13:00:08 2018] /mnt/sequence/mhu2/soft/gm_et_linux_64/gmes_petap/hmm_to_gtf.pl  --in /mnt/sequence/mhu2/git/mhu2-xelongata/nanopore/final/polished/fun_1.3/predict_misc/GeneMark-ET/output/gmhmm/dna.fa_154.out  --app  --out genemark.gtf  --min 300 
/mnt/sequence/mhu2/soft/gm_et_linux_64/gmes_petap/gmes_petap.pl : [Sun May 27 13:00:11 2018] /mnt/sequence/mhu2/soft/gm_et_linux_64/gmes_petap/reformat_gff.pl --out genemark.gtf.tmp --trace info/dna.trace --in genemark.gtf  --back
/mnt/sequence/mhu2/soft/gm_et_linux_64/gmes_petap/gmes_petap.pl : [Sun May 27 13:00:11 2018] mv genemark.gtf.tmp genemark.gtf

Thanks for the help!

nextgenusfs commented 6 years ago

Would you be able to share the log files so I can see the entire log for predict as well as braker?

MinjieHu commented 6 years ago

Of course. Attached: braker.log, funannotate-p2g.log, funannotate-predict.log

nextgenusfs commented 6 years ago

Looks like you are running v1.0.0. Can you upgrade to the newest version? Quite a bit has changed, especially in the RNA-seq modules.

MinjieHu commented 6 years ago

Thanks for the reply. I will try with the 1.3.3 version.

By the way, is it OK to run predict with a BAM file from a STAR alignment?

nextgenusfs commented 6 years ago

Yeah, I think it should be okay with STAR, although running PASA is also quite helpful. You can also run that separately and pass the TransDecoder-filtered PASA models to the predict script.

MinjieHu commented 6 years ago

Great. Predict works now. Thanks for the help. But train seems stuck: the progress was 2.51% last night, and right now it is still 2.51%.

nextgenusfs commented 6 years ago

Any clues in the Trinity log file? It should be in the `training/trinity_gg.log` file.

MinjieHu commented 6 years ago

It just shows "All commands completed successfully. :-)". It has already produced a BAM file, "hisat2.coordSorted.bam", that is 2.2G in size, whereas the sorted BAM file from my STAR-based alignment is 3.2G, so I have no idea whether it completed or not. I attach all the log files: funannotate-train.log, Trinity-gg.log, nohup.log

nextgenusfs commented 6 years ago

I'm not sure, but it doesn't look like Trinity is running? Seems like the log file is just full of this error:

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LANG = "zh_CN.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
perl: warning: Setting locale failed.
perl: warning: Setting locale failed.
perl: warning: Setting locale failed.
perl: warning: Setting locale failed.

Perhaps addressing that error will allow it to run.

MinjieHu commented 6 years ago

You are right. Looking at Trinity-gg.log in detail, before the perl warnings appear it shows:

###################################################################
##  Stopping here due to --no_distributed_trinity_exec in effect ##
###################################################################

But I still don't know how to deal with it.

nextgenusfs commented 6 years ago

Shouldn't it just be a matter of setting the correct environment variable? https://stackoverflow.com/questions/2499794/how-to-fix-a-locale-setting-warning-from-perl
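One way to try this without changing the system configuration is to launch the tools with a locale that is guaranteed to exist. The sketch below only illustrates that idea: the helper name `env_with_c_locale` is made up here (it is not part of funannotate or Trinity), and it assumes that forcing the "C" locale is acceptable for the run.

```python
import os


def env_with_c_locale(base_env=None):
    """Return a copy of the environment with the locale forced to "C".

    Hypothetical helper for illustration; not part of funannotate or Trinity.
    """
    env = dict(os.environ if base_env is None else base_env)
    # LC_ALL overrides every other LC_* variable and LANG, so a missing
    # locale such as zh_CN.UTF-8 is no longer requested.
    env["LC_ALL"] = "C"
    env["LANG"] = "C"
    # LANGUAGE is listed in the perl warning as well; drop it if it is set.
    env.pop("LANGUAGE", None)
    return env


# Usage (not executed here): hand the result to subprocess, for example
# subprocess.run(["perl", "-e", "1"], env=env_with_c_locale(), check=True)
```

If perl still prints the warning when started this way, the cause is probably something other than the locale variables.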

MinjieHu commented 6 years ago

By the way, I tried just skipping the update step and going straight to the annotation step. The annotation step finished successfully, but a lot of genes have no typical gene name. Do you only assign a gene name when there is a high-confidence BLAST hit?

nextgenusfs commented 6 years ago

Did you run EggNog mapper as well? Many if not most genes won't have names/product descriptions. It will pull eggnog mapper names as well as UniProt/Swissprot names (60% pident over 60% of the protein), so it is designed to be conservative.
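To make the "60% pident over 60% of the protein" rule concrete, here is a minimal sketch of such a cutoff. It is not funannotate's actual code: the function name, the argument names, and the choice to measure coverage as alignment length divided by query protein length are all assumptions made for illustration.

```python
def passes_identity_coverage(pident, aln_length, query_length,
                             min_pident=60.0, min_coverage=60.0):
    """Return True if a hit meets both the identity and the coverage cutoff.

    Assumed definitions (not taken from funannotate's source):
      pident       - percent identity of the alignment, 0-100
      aln_length   - number of query residues covered by the alignment
      query_length - full length of the query protein
    """
    if query_length <= 0:
        return False
    coverage = 100.0 * aln_length / query_length
    return pident >= min_pident and coverage >= min_coverage
```

Under these assumptions, a hit with 90% identity that spans only a quarter of the protein is rejected, which is one reason a conservative filter leaves many genes unnamed.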

MinjieHu commented 6 years ago

I fixed the perl warning, but when I run update the problem is still there.

Trinity version: Trinity-v2.5.1
Tuesday, May 29, 2018: 17:55:10 CMD: /mnt/sequence/mhu2/miniconda2/opt/trinity-2.5.1/util/support_scripts/ensure_coord_sorted_sam.pl funannotate/update_misc/hisat2.coordSorted.bam
** NOTE: Latest version of Trinity is Trinity-v2.6.6, and can be obtained at:
        https://github.com/trinityrnaseq/trinityrnaseq/releases

-appears to be a coordinate sorted bam file. ok.
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
Tuesday, May 29, 2018: 17:55:10 CMD: java -Xmx64m -XX:ParallelGCThreads=2  -jar /mnt/sequence/mhu2/miniconda2/opt/trinity-2.5.1/util/support_scripts/ExitTester.jar 0
Tuesday, May 29, 2018: 17:55:11 CMD: java -Xmx64m -XX:ParallelGCThreads=2  -jar /mnt/sequence/mhu2/miniconda2/opt/trinity-2.5.1/util/support_scripts/ExitTester.jar 1

----------------------------------------------------------------------------------
-------------- Trinity Phase 1: Clustering of RNA-Seq Reads  ---------------------
----------------------------------------------------------------------------------

Tuesday, May 29, 2018: 17:55:11 CMD: samtools index /mnt/sequence/mhu2/git/mhu2-xelongata/nanopore/final/polished/funannotate/update_misc/hisat2.coordSorted.bam

###################################################################
##  Stopping here due to --no_distributed_trinity_exec in effect ##
###################################################################

All commands completed successfully. :-)

All commands completed successfully. :-)

All commands completed successfully. :-)

All commands completed successfully. :-)

All commands completed successfully. :-)

MinjieHu commented 6 years ago

According to the manual, EggNog mapper runs automatically if it is installed, is that right? I think I have already installed it: I can run emapper.py directly from my shell.

nextgenusfs commented 6 years ago

What version of samtools are you using? It seems like that is also causing an error in Trinity.

As for eggnog: yes, as long as emapper.py is in your $PATH, it should run the analysis during funannotate annotate.
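A quick way to confirm that emapper.py is really visible on $PATH from the environment you launch funannotate in is a check like the one below. This is a small sketch using Python's standard `shutil.which`; the wrapper name `find_on_path` is made up for illustration, and funannotate's own lookup may work differently.

```python
import shutil


def find_on_path(tool="emapper.py"):
    """Return the full path of an executable found on $PATH, or None."""
    return shutil.which(tool)


if __name__ == "__main__":
    location = find_on_path()
    if location is None:
        print("emapper.py not found on $PATH")
    else:
        print("emapper.py found at", location)
```

Running it inside the same conda environment and shell session used for funannotate annotate matters, since $PATH can differ between sessions.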

MinjieHu commented 6 years ago

Samtools Version: 1.7

nextgenusfs commented 6 years ago

Okay, well I think versions of Trinity < 2.6 use a very old version of samtools packaged with Trinity, while newer versions use your system samtools. So you may want to look on the Trinity page for help with your installation, i.e. ensure that you can run the Trinity sample data, etc. I think the most recent version is 2.6.6, so you might consider upgrading.

MinjieHu commented 6 years ago

You are great! After updating Trinity, it works. Unfortunately, it then failed at the PASA step, reporting that it can't find the PASA config.txt file. I also looked into the template config file, and it seems to configure MySQL. I tried to install and configure MySQL, but that is hard for me without root privileges. Is MySQL really necessary, or can I just use SQLite instead? Thanks again for the great help!

nextgenusfs commented 6 years ago

In funannotate v1.3.0 and newer, the default is to try SQLite; in fact you have to specify --pasa_db mysql for it to use MySQL (and I see that the menu has not been updated to reflect this). You will also need the most recent version of PASA for it to be able to use SQLite. Why don't you move into the PASA install directory and run the packaged tests? There is one test for SQLite and one for MySQL.

MinjieHu commented 6 years ago

OK, I will give it a try. By the way, is it fine to just copy the PASA template config file as the config.txt file?

nextgenusfs commented 6 years ago

It's only needed if you are using MySQL -- see here: https://github.com/PASApipeline/PASApipeline/wiki/Pasa_installation_instructions

MinjieHu commented 6 years ago

Finally, I succeeded in running update and annotate. The annotate step did run Eggnog-mapper, but it finished in less than 1 minute, and only 442 gene/product names passed.

[07:07 PM]: 1,204 valid gene/product annotations from 1,626 total
[07:07 PM]: Running Eggnog-mapper
[07:07 PM]: No Eggnog-mapper results found.

The version I am using is from anaconda. It should be version 1.03 according to anaconda, but when I run emapper.py --version it shows emapper-a9fda72.

MinjieHu commented 6 years ago

There is a uniprot_eggnog_raw_names.txt file with 1204 lines in the annotate_misc folder, so it seems eggnog did run.

nextgenusfs commented 6 years ago

That's not the output of eggnog -- it's a raw summary of the gene names/products that were parsed (so this file is created even if eggnog isn't run).

The problem is likely the diamond database: the one distributed with eggnog-mapper was created with an old version of diamond and isn't compatible (v0.8x is not compatible with v0.9x). You can fix this by doing the following:

#navigate to the eggnog-mapper/data folder
#extract protein fasta files and then re-construct diamond database
mv eggnog_proteins.dmnd eggnog_proteins_old.dmnd
diamond getseq --db eggnog_proteins_old.dmnd | diamond makedb --db eggnog_proteins.dmnd
rm eggnog_proteins_old.dmnd

nextgenusfs commented 6 years ago

Also the funannotate-annotate.log files should have more information about potentially what the error was. A successful run in the log file looks like this:

[06/01/18 11:16:33]: emapper.py -m diamond -i /Users/jon/funannotate/sample_data/genome3/annotate_misc/genome.proteins.fasta -o eggnog --cpu 6
[06/01/18 11:46:31]: #  emapper-1.0.3
# ./emapper.py  -m diamond -i /Users/jon/funannotate/sample_data/genome3/annotate_misc/genome.proteins.fasta -o eggnog --cpu 6
  /Users/jon/miniconda2/bin/diamond blastp -d /Users/jon/software/eggnog-mapper/data/eggnog_proteins.dmnd -q /Users/jon/funannotate/sample_data/genome3/annotate_misc/genome.proteins.fasta --more-sensitive --threads 6 -e 0.001000 -o /Users/jon/funannotate/sample_data/genome3/annotate_misc/emappertmp_dmdn_Z7MUcn/b8a125cec30b411f911799dead6693cc --top 3
Functional annotation of refined hits starts now
 Processed queries:147 total_time:2.02165222168 rate:72.71 q/s
Done
   eggnog.emapper.seed_orthologs
   eggnog.emapper.annotations
Total time: 1797.69 secs

And then you should have this output file: annotate_misc/eggnog.emapper.annotations.

MinjieHu commented 6 years ago

Thanks very much! It works now!

MinjieHu commented 6 years ago

Looking at the annotation in detail, I found that only 3,185 gene/product names passed, while the total number of predicted protein-coding genes is 26,791. Is such a low fraction normal? Can I change some parameters to increase the number of gene names that pass?

MinjieHu commented 6 years ago

I compared the eggnog mapper annotation with the final annotation and found that a lot of genes with an eggnog name are labelled as hypothetical protein in the final annotation.

nextgenusfs commented 6 years ago

Yes -- many of those names are invalid and are filtered out. For example many are locus_tags from another organism, i.e.:

FG04299.1
FGSG_01554
PGUG_02518
.
.

These are not valid gene names (really they shouldn't be in EggNog - likely they are placeholders until a formal name is assigned). So the current criteria for filtering the gene names are:

1) cannot contain an underscore
2) cannot contain a period
3) at least 3 characters
4) one character has to be a number
5) but not more than 3 numbers
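The five criteria above can be sketched as a single predicate. This is an illustrative re-implementation written from the description in this comment, not a copy of funannotate's code; the function name `looks_like_gene_name` is made up, and "number" is read here as "digit character".

```python
def looks_like_gene_name(name):
    """Apply the five filtering criteria described above to a candidate name."""
    digits = sum(ch.isdigit() for ch in name)
    return (
        "_" not in name          # 1) no underscore
        and "." not in name      # 2) no period
        and len(name) >= 3       # 3) at least 3 characters
        and digits >= 1          # 4) at least one digit
        and digits <= 3          # 5) but no more than 3 digits
    )


for candidate in ["FG04299.1", "FGSG_01554", "PGUG_02518", "ACT1", "CYP51A"]:
    print(candidate, looks_like_gene_name(candidate))
```

The three locus tags from the example are rejected (period or underscore), while short names with one to three digits such as ACT1 pass.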

Some 'valid' names are probably dropped with this filtering -- but it is necessary to remove all of the invalid names. As always, if somebody has a method to improve this, that would be great.

MinjieHu commented 6 years ago

Thanks for the quick response. I understand your concern. But in my case right now, I need much more functional annotation for my downstream single-cell RNA-seq analysis. So I tried to bypass your filter criteria by commenting out the line

if not '_' in cols[Genei] and not '.' in cols[Genei] and number_present(cols[Genei]):

but I still failed to get more gene names. Is there somewhere else I missed?

nextgenusfs commented 6 years ago

Gene names aren't really that important for functional annotation (at least in my opinion), but assuming you also ran interproscan and the rest of the tools, you should have functional annotation for many proteins (if you don't, it might mean that the prediction step didn't work as planned). For fungi (the organisms I work on) I would typically only expect to get gene names for ~15-20% of the genes, as 60-80% of most fungal genes are "hypothetical" and there isn't generally a known function.

You can also do some of this manually, i.e. if you had a closely related, well-annotated genome, you could transfer gene names/annotation by identifying orthologs between the two genomes and then add your desired gene names using the -a, --annotations option. You could also parse the eggnog mapper file yourself; you would just need to generate a 3-column TSV file to pass to -a.
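Parsing the eggnog mapper file into such a TSV could look roughly like the sketch below. Several things here are assumptions to verify against your own files and the funannotate documentation: that the three columns expected by -a are ID, annotation type and value; that the emapper annotations file is tab-delimited with '#' comment lines; and that the predicted gene name sits in column index 4 (this differs between emapper versions, hence the `name_col` parameter). The function name and the output file name are made up for illustration.

```python
import csv
import io


def eggnog_names_to_tsv(annotation_lines, name_col=4, out=None):
    """Write 'ID<TAB>name<TAB>value' rows for every query that has a name.

    Assumptions (verify against your own files and the funannotate docs):
      - input is tab-delimited and '#' lines are comments
      - column 0 is the query ID, column `name_col` the predicted gene name
      - the 3 columns expected by -a are: ID, annotation type, value
    """
    out = out if out is not None else io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in annotation_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= name_col:
            continue  # row too short to contain a name column
        name = cols[name_col].strip()
        if name and name != "-":
            writer.writerow([cols[0], "name", name])
    return out


# Usage (not executed here; "custom_names.tsv" is a hypothetical file name):
# with open("annotate_misc/eggnog.emapper.annotations") as fh, \
#         open("custom_names.tsv", "w") as tsv:
#     eggnog_names_to_tsv(fh, out=tsv)
```

No name filtering is applied in this sketch, so locus-tag-like names would be written out as well; whether that is desirable depends on the downstream use.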

MinjieHu commented 6 years ago

Thanks for the suggestion. I have another question. In the annotate_misc/all.annotations.txt file, ~15000 genes have name information, but only ~5000 names appear in the final annotation. This difference seems to come from the tbl2asn step. Do you apply some other filtering during this step?

nextgenusfs commented 6 years ago

Closing this because there are several issues in here. If one or more issues arise, please open a new ticket with a single issue per thread.