nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

[Question] Clarification for running pipeline on a novel protist #312

Closed jolespin closed 4 years ago

jolespin commented 4 years ago

I'm trying to do the following to assemble a diatom:

(1) Kneaddata/Trimmomatic for my transcript reads;
(2) Mapping the reads to my unmasked genomes;
(3) Using RNA-Spades for the transcript assembly using the reads from [2];
(4) funannotate train for each of my libraries (I have a lot);
(5) Use the following to predict genes using funannotate predict:
(5a) All of my training files from [4] (I believe these will be gff3 files from funannotate train), passed via --pasa_gff;
(5b) Trusted proteins from the nearest organism, passed via --protein_evidence;
(5c) A merged map file for all of my transcriptomes

My questions:

(1) Should the above protocol work?
(2) Can I use multiple --rna_bam?
(3) Does my --rna_bam need to be aligned to the repeat masked assembly?
(4) Does my funannotate train step need to be on the repeat masked assembly?
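For readers following along, the predict step in (5) can be sketched as two command lines assembled in Python. This is only a sketch under stated assumptions: the file names are hypothetical, the flags are the ones named in this thread (`--pasa_gff`, `--protein_evidence`, `--rna_bam`) plus those visible in the test command further down (`-i`, `-o`, `--species`, `--cpus`), and merging the BAM files with `samtools merge` is one way to sidestep question (2), which the thread does not answer. The commands are built and printed, not executed.

```python
import shlex


def build_merge_cmd(bams, merged="merged.bam"):
    """Build `samtools merge <out.bam> <in1.bam> <in2.bam> ...`."""
    return ["samtools", "merge", merged] + list(bams)


def build_predict_cmd(genome, outdir, species, rna_bam, pasa_gff, proteins, cpus=4):
    """Build a `funannotate predict` call from the inputs listed in step (5)."""
    return [
        "funannotate", "predict",
        "-i", genome,
        "-o", outdir,
        "--species", species,
        "--rna_bam", rna_bam,            # (5c) single merged alignment file
        "--pasa_gff", pasa_gff,          # (5a) GFF3 from the training step
        "--protein_evidence", proteins,  # (5b) trusted proteins
        "--cpus", str(cpus),
    ]


def format_cmd(cmd):
    """Render a command list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    # All file names below are hypothetical placeholders.
    print(format_cmd(build_merge_cmd(["lib1.bam", "lib2.bam"], "all_libs.bam")))
    print(format_cmd(build_predict_cmd(
        "diatom.softmasked.fa", "diatom_out", "Genus species",
        "all_libs.bam", "train.pasa.gff3", "trusted_proteins.fa", cpus=8)))
```

Note that `samtools merge` expects its input files to be sorted the same way, so each per-library BAM would need to be coordinate-sorted first.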

jolespin commented 4 years ago

Hmmm.... I got some progress (I think).

(funannotate_env) -bash-4.1$ conda deactivate
(base) -bash-4.1$ conda activate funannotate_env
----------------------------------
Activating Funannotate Environment
----------------------------------
GeneMark-ES_ET-4.46 license already exists: /home/jespinoz/.gm_key
Please delete the file and reactivate environment if you to overwrite the license file.
(funannotate_env) -bash-4.1$ ls -lh /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/funannotate_env/bin/ | grep "emapper.py"
lrwxrwxrwx  1 jespinoz tigr    76 Aug 21 04:08 emapper.py -> /usr/local/devel/ANNOTATION/jespinoz/Packages/eggnog-mapper-1.0.3/emapper.py
(funannotate_env) -bash-4.1$ funannotate check --show-versions
-------------------------------------------------------
Checking dependencies for funannotate v1.6.0-df4262f
-------------------------------------------------------
You are running Python v 2.7.15. Now checking python packages...
All 11 python packages installed

You are running Perl v 5.026002. Now checking perl modules...
All 27 Perl modules installed

Checking external dependencies...
    ERROR: emapper.py not installed
Checking Environmental Variables...
All 7 environmental variables are set
-------------------------------------------------------

Is there a particular place funannotate is looking for emapper.py?

nextgenusfs commented 4 years ago

It’s essentially calling which emapper.py, or something close to it: it just tries to run the command and get the version.
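A check of that kind can be sketched as follows. This is not funannotate's actual code, only a minimal illustration of "look the command up on PATH, then run it and read the version". The `check_tool` helper and the `--version` flag are assumptions; a given tool may use a different flag.

```python
import shutil
import subprocess


def check_tool(name, version_flag="--version"):
    """Return (path, first_line_of_output) for a tool on PATH.

    Returns (None, None) if the tool is not found, and (path, None) if it
    is found but cannot be executed or does not answer in time.
    """
    path = shutil.which(name)  # same lookup rule as the shell's `which`
    if path is None:
        return None, None
    try:
        proc = subprocess.run(
            [path, version_flag],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # some tools print the version to stderr
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return path, None
    lines = proc.stdout.strip().splitlines()
    return path, (lines[0] if lines else "")


if __name__ == "__main__":
    print(check_tool("emapper.py"))
```

With a check like this, a tool can be reported as missing even though a file with that name exists, for instance when the file is found but fails to start.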

jolespin commented 4 years ago

Interesting. I wonder if it would correctly call it in the actual pipeline since it's in the path via export PATH=${OPT}/pasa-2.3.3/bin:${PACKAGES}/RepeatModeler-1.0.11:${PACKAGES}/GeneMark-ES_ET-4.46/gm_et_linux_64/:${PACKAGES}/eggnog-mapper-1.0.3:$PATH
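Since the script is both symlinked into the environment's bin directory and added to PATH, one way to see what a lookup would find is to walk PATH by hand. The sketch below is a hypothetical diagnostic helper (`explain_path_lookup` is not part of funannotate) that reports, for every PATH directory containing the name, whether the entry is a dangling symlink, not executable, or usable.

```python
import os


def explain_path_lookup(name, path=None):
    """List (candidate, status) for every PATH directory that contains `name`."""
    path = os.environ.get("PATH", "") if path is None else path
    report = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if not os.path.lexists(candidate):  # lexists does not follow symlinks
            continue
        if os.path.islink(candidate) and not os.path.exists(candidate):
            status = "dangling symlink"     # link target is missing
        elif not os.path.isfile(candidate):
            status = "not a regular file"
        elif not os.access(candidate, os.X_OK):
            status = "not executable"
        else:
            status = "ok"
        report.append((candidate, status))
    return report


if __name__ == "__main__":
    for candidate, status in explain_path_lookup("emapper.py"):
        print(status, candidate)
```

A file can pass all of these checks and still fail to run, for example when the interpreter named in its shebang line or one of its Python imports is missing, which is a case that a run-and-get-version check would also flag.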

I tried running the test but got an error at one of the stages:

(funannotate_env) -bash-4.1$ funannotate test -t predict rna-seq annotate --cpus 4
#########################################################
Running `funannotate predict` unit testing
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate --augustus_species saccharomyces --cpus 4 --species Awesome testicus
#########################################################
-------------------------------------------------------
[07:09 PM]: OS: linux2, 4 cores, ~ 8 GB RAM. Python: 2.7.15
[07:09 PM]: Running funannotate v1.6.0-df4262f
[07:09 PM]: Augustus training set for saccharomyces already exists. To re-train provide unique --augustus_species argument
[07:09 PM]: AUGUSTUS (3.3) detected, version seems to be compatible with BRAKER and BUSCO
[07:09 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/funannotate_env/lib/python2.7/multiprocessing/util.py", line 277, in _run_finalizers
    finalizer()
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/funannotate_env/lib/python2.7/multiprocessing/util.py", line 207, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/funannotate_env/lib/python2.7/shutil.py", line 266, in rmtree
    onerror(os.remove, fullname, sys.exc_info())
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/funannotate_env/lib/python2.7/shutil.py", line 264, in rmtree
    os.remove(fullname)
OSError: [Errno 16] Device or resource busy: '/usr/local/scratch/METAGENOMICS/jespinoz/TMPDIR/pymp-un30va/.nfs000000013ae1f2f9000026b8'
Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/funannotate_env/lib/python2.7/multiprocessing/util.py", line 277, in _run_finalizers
    finalizer()
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/funannotate_env/lib/python2.7/multiprocessing/util.py", line 207, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/funannotate_env/lib/python2.7/shutil.py", line 266, in rmtree
    onerror(os.remove, fullname, sys.exc_info())
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/funannotate_env/lib/python2.7/shutil.py", line 264, in rmtree
    os.remove(fullname)
OSError: [Errno 16] Device or resource busy: '/usr/local/scratch/METAGENOMICS/jespinoz/TMPDIR/pymp-bgi2E7/.nfs000000013ae1f2fb000026b9'
[07:09 PM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
[07:09 PM]: Mapping proteins to genome using Diamond blastx/Exonerate
[07:09 PM]: Using 1,065 proteins as queries
[07:09 PM]: Running Diamond pre-filter search
[07:09 PM]: Found 1,774 preliminary alignments
[07:10 PM]: Exonerate finished: found 1,347 alignments
[07:10 PM]: Running GeneMark-ES on assembly
[07:13 PM]: Converting GeneMark GTF file to GFF3
[07:13 PM]: Found 1,540 gene models
[07:13 PM]: Running BUSCO to find conserved gene models for training Augustus
[07:13 PM]: Multi-threading in tblastn v2.6.0 is unstable, running in single threaded mode for BUSCO
[07:14 PM]: BUSCO training of Augusus failed, check busco logs, exiting
#########################################################
Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/Packages/Funannotate-1.6.0/bin/funannotate-test.py", line 334, in <module>
    runPredictTest()
  File "/usr/local/devel/ANNOTATION/jespinoz/Packages/Funannotate-1.6.0/bin/funannotate-test.py", line 168, in runPredictTest
    assert 1500 <= countGFFgenes(os.path.join(tmpdir, 'annotate', 'predict_results', 'Awesome_testicus.gff3')) <= 1800
  File "/usr/local/devel/ANNOTATION/jespinoz/Packages/Funannotate-1.6.0/bin/funannotate-test.py", line 58, in countGFFgenes
    with open(input, 'rU') as f:
IOError: [Errno 2] No such file or directory: 'test-predict_224980/annotate/predict_results/Awesome_testicus.gff3'
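The final IOError is a knock-on effect: predict exited after the BUSCO training failure, so the GFF3 file was never written and the test's gene-count check had nothing to open. A range check like the one in the traceback (`1500 <= countGFFgenes(...) <= 1800`) could look roughly like this. The function below is a guess at what such a counter does (counting rows whose third column is `gene`), not the code in funannotate-test.py.

```python
import os


def count_gff_genes(path):
    """Count GFF3 rows whose feature type (column 3) is 'gene'."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments and blank lines
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 3 and cols[2] == "gene":
                count += 1
    return count


def check_gene_count(path, low=1500, high=1800):
    """Return True if the file exists and its gene count is within range."""
    if not os.path.isfile(path):
        # A missing file means an upstream step failed, not a bad gene count.
        print("missing output: %s (check the predict log first)" % path)
        return False
    return low <= count_gff_genes(path) <= high
```

Checking for the file before counting keeps the real cause (the earlier failure) from being buried under a secondary traceback.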

Here's the busco.log head:

```
(funannotate_env) -bash-4.1$ head -n 28 test-predict_224980/annotate/logfiles/busco.log
INFO ****************** Start a BUSCO 2.0 analysis, current time: 08/21/2019 19:13:55 ******************
INFO The lineage dataset is: dikarya_odb9 (eukaryota)
INFO Mode is: genome
INFO Maximum number of regions limited to: 3
INFO To reproduce this run: python /usr/local/devel/ANNOTATION/jespinoz/Packages/Funannotate-1.6.0/util/funannotate-BUSCO2.py -i /local/ifs3_scratch/METAGENOMICS/jespinoz/TMPDIR/funannotate_output/test-predict_224980/annotate/predict_misc/genome.softmasked.fa -o saccharomyces -l /usr/local/scratch/METAGENOMICS/jespinoz/db/funannotate_db/dikarya/ -m genome -c 4 -sp anidulans
INFO Check dependencies...
INFO Check input file...
INFO Temp directory is ./tmp/
INFO ****** Phase 1 of 2, initial predictions ******
INFO ****** Step 1/3, current time: 08/21/2019 19:13:55 ******
INFO Create blast database...
INFO [makeblastdb] Building a new DB, current time: 08/21/2019 19:13:55
INFO [makeblastdb] New DB name: /local/ifs3_scratch/METAGENOMICS/jespinoz/TMPDIR/funannotate_output/test-predict_224980/annotate/predict_misc/busco/tmp/saccharomyces_1878864831
INFO [makeblastdb] New DB title: /local/ifs3_scratch/METAGENOMICS/jespinoz/TMPDIR/funannotate_output/test-predict_224980/annotate/predict_misc/genome.softmasked.fa
INFO [makeblastdb] Sequence type: Nucleotide
INFO [makeblastdb] Keep MBits: T
INFO [makeblastdb] Maximum file size: 1000000000B
INFO [makeblastdb] Adding sequences from FASTA; added 6 sequences in 0.0470541 seconds.
INFO Running tblastn, writing output to /local/ifs3_scratch/METAGENOMICS/jespinoz/TMPDIR/funannotate_output/test-predict_224980/annotate/predict_misc/busco/run_saccharomyces/blast_output/tblastn_saccharomyces.tsv...
INFO ****** Step 2/3, current time: 08/21/2019 19:14:16 ******
INFO Getting coordinates for candidate regions...
INFO Pre-Augustus scaffold extraction...
INFO Running Augustus prediction using anidulans as species:
INFO [augustus] Please find all logs related to Augustus here: /local/ifs3_scratch/METAGENOMICS/jespinoz/TMPDIR/funannotate_output/test-predict_224980/annotate/predict_misc/busco/run_saccharomyces/augustus_output/augustus.log
INFO 08/21/2019 19:14:16 => 0% of predictions performed (743 to be done)
INFO [augustus] /bin/sh: line 1: 233534 Segmentation fault (core dumped) augustus --stopCodonExcludedFromCDS=False --codingseq=1 --proteinprofile=/usr/local/scratch/METAGENOMICS/jespinoz/db/funannotate_db/dikarya/prfl/EOG092644X6.prfl --predictionStart=109668 --predictionEnd=120165 --species=anidulans './tmp/CP022972.1saccharomyces_1878864831_.temp' > /local/ifs3_scratch/METAGENOMICS/jespinoz/TMPDIR/funannotate_output/test-predict_224980/annotate/predict_misc/busco/run_saccharomyces/augustus_output/predicted_genes/EOG092644X6.out.1 2>> /local/ifs3_scratch/METAGENOMICS/jespinoz/TMPDIR/funannotate_output/test-predict_224980/annotate/predict_misc/busco/run_saccharomyces/augustus_output/augustus.log
```

Haha, I feel like I'm SO close to getting this to work smoothly for this diatom. I've learned A LOT about conda environments, perl libraries, and sourcing scripts during this process.

nextgenusfs commented 4 years ago

Might be the same write permissions error for Augustus.......?

Oh okay, I didn’t see the core dump. That looks like an Augustus compilation error. The proteinprofile mode is sensitive to the compiler, and it is inconsistent between versions. It does seem to have passed the funannotate test, though. Is this a different version/binary of Augustus? Maybe copy the other version that you had working to this location?

nextgenusfs commented 4 years ago

Emapper isn’t required, so if it isn’t in the path when running annotate it will just skip that step.

jolespin commented 4 years ago

So that Segmentation fault (core dumped) from augustus is caused by not being able to write files?

I checked which files were available in the tmp directory:

(funannotate_env) -bash-4.1$ ls test-predict_202108/annotate/predict_misc/busco/tmp/
CP022970.1saccharomyces_4117535203_.temp  CP022972.1saccharomyces_4117535203_.temp  CP022974.1saccharomyces_4117535203_.temp  saccharomyces_4117535203.nhr  saccharomyces_4117535203.nsq
CP022971.1saccharomyces_4117535203_.temp  CP022973.1saccharomyces_4117535203_.temp  CP022975.1saccharomyces_4117535203_.temp  saccharomyces_4117535203.nin

I then tried calling augustus externally:

augustus --stopCodonExcludedFromCDS=False --codingseq=1 --proteinprofile=/usr/local/scratch/METAGENOMICS/jespinoz/db/funannotate_db/dikarya/prfl/EOG092644X6.prfl --predictionStart=109668 --predictionEnd=120165 --species=anidulans test-predict_202108/annotate/predict_misc/busco/tmp/CP022970.1saccharomyces_4117535203_.temp
Segmentation fault (core dumped)

Could it be a faulty installation of augustus?
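One way to separate "the binary crashed" from "the binary ran and reported an error" is to look at how the process ended. The sketch below uses a hypothetical helper, `run_and_classify`, and relies on Python's documented behaviour on POSIX systems: a negative `returncode` of -N means the child was terminated by signal N. A write-permission problem would more typically show up as an error message with a nonzero exit code, while a segmentation fault shows up as SIGSEGV.

```python
import signal
import subprocess
import sys


def run_and_classify(cmd):
    """Run `cmd` and return ("ok", 0), ("exit", code) or ("signal", name)."""
    proc = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if proc.returncode < 0:
        # On POSIX, -N means the process was killed by signal N.
        return "signal", signal.Signals(-proc.returncode).name
    if proc.returncode == 0:
        return "ok", 0
    return "exit", proc.returncode


if __name__ == "__main__":
    # Stand-in for the augustus command above: a child that segfaults itself.
    crash = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
    print(run_and_classify(crash))
```

Running the same augustus command line through a helper like this, from a directory that is known to be writable, would show whether the crash is independent of where the output goes.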

nextgenusfs commented 4 years ago

Looks like a compilation error, so yes, probably an install error.

nextgenusfs commented 4 years ago

Conda package now available -- should fix all these install errors (I hope).

jolespin commented 4 years ago

This is awesome. Thank you so much for doing this! I literally just got my complete run finished the other day for my diatom.