nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
300 stars 82 forks source link

Samtools sort failing in funannotate train because of multi-part indexing by minimap2 #1006

Open monkfromouterspace opened 4 months ago

monkfromouterspace commented 4 months ago

Are you using the latest release? Yes.

Describe the bug I am trying to annotate a very large genome (~28Gb!). Funannotate train is failing at the minimap2 step which, from running the isolated command, seems to be because the genome is being converted to a multi-part index. Samtools is then failing because the resulting file has no @SQ lines in the header. Normally you can add another flag to minimap2 to resolve this but I would have to edit the .py files for this and want to check if there is another way first. Thank you in advance for your assistance.

What command did you issue? funannotate train -i genome/aLisVul1.pri.asm.20230818_masked_scaffold-1-2-split.fa -o trained/ -l transcriptome/newt_testis_trimmed_1.fq.gz transcriptome/newt_F-liver_trimmed_1.fq.gz transcriptome/newt_ovary_trimmed_1.fq.gz transcriptome/newt_UK-liver_trimmed_1.fq.gz -r transcriptome/newt_testis_trimmed_2.fq.gz transcriptome/newt_F-liver_trimmed_2.fq.gz transcriptome/newt_ovary_trimmed_2.fq.gz transcriptome/newt_UK-liver_trimmed_2.fq.gz --trinity transcriptome/Lvulg_x_Lmont_tgm_reference_transcriptome.fa --species "Lissotriton vulgaris" --cpus 80 --no_trimmomatic --memory 500G >trained/Lvulg_funannotate_run1-train.stdout 2>trained/Lvulg_funannotate_run1-train.err &

Logfiles funannotate-train.log

[02/29/24 16:04:30]: /data/tigrr/home/user/anaconda3/envs/funannotate_env1/bin/funannotate train -i genome/aLisVul1.pri.asm.20230818_masked_scaffold-1-2-split.fa -o trained/ -l transcriptome/newt_testis_trimmed_1.fq.gz transcriptome/newt_F-liver_trimmed_1.fq.gz transcriptome/newt_ovary_trimmed_1.fq.gz transcriptome/newt_UK-liver_trimmed_1.fq.gz -r transcriptome/newt_testis_trimmed_2.fq.gz transcriptome/newt_F-liver_trimmed_2.fq.gz transcriptome/newt_ovary_trimmed_2.fq.gz transcriptome/newt_UK-liver_trimmed_2.fq.gz --trinity transcriptome/Lvulg_x_Lmont_tgm_reference_transcriptome.fa --species Lissotriton vulgaris --cpus 80 --no_trimmomatic --memory 500G

[02/29/24 16:04:30]: OS: Ubuntu 22.04, 240 cores, ~ 1915 GB RAM. Python: 3.8.15
[02/29/24 16:04:30]: Running 1.8.16
[02/29/24 16:04:30]: fasta version=36.3.8g path=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/bin/fasta
[02/29/24 16:04:30]: minimap2 version=2.26-r1175 path=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/bin/minimap2
[02/29/24 16:04:30]: hisat2 version=2.2.1 path=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/bin/hisat2
[02/29/24 16:04:30]: hisat2-build version=NA path=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/bin/hisat2-build
[02/29/24 16:04:30]: Trinity version=2.8.5 path=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/bin/Trinity
[02/29/24 16:04:30]: java version=17.0.3-internal path=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/bin/java
[02/29/24 16:04:30]: kallisto version=0.46.1 path=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/bin/kallisto
[02/29/24 16:04:30]: /data/tigrr/home/user/anaconda3/envs/funannotate_env1/opt/pasa-2.5.3/Launch_PASA_pipeline.pl version=NA path=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/opt/pasa-2.5.3/Launch_PASA_pipeline.pl
[02/29/24 16:04:30]: /data/tigrr/home/user/anaconda3/envs/funannotate_env1/opt/pasa-2.5.3/bin/seqclean version=NA path=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/opt/pasa-2.5.3/bin/seqclean
[02/29/24 16:04:30]: minimap2 version=2.26-r1175 path=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/bin/minimap2
[02/29/24 16:04:30]: blat version=BLAT v35 path=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/bin/blat
[02/29/24 16:05:48]: Multiple inputs for --left and --right detected, concatenating PE reads
[02/29/24 16:05:48]: cat /data/tigrr/home/user/analyses/genome_annotation/de_novo/smooth_newt/inputs/transcriptome/newt_testis_trimmed_1.fq.gz /data/tigrr/home/user/analyses/genome_annotation/de_novo/smooth_newt/inputs/transcriptome/newt_F-liver_trimmed_1.fq.gz /data/tigrr/home/user/analyses/genome_annotation/de_novo/smooth_newt/inputs/transcriptome/newt_ovary_trimmed_1.fq.gz /data/tigrr/home/user/analyses/genome_annotation/de_novo/smooth_newt/inputs/transcriptome/newt_UK-liver_trimmed_1.fq.gz
[02/29/24 16:05:53]: cat /data/tigrr/home/user/analyses/genome_annotation/de_novo/smooth_newt/inputs/transcriptome/newt_testis_trimmed_2.fq.gz /data/tigrr/home/user/analyses/genome_annotation/de_novo/smooth_newt/inputs/transcriptome/newt_F-liver_trimmed_2.fq.gz /data/tigrr/home/user/analyses/genome_annotation/de_novo/smooth_newt/inputs/transcriptome/newt_ovary_trimmed_2.fq.gz /data/tigrr/home/user/analyses/genome_annotation/de_novo/smooth_newt/inputs/transcriptome/newt_UK-liver_trimmed_2.fq.gz
[02/29/24 16:06:00]: Input reads: ('trained/training/left.fq.gz', 'trained/training/right.fq.gz', None)
[02/29/24 16:06:00]: Trimmomatic will be skipped
[02/29/24 16:06:00]: Quality trimmed reads: ('trained/training/left.fq.gz', 'trained/training/right.fq.gz', None)
[02/29/24 16:06:00]: FASTQ headers seem compatible with Trinity
[02/29/24 16:06:00]: Read normalization will be skipped
[02/29/24 16:06:00]: Normalized reads: ('trained/training/left.fq.gz', 'trained/training/right.fq.gz', None)
[02/29/24 16:06:00]: Long reads: (None, None, None)
[02/29/24 16:06:00]: Long reads FASTA format: (None, None, None)
[02/29/24 16:06:00]: Long SeqCleaned reads: (None, None, None)
[02/29/24 16:06:00]: Parsing assembled trinity data : transcriptome/Lvulg_x_Lmont_tgm_reference_transcriptome.fa
[02/29/24 16:06:01]: Removing poly-A sequences from trinity transcripts using seqclean
[02/29/24 16:06:01]: /data/tigrr/home/user/anaconda3/envs/funannotate_env1/opt/pasa-2.5.3/bin/seqclean trinity.fasta -c 16
[02/29/24 16:06:15]: seqclean running options: 
seqclean trinity.fasta -c 16
 Standard log file: seqcl_trinity.fasta.log
 Error log file:    err_seqcl_trinity.fasta.log
 Using 16 CPUs for cleaning
-= Rebuilding trinity.fasta cdb index =-
 Launching actual cleaning process:
 psx -p 16  -n 1000  -i trinity.fasta -d cleaning -C '/data/tigrr/home/user/analyses/genome_annotation/de_novo/smooth_newt/inputs/trinity.fasta:ANLMS100:::11:0' -c '/data/tigrr/home/user/anaconda3/envs/funannotate_env1/opt/pasa-2.5.3/bin/seqclean.psx'
Collecting cleaning reports

**************************************************
Sequences analyzed:    237189
-----------------------------------
                   valid:    237187  (2305 trimmed)
                 trashed:         2
**************************************************
----= Trashing summary =------
                by 'dust':        2
------------------------------
Output file containing only valid and trimmed sequences: trinity.fasta.clean
For trimming and trashing details see cleaning report  : trinity.fasta.cln
--------------------------------------------------
seqclean (trinity.fasta) finished on machine 
 in /data/tigrr/home/user/analyses/genome_annotation/de_novo/smooth_newt/inputs, without a detectable error.

[02/29/24 16:06:15]: minimap2 -ax splice -t 80 --cs -u b -G 3000 trained/training/genome.fasta trained/training/trinity.fasta.clean | samtools sort --reference trained/training/genome.fasta -@ 4 -o trained/training/trinity.alignments.bam -
[02/29/24 16:10:54]: Alignment failed, BAM files empty. Please check logfile

OS/Install Information

You are running Perl v b'5.032001'. Now checking perl modules... Carp: 1.50 Clone: 0.46 DBD::SQLite: 1.72 DBD::mysql: 4.046 DBI: 1.643 DB_File: 1.858 Data::Dumper: 2.183 File::Basename: 2.85 File::Which: 1.24 Getopt::Long: 2.54 Hash::Merge: 0.302 JSON: 4.10 LWP::UserAgent: 6.67 Logger::Simple: 2.0 POSIX: 1.94 Parallel::ForkManager: 2.02 Pod::Usage: 1.69 Scalar::Util::Numeric: 0.40 Storable: 3.15 Text::Soundex: 3.05 Thread::Queue: 3.14 Tie::File: 1.06 URI::Escape: 5.17 YAML: 1.30 local::lib: 2.000029 threads: 2.25 threads::shared: 1.61 All 27 Perl modules installed

Checking Environmental Variables... $PASAHOME=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/opt/pasa-2.5.3 $TRINITY_HOME=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/opt/trinity-2.8.5 $EVM_HOME=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/opt/evidencemodeler-1.1.1 $AUGUSTUS_CONFIG_PATH=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/config/ ERROR: FUNANNOTATE_DB not set. export FUNANNOTATE_DB=/path/to/dir ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir

Checking external dependencies... PASA: 2.5.3 CodingQuarry: 2.0 Trinity: 2.8.5 augustus: 3.5.0 bamtools: bamtools 2.5.1 bedtools: bedtools v2.31.0 blat: BLAT v35 diamond: 2.1.8 emapper.py: There was an error retrieving eggnog-mapper DB data: not a valid file "/data/tigrr/home/user/anaconda3/envs/funannotate_env1/lib/python3.8/site-packages/data/eggnog.db" Maybe you need to run download_eggnog_data.py emapper-2.1.12 / Expected eggNOG DB version: 5.0.2 / Installed eggNOG DB version: unknown / Diamond found: diamond 2.1.8 / MMseqs2 found: 14.7e284 / Compatible novel families DB version: 1.0.1

ete3: 3.1.3 exonerate: exonerate 2.4.0 fasta: 36.3.8g glimmerhmm: 3.0.4 gmap: 2023-10-10 hisat2: 2.2.1 hmmscan: HMMER 3.3.2 (Nov 2020) hmmsearch: HMMER 3.3.2 (Nov 2020) java: 17.0.3-internal kallisto: 0.46.1 mafft: v7.520 (2023/Mar/22) makeblastdb: makeblastdb 2.14.1+ minimap2: 2.26-r1175 pigz: 2.8 proteinortho: 6.3.0 pslCDnaFilter: no way to determine salmon: salmon 0.14.1 samtools: samtools 1.16.1 snap: 2006-07-28 stringtie: 2.2.1 tRNAscan-SE: 2.0.12 (Nov 2022) tantan: tantan 40 tbl2asn: 25.8 tblastn: tblastn 2.14.1+ trimal: trimAl v1.4.rev15 build[2013-12-17] trimmomatic: 0.39 ERROR: gmes_petap.pl not installed ERROR: signalp not installed

monkfromouterspace commented 4 months ago

To isolate the issue I also ran the individual commands that funannotate train failed at - please see below for the commands and their error outputs.

minimap2 -ax splice -t 80 --cs -u b -G 3000 trained/training/genome.fasta trained/training/trinity.fasta.clean -o test_sam.sam 2>test_minimap.err

stderr file:

[M::mm_idx_gen::174.663*1.73] collected minimizers
[M::mm_idx_gen::307.671*2.18] sorted minimizers
[WARNING] For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.
[M::main::307.671*2.18] loaded/built the index for 6 target sequence(s)
[M::mm_mapopt_update::312.222*2.16] mid_occ = 1863
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 6
[M::mm_idx_stat::315.100*2.15] distinct minimizers: 267688257 (23.65% are singletons); average occurrences: 10.560; average spacing: 2.930; total length: 8283888652
[M::worker_pipeline::340.517*6.81] mapped 237187 sequences
[M::mm_idx_gen::535.748*4.95] collected minimizers
[M::mm_idx_gen::540.879*5.42] sorted minimizers
[M::main::540.879*5.42] loaded/built the index for 6 target sequence(s)
[M::mm_mapopt_update::540.879*5.42] mid_occ = 1863
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 6
[M::mm_idx_stat::543.575*5.40] distinct minimizers: 267117648 (23.66% are singletons); average occurrences: 10.583; average spacing: 2.930; total length: 8284033979
[M::worker_pipeline::568.211*7.90] mapped 237187 sequences
[M::mm_idx_gen::643.386*7.15] collected minimizers
[M::mm_idx_gen::644.360*7.25] sorted minimizers
[M::main::644.360*7.25] loaded/built the index for 3 target sequence(s)
[M::mm_mapopt_update::644.360*7.25] mid_occ = 1863
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 3
[M::mm_idx_stat::645.879*7.23] distinct minimizers: 184516954 (38.81% are singletons); average occurrences: 4.702; average spacing: 2.931; total length: 2542585960
[M::worker_pipeline::661.994*8.58] mapped 237187 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -ax splice -t 80 --cs -u b -G 3000 -o test_sam.sam trained/training/genome.fasta trained/training/trinity.fasta.clean
[M::main] Real time: 665.888 sec; CPU: 5681.954 sec; Peak RSS: 95.201 GB

samtools sort --reference trained/training/genome.fasta -@ 80 -o trained/training/trinity.alignments.bam test_sam.sam &

stderr file:

[E::sam_parse1] no SQ lines present in the header
samtools sort: truncated file. Aborting
[E::sam_parse1] no SQ lines present in the header
hyphaltip commented 3 months ago

This seems outside funannotate but a query to those tools

nextgenusfs commented 3 months ago

If you can let us know how to run it properly with the larger reference I can probably update the minimap2 command accordingly. Also would be helpful to understand the size of the genome that requires a split index.

monkfromouterspace commented 3 months ago

Thank you for your quick reply. It seems that there are two ways to resolve this, according to this page: