Open monkfromouterspace opened 4 months ago
To isolate the issue I also ran the individual commands that funannotate train failed at - please see below for the commands and their error outputs.
minimap2 -ax splice -t 80 --cs -u b -G 3000 trained/training/genome.fasta trained/training/trinity.fasta.clean -o test_sam.sam 2>test_minimap.err
stderr file:
[M::mm_idx_gen::174.663*1.73] collected minimizers
[M::mm_idx_gen::307.671*2.18] sorted minimizers
[WARNING][1;31m For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.[0m
[M::main::307.671*2.18] loaded/built the index for 6 target sequence(s)
[M::mm_mapopt_update::312.222*2.16] mid_occ = 1863
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 6
[M::mm_idx_stat::315.100*2.15] distinct minimizers: 267688257 (23.65% are singletons); average occurrences: 10.560; average spacing: 2.930; total length: 8283888652
[M::worker_pipeline::340.517*6.81] mapped 237187 sequences
[M::mm_idx_gen::535.748*4.95] collected minimizers
[M::mm_idx_gen::540.879*5.42] sorted minimizers
[M::main::540.879*5.42] loaded/built the index for 6 target sequence(s)
[M::mm_mapopt_update::540.879*5.42] mid_occ = 1863
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 6
[M::mm_idx_stat::543.575*5.40] distinct minimizers: 267117648 (23.66% are singletons); average occurrences: 10.583; average spacing: 2.930; total length: 8284033979
[M::worker_pipeline::568.211*7.90] mapped 237187 sequences
[M::mm_idx_gen::643.386*7.15] collected minimizers
[M::mm_idx_gen::644.360*7.25] sorted minimizers
[M::main::644.360*7.25] loaded/built the index for 3 target sequence(s)
[M::mm_mapopt_update::644.360*7.25] mid_occ = 1863
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 3
[M::mm_idx_stat::645.879*7.23] distinct minimizers: 184516954 (38.81% are singletons); average occurrences: 4.702; average spacing: 2.931; total length: 2542585960
[M::worker_pipeline::661.994*8.58] mapped 237187 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -ax splice -t 80 --cs -u b -G 3000 -o test_sam.sam trained/training/genome.fasta trained/training/trinity.fasta.clean
[M::main] Real time: 665.888 sec; CPU: 5681.954 sec; Peak RSS: 95.201 GB
samtools sort --reference trained/training/genome.fasta -@ 80 -o trained/training/trinity.alignments.bam test_sam.sam &
stderr file:
[E::sam_parse1] no SQ lines present in the header
samtools sort: truncated file. Aborting
[E::sam_parse1] no SQ lines present in the header
This seems outside funannotate but a query to those tools
If you can let us know how to run it properly with the larger reference I can probably update the minimap2 command accordingly. Also would be helpful to understand the size of the genome that requires a split index.
Thank you for your quick reply. It seems that there are two ways to resolve this, according to this page:
Changing the number of bases that are stored in RAM for indexing, by increasing the value of -I
(8G
by default now, as of Minimap2-2.25). Uses more memory but is quicker
Including --split-prefix
in the minimap2 command with a prefix for the temporary files that it generates, e.g. --split-prefix=tmp
. Uses less memory but is slower and requires disk space
Are you using the latest release? Yes.
Describe the bug I am trying to annotate a very large genome (~28Gb!). Funannotate train is failing at the minimap2 step which, from running the isolated command, seems to be because the genome is being converted to a multi-part index. Samtools is then failing because the resulting file has no @SQ lines in the header. Normally you can add another flag to minimap2 to resolve this but I would have to edit the .py files for this and want to check if there is another way first. Thank you in advance for your assistance.
What command did you issue?
funannotate train -i genome/aLisVul1.pri.asm.20230818_masked_scaffold-1-2-split.fa -o trained/ -l transcriptome/newt_testis_trimmed_1.fq.gz transcriptome/newt_F-liver_trimmed_1.fq.gz transcriptome/newt_ovary_trimmed_1.fq.gz transcriptome/newt_UK-liver_trimmed_1.fq.gz -r transcriptome/newt_testis_trimmed_2.fq.gz transcriptome/newt_F-liver_trimmed_2.fq.gz transcriptome/newt_ovary_trimmed_2.fq.gz transcriptome/newt_UK-liver_trimmed_2.fq.gz --trinity transcriptome/Lvulg_x_Lmont_tgm_reference_transcriptome.fa --species "Lissotriton vulgaris" --cpus 80 --no_trimmomatic --memory 500G >trained/Lvulg_funannotate_run1-train.stdout 2>trained/Lvulg_funannotate_run1-train.err &
Logfiles funannotate-train.log
OS/Install Information
output of
funannotate check --show-versions
You are running Perl v b'5.032001'. Now checking perl modules... Carp: 1.50 Clone: 0.46 DBD::SQLite: 1.72 DBD::mysql: 4.046 DBI: 1.643 DB_File: 1.858 Data::Dumper: 2.183 File::Basename: 2.85 File::Which: 1.24 Getopt::Long: 2.54 Hash::Merge: 0.302 JSON: 4.10 LWP::UserAgent: 6.67 Logger::Simple: 2.0 POSIX: 1.94 Parallel::ForkManager: 2.02 Pod::Usage: 1.69 Scalar::Util::Numeric: 0.40 Storable: 3.15 Text::Soundex: 3.05 Thread::Queue: 3.14 Tie::File: 1.06 URI::Escape: 5.17 YAML: 1.30 local::lib: 2.000029 threads: 2.25 threads::shared: 1.61 All 27 Perl modules installed
Checking Environmental Variables... $PASAHOME=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/opt/pasa-2.5.3 $TRINITY_HOME=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/opt/trinity-2.8.5 $EVM_HOME=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/opt/evidencemodeler-1.1.1 $AUGUSTUS_CONFIG_PATH=/data/tigrr/home/user/anaconda3/envs/funannotate_env1/config/ ERROR: FUNANNOTATE_DB not set. export FUNANNOTATE_DB=/path/to/dir ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir
Checking external dependencies... PASA: 2.5.3 CodingQuarry: 2.0 Trinity: 2.8.5 augustus: 3.5.0 bamtools: bamtools 2.5.1 bedtools: bedtools v2.31.0 blat: BLAT v35 diamond: 2.1.8 emapper.py: [31mThere was an error retrieving eggnog-mapper DB data: not a valid file "/data/tigrr/home/user/anaconda3/envs/funannotate_env1/lib/python3.8/site-packages/data/eggnog.db"[0m [0mMaybe you need to run download_eggnog_data.py[0m emapper-2.1.12 / Expected eggNOG DB version: 5.0.2 / Installed eggNOG DB version: unknown / Diamond found: diamond 2.1.8 / MMseqs2 found: 14.7e284 / Compatible novel families DB version: 1.0.1
ete3: 3.1.3 exonerate: exonerate 2.4.0 fasta: 36.3.8g glimmerhmm: 3.0.4 gmap: 2023-10-10 hisat2: 2.2.1 hmmscan: HMMER 3.3.2 (Nov 2020) hmmsearch: HMMER 3.3.2 (Nov 2020) java: 17.0.3-internal kallisto: 0.46.1 mafft: v7.520 (2023/Mar/22) makeblastdb: makeblastdb 2.14.1+ minimap2: 2.26-r1175 pigz: 2.8 proteinortho: 6.3.0 pslCDnaFilter: no way to determine salmon: salmon 0.14.1 samtools: samtools 1.16.1 snap: 2006-07-28 stringtie: 2.2.1 tRNAscan-SE: 2.0.12 (Nov 2022) tantan: tantan 40 tbl2asn: 25.8 tblastn: tblastn 2.14.1+ trimal: trimAl v1.4.rev15 build[2013-12-17] trimmomatic: 0.39 ERROR: gmes_petap.pl not installed ERROR: signalp not installed