nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

minimap2 failed with '[WARNING] For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.' #294

Closed: songtaogui closed this issue 4 years ago

songtaogui commented 5 years ago

Are you using the latest release? Yes, I am using version: 1.5.3-21ad095

Describe the bug minimap2 failed while aligning transcript.fa to the genome; see the logs below.

What command did you issue? funannotate predict

Logfiles

[05/24/19 11:18:10]: /public/home/stgui/.linuxbrew/Cellar/funannotate/bin/funannotate-predict.py --cpus 90 -i 00_Pan_merge_M200bp_fmtname_smask.fasta -o PANZ_funannotate_predict -s maize --name PANZEA --busco_seed_species maize --busco_db embryophyta --protein_evidence /public/home/stgui/work/PANZ_funannotate/Protein_evidence/00_zea_and_uniprot_Panicoideae_cdhit90.fasta --transcript_evidence /public/home/stgui/work/PANZ_funannotate/RNA_evidence/Pan_maize_and_TEO_RNA_cdhit_nonN.fasta --min_intronlen 5 --max_intronlen 60000 --min_protlen 10

[05/24/19 11:18:10]: OS: linux2, 192 cores, ~ 2113 GB RAM. Python: 2.7.16
[05/24/19 11:18:10]: Running funannotate v1.5.3-21ad095
[05/24/19 11:18:14]: Augustus training set for maize already exists. To re-train provide unique --augustus_species argument
[05/24/19 11:18:16]: AUGUSTUS (3.3.2) detected, version seems to be compatible with BRAKER and BUSCO
[05/24/19 11:19:00]: Loading genome assembly and parsing soft-masked repetitive sequences
[05/25/19 15:43:35]: Genome loaded: 2,951,918 scaffolds; 5,385,252,087 bp; 50.67% repeats masked
[05/25/19 15:43:58]: Aligning transcript evidence to genome with minimap2
[05/25/19 15:43:58]: /public/home/stgui/.linuxbrew/Cellar/funannotate/util/sam2bam.sh minimap2 -ax splice -t 90 --cs -u b -G 60000 /public/home/stgui/work/PANZ_funannotate/PANZ_funannotate_predict/predict_misc/genome.softmasked.fa PANZ_funannotate_predict/predict_misc/transcripts.combined.fa 4 PANZ_funannotate_predict/predict_misc/transcripts.minimap2.bam
[05/25/19 15:47:38]: [M::mm_idx_gen::113.815*1.88] collected minimizers
[M::mm_idx_gen::125.764*5.28] sorted minimizers
[WARNING] For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.
[M::main::125.764*5.28] loaded/built the index for 2251155 target sequence(s)
[M::mm_mapopt_update::132.233*5.07] mid_occ = 1642
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 2251155
[M::mm_idx_stat::134.507*5.00] distinct minimizers: 140856000 (29.25% are singletons); average occurrences: 9.591; average spacing: 2.961
[E::sam_parse1] missing SAM header
[W::sam_read1] Parse error at line 2
[main_samview] truncated file.
[ERROR] failed to write the results

[05/25/19 15:47:40]: Found 0 alignments, wrote GFF3 and Augustus hints to file
[05/25/19 15:47:43]: Mapping proteins to genome using Diamond blastx/Exonerate
# still running ...

OS/Install Information

You are running Perl v 5.028001. Now checking perl modules...
Bio::Perl: 1.007002
Carp: 1.50
Clone: 0.41
DBD::SQLite: 1.62
DBI: 1.642
DB_File: 1.84
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.39
Logger::Simple: 2.0
POSIX: 1.84
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.08
Text::Soundex: 3.05
Thread::Queue: 3.13
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.29
threads: 2.22
threads::shared: 1.59
ERROR: DBD::mysql not installed, install with cpanm DBD::mysql

Checking external dependencies...
Traceback (most recent call last):
  File "/public/home/stgui/.linuxbrew/bin/ete3", line 6, in <module>
    from ete3.tools.ete import main
  File "/public/home/stgui/.linuxbrew/opt/python/lib/python3.7/site-packages/ete3/tools/ete.py", line 55, in <module>
    from . import (ete_split, ete_expand, ete_annotate, ete_ncbiquery, ete_view,
  File "/public/home/stgui/.linuxbrew/opt/python/lib/python3.7/site-packages/ete3/tools/ete_view.py", line 48, in <module>
    from .. import (Tree, PhyloTree, TextFace, RectFace, faces, TreeStyle, CircleFace, AttrFace,
ImportError: cannot import name 'TextFace' from 'ete3' (/public/home/stgui/.linuxbrew/opt/python/lib/python3.7/site-packages/ete3/__init__.py)
CodingQuarry: 2.0
RepeatMasker: RepeatMasker 4.0.9
RepeatModeler: RepeatModeler 1.0.8
Trinity: 2.8.3
augustus: 3.3.2
bamtools: bamtools 2.5.1
bedtools: bedtools v2.27.1
blat: BLAT v36
diamond: diamond 0.8.22
emapper.py: emapper-1.0.3
exonerate: exonerate 2.2.0
fasta: no way to determine
gmap: 2015-09-29
gmes_petap.pl: 4.38
hisat2: 2.1.0
hmmscan: HMMER 3.1b2 (February 2015)
hmmsearch: HMMER 3.1b2 (February 2015)
java: 1.8.0_181-ojdkbuild
kallisto: 0.44.0
mafft: v7.407 (2018/Jul/23)
makeblastdb: makeblastdb 2.9.0+
minimap2: 2.17-r941
nucmer: 3.1
pslCDnaFilter: no way to determine
rmblastn: rmblastn 2.9.0+
samtools: samtools 1.9
stringtie: 1.3.4d
tRNAscan-SE: 2.0 (December 2017)
tbl2asn: unknown, likely 25.3
tblastn: tblastn 2.9.0+
trimal: trimAl v1.4.rev15 build[2013-12-17]
ERROR: ete3 not installed
Checking Environmental Variables...
$FUNANNOTATE_DB=/public/home/stgui/work/funannotateDB
$PASAHOME=/public/home/stgui/.linuxbrew/Cellar/PASApipeline-v2.3.3
$TRINITYHOME=/public/home/stgui/.linuxbrew/Cellar/trinity/2.8.3
$EVM_HOME=/public/home/stgui/.linuxbrew/Cellar/evidencemodeler/0.1.3
$AUGUSTUS_CONFIG_PATH=/public/home/stgui/.linuxbrew/Cellar/augustus/3.3.2/config
$GENEMARK_PATH=/public/home/stgui/.linuxbrew/Cellar/gm_et_linux_64/gmes_petap
$BAMTOOLS_PATH=/public/home/stgui/.linuxbrew/Cellar/bamtools/2.5.1/bin
All 7 environmental variables are set


I was able to generate a `transcripts.minimap2.bam` by running minimap2 manually with the command below:

/public/home/stgui/.linuxbrew/Cellar/funannotate/util/sam2bam.sh "minimap2 -ax splice -t 36 --split-prefix ./tmp_split_prefixt -c -u b -G 60000 /public/home/stgui/work/PANZ_funannotate/PANZ_funannotate_predict/predict_misc/genome.softmasked.fa /public/home/stgui/work/PANZ_funannotate/PANZ_funannotate_predict/predict_misc/transcripts.combined.fa" 36 ./transcripts.minimap2.bam 1> logs.txt 2>&1


which finished correctly.

So I was wondering whether it is possible to generate `transcript_alignments.gff3` from the BAM file I generated and pass that GFF file to `funannotate predict` to skip the minimap2 alignment step?

And is it possible to run `funannotate predict` using previously generated misc files (to skip time-consuming steps such as `parsing soft-masked repetitive sequences`)?

Thank you,

Best wishes,

Songtao Gui
nextgenusfs commented 5 years ago

Haven't seen this one before. Can you try to figure out why minimap2 died? i.e. this is what it is running:

minimap2 -ax splice -t 90 --cs -u b -G 60000 \
    PANZ_funannotate_predict/predict_misc/genome.softmasked.fa \
    PANZ_funannotate_predict/predict_misc/transcripts.combined.fa \
    | samtools sort -@4 -o transcript_alignments.bam - 

So is it dying because of the 90 threads? Or are you running out of memory with the index because of the large assembly? Seems like perhaps the latter, since you manually ran --split-prefix. What did you run for the indexing step then?
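One possible reading of the warning in the log: minimap2 builds its index in batches (the `-I` option), and a reference larger than one batch yields a multi-part index, for which SAM output has no `@SQ` header unless `--split-prefix` is given. The sketch below assumes the default batch size is 4G target bases, which should be checked against `minimap2 --help` for the installed version. The helper names are hypothetical, and the only real figure used is the 5,385,252,087 bp genome size from the log.

```python
import io

# Assumption: minimap2's default index batch size (-I) is 4G target bases.
DEFAULT_BATCH_BASES = 4_000_000_000


def total_bases(fasta_handle):
    """Sum sequence lengths in a FASTA stream, skipping header lines."""
    total = 0
    for line in fasta_handle:
        if not line.startswith(">"):
            total += len(line.strip())
    return total


def is_multipart(n_bases, batch_bases=DEFAULT_BATCH_BASES):
    """True if the reference is larger than a single index batch."""
    return n_bases > batch_bases


# Tiny in-memory FASTA standing in for genome.softmasked.fa.
demo = io.StringIO(">ctg1\nACGTAC\nGT\n>ctg2\nNNNN\n")
print(total_bases(demo))            # 12
print(is_multipart(5_385_252_087))  # True (genome size reported in the log)
```

If this reading is right, it also fits the log line showing the index loaded for 2,251,155 target sequences out of the 2,951,918 scaffolds in the assembly.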

You can pass GFF3 transcript alignments to the --transcript_alignments option. To convert a minimap2 BAM file to GFF3 (note that you must run minimap2 with the --cs flag), you can use the funannotate util bam2gff3 script.
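As a rough sketch of the manual alignment step, the helper below assembles the same minimap2 call that funannotate logged, optionally adding `--split-prefix`. It always includes `--cs`; note that the manual command earlier in the thread used `-c` instead, so a BAM produced that way may need to be regenerated before conversion. The function names and file paths are hypothetical, and the code only builds the command lines without running them.

```python
import shlex


def minimap2_splice_cmd(genome, transcripts, threads=4, max_intron=60000,
                        split_prefix=None):
    """Build a minimap2 spliced-alignment command as an argument list."""
    cmd = ["minimap2", "-ax", "splice", "-t", str(threads), "--cs",
           "-u", "b", "-G", str(max_intron)]
    if split_prefix:
        # Needed for multi-part indexes so that SAM output carries @SQ lines.
        cmd += ["--split-prefix", split_prefix]
    cmd += [genome, transcripts]
    return cmd


def samtools_sort_cmd(out_bam, threads=4):
    """Build the samtools sort command that reads SAM from stdin."""
    return ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]


def as_pipeline(*cmds):
    """Render argument lists as a shell pipeline string for inspection."""
    return " | ".join(" ".join(shlex.quote(a) for a in c) for c in cmds)


align = minimap2_splice_cmd("genome.softmasked.fa", "transcripts.combined.fa",
                            threads=36, split_prefix="./tmp_split_prefix")
print(as_pipeline(align, samtools_sort_cmd("transcripts.minimap2.bam")))
```

Building the command as a list keeps it easy to hand to `subprocess` later without shell quoting problems.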

Funannotate will re-use any existing data if you point it at the same output folder. I don't think it will re-use the repeat identification step, but I can look into that. It is multi-threaded.

You might want to consider cleaning your assembly prior to running annotation, ~3 million contigs is kind of crazy. I understand maize is a large genome, but thought it was more like 2.5GB, here you have 5.3GB. You likely won't get any gene annotation in contigs less than 10 kb in size. One of the reasons the repeat detection is taking so long is the 3 million contigs....
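The length-based cleanup suggested above could look something like the following minimal sketch. It is a stand-alone illustration rather than funannotate's own cleaning code; the function names are hypothetical and the 10 kb cutoff is simply the figure mentioned in this comment.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_long_contigs(handle, min_len=10_000):
    """Return the records whose sequence is at least min_len bases long."""
    return [(h, s) for h, s in read_fasta(handle) if len(s) >= min_len]


# Tiny demo with a lowered cutoff so the example stays readable.
demo = io.StringIO(">short\nACGT\n>long\n" + "A" * 12 + "\n")
print([h for h, _ in keep_long_contigs(demo, min_len=10)])  # ['long']
```

For an assembly with millions of scaffolds, writing the kept records straight to a new file instead of collecting them in a list would keep memory use down.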

songtaogui commented 5 years ago

@nextgenusfs

Thank you for your help.

I am now trying to clean my inputs and rerun it with a larger memory.

I noticed that there are two GFF files for transcript alignments in the predict_misc dir: transcript_minimap2.gff3 and transcript_alignments.gff3. Is there any processing between converting transcripts.minimap2.bam and producing transcript_alignments.gff3?

> You might want to consider cleaning your assembly prior to running annotation, ~3 million contigs is kind of crazy. I understand maize is a large genome, but thought it was more like 2.5GB, here you have 5.3GB.

What I am trying to annotate are a bunch of non-reference sequences, that's why they are so fragmentary. And yes, I was planning to filter out short sequences and repeat-rich sequences prior to annotation.

> You likely won't get any gene annotation in contigs less than 10 kb in size.

Do you have any suggestions on appropriate options for annotating short contigs? A large portion of my input sequences are shorter than 10 kb.

Thank you again for your kind help.

Best wishes,

Songtao Gui

nextgenusfs commented 5 years ago

Depending on the settings you run it with, there may/may not be a difference between those two GFF3 files -- one is minimap2 alignments, however you can also have it run gmap/blat alignments as well. The transcript_alignments.gff3 is the combined results that are eventually passed to EvidenceModeler.

What is the average maize gene length? Basically, the gene predictors need some context to predict genes. If you are using pre-trained Augustus parameters, you could literally just run Augustus on these contigs, but the results aren't likely to be very informative. The goal of funannotate is to generate NCBI submission-ready annotated genomes, i.e. you should be feeding it a cleaned-up, ready-to-publish genome assembly as input. It isn't designed to annotate short contigs/fragments such as a metagenome. What is your goal in trying to annotate short repetitive contigs?