lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2

splice sites in junc-bed file to override default settings #505

Open vkkodali opened 4 years ago

vkkodali commented 4 years ago

I think the information in the junc-bed file can be better utilized by minimap2 in dealing with cases that deviate from the default settings. Two such cases:

  1. When there are non-consensus splice junctions in the junc-bed file, minimap2 should be able to use those instead of introducing small indels to generate the alignment with consensus splice sites.
  2. When there is an intron that is >200kb (the default max for intron length) in the junc-bed file, minimap2 should use that information to generate an alignment with a large intron.
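As a rough way to find the annotated introns that case 2 is about, here is a minimal Python sketch that lists introns longer than the 200 kb default mentioned above. It assumes the junction file is in 12-column BED (BED12) format; the attached file may instead use another layout, in which case the column handling would need to change. The function names and the example record are hypothetical and are not part of minimap2.

```python
MAX_INTRON_DEFAULT = 200_000  # default max intron length cited in this issue


def introns_from_bed12(line):
    """Yield (chrom, intron_start, intron_end) for one BED12 record.

    Coordinates are 0-based, half-open, as in BED.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, tx_start = fields[0], int(fields[1])
    # Columns 11 and 12 hold comma-separated block sizes and block starts.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    for i in range(len(sizes) - 1):
        intron_start = tx_start + starts[i] + sizes[i]
        intron_end = tx_start + starts[i + 1]
        yield chrom, intron_start, intron_end


def long_introns(lines, max_len=MAX_INTRON_DEFAULT):
    """Return (chrom, start, end, length) for introns longer than max_len."""
    found = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        for chrom, start, end in introns_from_bed12(line):
            if end - start > max_len:
                found.append((chrom, start, end, end - start))
    return found


# Hypothetical record: three 100 bp exons, the second intron spans 300,300 bp.
example = [
    "chr2\t1000\t302000\ttx1\t0\t+\t1000\t302000\t0\t3\t100,100,100\t0,500,300900\n",
]
print(long_introns(example))
```

To run this on a real gzipped file, the lines could be read with `gzip.open(path, "rt")` and passed to `long_introns` directly.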

Two specific examples demonstrate this. The splice junctions file and the query FASTA file are attached.
The chromosome sequence can be downloaded from the NCBI FTP site as shown below:

curl -O "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/chr2.fna.gz"

minimap2 was executed as follows:

~/bin/minimap2 -ax splice -C 5 --eqx --MD --cs --junc-bed splice_junctions.bed.gz chr2.fna.gz query.fa.gz > aligns.sam 

The query gnl|SRA|SRR1803611.121425.1 is expected to align to the subject with non-consensus splice sites, which are present in the splice_junctions.bed file. However, minimap2 aligns this query with consensus splice sites by introducing a 3 nt deletion.

The query gnl|SRA|SRR1803617.262344.1 is expected to align to the subject with an intron >200kb, which is also present in the splice_junctions.bed file. However, minimap2 aligns this query with a 570 nt unaligned tail.
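Both symptoms can be read off the CIGAR string of each record in aligns.sam. The following Python sketch summarizes intron (N), deletion (D), and clipping (S/H) operations for one CIGAR string. The function name and the example CIGAR are hypothetical, chosen only to mimic the two behaviours described here; they are not taken from the actual alignments.

```python
import re

# One CIGAR operation: a length followed by an op code from the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def summarize_cigar(cigar):
    """Return intron lengths, deletion lengths, and end clipping for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    introns = [n for n, op in ops if op == "N"]
    deletions = [n for n, op in ops if op == "D"]
    # Clipping can only occur at the ends of the CIGAR string.
    clip_left = ops[0][0] if ops and ops[0][1] in "SH" else 0
    clip_right = ops[-1][0] if len(ops) > 1 and ops[-1][1] in "SH" else 0
    return {
        "introns": introns,
        "deletions": deletions,
        "clip_left": clip_left,
        "clip_right": clip_right,
    }


# Hypothetical CIGAR (with --eqx, matches are written as '='):
# a 3 nt deletion, one 1500 nt intron, and a 570 nt soft-clipped tail.
print(summarize_cigar("120=3D80=1500N200=570S"))
```

In a SAM record the CIGAR string is the sixth tab-separated column, so each non-header line of aligns.sam can be split on tabs and its sixth field passed to `summarize_cigar`.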

lh3 commented 4 years ago

Thanks. Very good suggestions. I will consider this.

vkkodali commented 4 years ago

Much appreciated. I'd be happy to provide additional examples, and help with review/testing if you need.