lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.82k stars 415 forks source link

Obtaining primary alignment position from secondary alignment? #526

Open joshuak94 opened 4 years ago

joshuak94 commented 4 years ago

Hello!

When it comes to structural variant calling, one characteristic of the presence thereof is a noisier sequence than usual. With these noisy sequences, the secondary alignment can then provide some useful information about the potential structural variant.

From what I can currently see, Minimap2 has the option to output secondary alignments, but doesn't provide any information regarding the position of the primary alignment. One could sort the BAM file by name and then have all of the primary and secondary alignments in order, but this is not ideal as most programs require a sort by coordinate. The other alternative is to search for and store the primary alignment every time a new secondary alignment is encountered, which is also not ideal because this parses the entire BAM file multiple times, in the worst case.

I understand that sequence information is left out of secondary alignments to save space, but is there any way to include the position of the primary alignment? I was thinking in the optional alignment field.

d-cameron commented 4 years ago

Secondary or supplementary alignment? Supplementary alignment (aka split read alignments) locations should use the SA tag. There is no spec-defined equivalent to record the multiple different alternative read alignments (ie secondary alignments) but the IH, HI, and NH tags track their counts, and the CC and CP are defined as the next hit coordinates. The 'next hit' definition was intended to wrap around (or next hit of the last hit should be the first hit) but unfortunately that clarification isn't in the Sam tags spec yet.

Tldr: aligners should use the SA tag if you mean supplementary alignment. CC and CP tags if you really meant secondary alignments.

joshuak94 commented 4 years ago

Hey Daniel!

I do mean secondary alignments. Supplementary alignments are also important for variant calling, but as you mention they use the SA tag. Crucially, for supplementary alignments, all positions of the chimeric alignment are available.

I didn't know about CC and CP, thanks! It seems that minimap2 does not currently produce CC and CP output. While they are not as useful as the SA tag (SA gives a position of all chimeric read alignments and the position of the primary alignment whereas CC and CP only give the next hit), they certainly allow finding the primary alignment in constant time.

So I guess this is now a feature request, to have CC and CP tags output by minimap2, either by default or with some sort of a flag.

lh3 commented 4 years ago

I know this is an old thread. One question: why do you want to know the primary location? With option -Y, minimap2 can optionally output the read sequences for secondary mappings.

ines-batista commented 4 years ago

Secondary or supplementary alignment? Supplementary alignment (aka split read alignments) locations should use the SA tag. There is no spec-defined equivalent to record the multiple different alternative read alignments (ie secondary alignments) but the IH, HI, and NH tags track their counts, and the CC and CP are defined as the next hit coordinates. The 'next hit' definition was intended to wrap around (or next hit of the last hit should be the first hit) but unfortunately that clarification isn't in the Sam tags spec yet.

Tldr: aligners should use the SA tag if you mean supplementary alignment. CC and CP tags if you really meant secondary alignments.

Does minimap2 includes NH tag on the produced SAM files?

I want to use the alignment files in (SAM format) to count the reads with htseq-count and I'm having always zeros in the __alignment_not_unique special counter. From this manual I saw that it searches for NHtags but I didn't find them on my SAM files so I'm not sure if minimap2 includes them.

Thanks in advance!