jguhlin / minimap2-rs

Rust bindings to minimap2 library
Other
63 stars 13 forks source link

Bug(s?): Soft clipping in cigar outputs #51

Open holtjma opened 7 months ago

holtjma commented 7 months ago

Hello,

I came across an issue with soft clipping results in the output. I think at least one of these is probably a bug, the other I'm not sure if I'm just missing an option to enable it. For context, I'm using minimap2 = "0.1.16" and mapping extracted reference sequences into a read (I can provide more details around how I'm calling .map() if needed).

Here are some example output Mappings where I saw the issue:

Mapping { query_name: None, query_len: Some(4000), query_start: 0, query_end: 2613, strand: Forward, target_name: Some("N/A"), target_len: 12338, target_start: 9728, target_end: 12338, match_len: 2604, block_len: 2613, mapq: 60, is_primary: true, alignment: Some(Alignment { nm: 9, cigar: Some([(1270, 0), (1, 1), (91, 0), (1, 1), (871, 0), (1, 1), (378, 0)]), cigar_str: Some("1270M1I91M1I871M1I378MS1387"), md: Some("186G817G13A2C17G6T1563"), cs: Some(":186*ga:817*ga:13*ag:2*ct:17*ga:6*tc:223+t:91+g:871+c:378") }) }
Mapping { query_name: None, query_len: Some(4290), query_start: 2407, query_end: 4288, strand: Forward, target_name: Some("N/A"), target_len: 11711, target_start: 0, target_end: 1903, match_len: 1762, block_len: 1925, mapq: 60, is_primary: true, alignment: Some(Alignment { nm: 163, cigar: Some([(31, 0), (1, 2), (42, 0), (1, 1), (279, 0), (1, 1), (163, 0), (5, 2), (36, 0), (2, 1), (13, 0), (21, 2), (55, 0), (9, 2), (6, 0), (6, 2), (48, 0), (5, 1), (13, 0), (8, 1), (77, 0), (1, 1), (11, 0), (1, 1), (275, 0), (1, 1), (39, 0), (1, 2), (528, 0), (1, 1), (89, 0), (1, 1), (67, 0), (1, 2), (87, 0)]), cigar_str: Some("2407S31M1D42M1I279M1I163M5D36M2I13M21D55M9D6M6D48M5I13M8I77M1I11M1I275M1I39M1D528M1I89M1I67M1D87MS2"), md: Some("31^G192G47A14T14C4G11A2A8A3A17A0T5C1C0T17G2T1T10T18T0G2T4C8T2G7A3C2T29G5G6G6C6G1C4^CGCCA12T0T0A5C3T13C10^ACCGGATTCCAGCTGGGAAAT2G0C7A4T0T5C3C1A5C6G9A2^CGTCACAAG3C2^CCTCGT2C2T0G3A8T0A14T6G26T5C6G6C2T2C1A3C2A9T0C0A0C9T10T1T17T25A7T4G0T7A16G12T9G130C85^C10C421C5T6A39G15C144G5T24C6^G39C30A0C15"), cs: Some(":31-g:42+c:150*gc:47*ag:14*tc:14*cg:4*gt:11*ac:2*ag:8*ac:3*ag:17+g*ag*tc:5*ca:1*ca*ta:17*ga:2*tc:1*tg:10*tg:18*tc*gt:2*tc:4*ca:8*tc:2*gc:7*ag:3*ct:2*tg:29*ga:5*gc:6*ga:6*ct:6*gc:1*ct:4-cgcca:12*tc*tc*ag:5*ct:3*tc:11+ca:2*cg:10-accggattccagctgggaaat:2*ga*ct:7*ag:4*tc*tc:5*ct:3*cg:1*ag:5*cg:6*ga:9*ac:2-cgtcacaag:3*ca:2-cctcgt:2*ca:2*tg*ga:3*ag:8*tc*ac:14*tc:6*ga:5+ctcag:13+ctttatct:8*tc:5*cg:6*ga:6*cg:2*tg:2*cg:1*at:3*ct:2*ag:9*tc*ct*ag*ct:9*tg:10+c*tg:1*tg:8+c:9*tc:25*ag:7*tc:4*gt*ta:7*ag:16*gc:12*tc:9*gt:130*ca:46+g:39-c:10*cg:421*ct:5*tc:6*ac:39*ga:15*ca:26+g:89+a:29*ga:5*tc:24*ct:6-g:39*ct:30*ag*cg:15") }) }

The key values of the cigar string and cigar Vec, I have extracted below:

cigar: Some([(1270, 0), (1, 1), (91, 0), (1, 1), (871, 0), (1, 1), (378, 0)]), 
cigar_str: Some("1270M1I91M1I871M1I378MS1387")
...
cigar: Some([(31, 0), (1, 2), (42, 0), (1, 1), (279, 0), (1, 1), (163, 0), (5, 2), (36, 0), (2, 1), (13, 0), (21, 2), (55, 0), (9, 2), (6, 0), (6, 2), (48, 0), (5, 1), (13, 0), (8, 1), (77, 0), (1, 1), (11, 0), (1, 1), (275, 0), (1, 1), (39, 0), (1, 2), (528, 0), (1, 1), (89, 0), (1, 1), (67, 0), (1, 2), (87, 0)]), 
cigar_str: Some("2407S31M1D42M1I279M1I163M5D36M2I13M21D55M9D6M6D48M5I13M8I77M1I11M1I275M1I39M1D528M1I89M1I67M1D87MS2")

There are two issues I noticed in these outputs:

  1. In the first example, the final cigar string sequence is "S1387", but I think it should be "1387S" to match the rest of the cigar. This is the one I think is likely a bug.
  2. In both examples, I don't see the soft-clipping values reflects in the cigar Vec. I'm not sure if this is a bug, or if I'm just missing an option to make them appear.

I found a workaround using the query_start and query_end fields to get these soft-clipping values. Are you able to confirm if these are bugs or if there's something I'm missing in the usage?

Thanks for maintaining this crate! Matt

jguhlin commented 7 months ago

Definitely a bug. I'm just getting over being sick so it'll probably be next week before I can tackle these though!

holtjma commented 7 months ago

Hope you feel better!

jguhlin commented 6 months ago

Almost better. :/ Working on this today, hoping to get it pushed out soon (coincides with the minimap2 version update, so that's actually great!)

jguhlin commented 6 months ago

I believe 1 is fixed. 2 may need an additional option. Minimap doesn't output soft clipping options unless it is in SAM format, and does not output it in PAF. So it's not as automated in the upstream library. Will keep working on it though, but it may need to be an additional option.

jguhlin commented 6 months ago

@holtjma I've added the option with_cigar_clipping for aligner (aligner.with_cigar_clipping()).

I'm keeping it this way as minimap2 does not return the CIGAR with soft clipping except in the string (never seen in PAF, but seen in SAM, for example). Trying to keep the library as close to minimap2 as possible when in default mode. Let me know if that works.

I don't have a good example for soft clipping on the front, but I see minimap2 has the workings for it, so if you come up with something you can share, let me know!

holtjma commented 6 months ago

That sounds good to me!

I'm not sure why minimap2 does it that way, but I definitely understand the desire to keep it as close as possible. Happy to try everything out once it's ready. Thanks again for taking the time to work on this!