jguhlin / minimap2-rs

Rust bindings to minimap2 library
Other
63 stars 13 forks source link

loss of S values in cigar string #31

Closed scfurl closed 1 year ago

scfurl commented 1 year ago

Thanks again @jguhlin for making minimap2-rs. I think I may have found a problem with the cigar strings created by minimap2-rs.

Here is a molecule sequenced using Pacbio and aligned using minimap2 from the command line as follows:

minimap2 -a -L -cs --MD -x map-hifi align.fasta test.fastq.gz

molecule/87     0       DRB1|04:01:01:01        1       60      95S104M1D36M1I77M1D97M1D454M1D29M307S   *       0       0       TTTCTTATATGGGGGGGGGGTGGAGGGGGGCCATAGTTCTCCCTGATTGAGGACTTGCCTGCTGCTGTGACCACTGGTCTGTCCTCTTCTCCAGCATGGTGTGTCTGAAGCTCCCTGGAGGCTCCTGTATGGCAGCGCTGACAGTGACATTGACGGTGCTGAGCTCCCCACTGGCTTTGGCTGGGGACACCCAACCACGTTCTTGGAGCAGGCTAAGTGTGAGTGTCATTTCCTCAAATGGGACGGAGCGAGTGTGGAACCTGATCAGATACATCTATAACCAAGAGGAGTACGCGCGCTACAACAGTGACCTGGGGAGTACCAGGCGGTGACGGAGCTGGGGCGGCCTGACGCTGAGTACTGGAACAGCCAGAAGGACCTCCTGGAGCGGAGGCGGGCCGAGGTGGACACTACTGCAGATACAACTACGGGGTTGTGGAGAGCTTCACAGTGCAGCGGCGAGTCCAACCTAAGGTGACTGTGTATCCTTCAAAGACCCAGCCCCTGCAGCACCACAACCTCCTGGTCTGCTCTGTGAATGGTTTCTATCCAGGCAGCATTGAAGTCAGGTGGTTCCGGAACGGCCAGGAAGAGAAGGCTGGGGTGGTGTCCACAGGCCTGATCCAGAATGGAGACTGGACCTTCCAGACCCTGGTGATGCTGGAAACAGTTCCTCGGAGTGGAGAGGTTTACACCTGCCAAGTGGAGCATCCAAGCATGATGAGCCCTCTCACAGTGCAATGGAGTGCACGGTCTGAATCTGCACAGAGCAAGATGCTGAGTGGAGTCGGGGGCTTTGTGCTGGGCCTGCTCTTCCTTGGGACAGGGCTGTTCATCTACTTCAGGAATCAGAA
AGGACACTCTGACTTCAGCCAACAGGACTCTTGAGCTGAAGTGCAGATGACCACATTCAAGGAAGAACCTTCTGCCCAGCTTTGCAAGATGAAAAGCTTTCCCACTTGGCTCTTATTCTTCCACAAGAGCTTGTCAGGACCAGGTTGTTACTGGTTCAGCAACTCTGCAGAAAATGTCCTCCCTTGTGGCTTCCTTAGCTCCTGTTCTTGGCCTGAAGCCTCACAGCTTTGATGGCAGTGCCTCATCTTCAACTTTTGTGCTTCCCTTTACCTAAACTGTCCTGCCTCCCGTGCATCTGTACTCCCCTTGTGCCACACATTGCATATTAAATGTTTCTCAAACATT

When using an minimap2-rs aligner created like this:

    let aligner = Aligner::builder()
        .map_hifi()
        .with_threads(1)
        .with_cigar()
        .with_index("data/align.fasta", None)
        .expect("Unable to build index");

and then mapping a record sequence like this:

        let mapping = aligner
            .map(&record.seq().as_bytes(), false, false, None, None)
            .expect("Unable to align");
        eprintln!("{:?}", &mapping);

I get this:

Mapping { query_name: None, query_len: Some(1200), query_start: 95, query_end: 893, strand: Forward, target_name: Some("DRB1|04:01:01:01"), target_len: 801, target_start: 0, target_end: 801, match_len: 748, block_len: 802, mapq: 60, is_primary: true, alignment: Some(Alignment { nm: 54, cigar: Some([(104, 0), (1, 2), (36, 0), (1, 1), (77, 0), (1, 2), (97, 0), (1, 2), (454, 0), (1, 2), (29, 0)]), cigar_str: `Some("104M1D36M1I77M1D97M1D454M1D29M"),` md: None, cs: None }) }

`

You can see that the cigar doesn't contain the 95S in the front and the 307S in the back. This becomes an issue when trying to create a bam file from the output because the cigar length does not match the seq length.

Thanks for taking a look!

jguhlin commented 1 year ago

Working on a new version to go with the new minimap release, will try to get this fixed.

jguhlin commented 1 year ago

This is fixed, but more hardcoded than I'd like it to be. Going to push it out; just be aware the softclipping tags will always be there now. I can look into a better solution if you need it. Will come out in the next release, hopefully soon!

jguhlin commented 1 year ago

Should be working now. Let me know if you get a chance to test it!