infer_query_length ignores soft-clipped (S) bases

Manuel-DominguezCBG commented 8 months ago

Dear pysam developers,

Why infer_query_length method doesn't consider S bases as part of the length of the read?

I am developing a soft-clipping Python tool with Pysam due to this limitation found in Ampliconclip https://github.com/samtools/samtools/issues/1948

and I am developing some tests. In my custom soft-clipping tool, reads with a S (from the alignment tool) should be unmapped. The last assertion checks the length of the read

def test_soft_clip_from_bwa():
    """ This test includes an 'S' in the CIGAR string"""
    counter = get_new_counter()
    read = MockRead(100, [(4, 10), (0, 80)], '10S80M')
    primer_F_end = 110
    primer_R_start = 190
    softclip_positive_read1(read, primer_F_end, primer_R_start, counter)
    assert read.reference_id == -1, "Reference ID should be -1 for unmapped reads"
    assert read.reference_start == -1, "Reference start should be -1 for unmapped reads"
    assert read.mapping_quality == 0, "Mapping quality should be 0 for unmapped reads"
    assert read.cigarstring == "*", "CIGAR string should be '*' for unmapped reads"
    assert read.next_reference_id == -1, "Next reference ID should be -1 for unmapped reads"
    assert read.next_reference_start == -1, "Next reference start should be -1 for unmapped reads"
    assert read.template_length == 0, "Template length should be 0 for unmapped reads"
    assert read.flag & 4, "Unmapped flag should be set in the read's flag"
    assert counter['Anmapped due S, X or H'] == 1, "Counter for unmapped reads due to insertion in primer region should be incremented"
    expected_length = read.infer_query_length() + 10 # infer_query_length should include S in the length of the read ???
    actual_length = sum(length for op, length in read.cigartuples if op in [0, 1, 2, 4, 7, 8])  # Summing lengths for M, I, D, S, =, X
    assert expected_length == actual_length, f"Length mismatch: Expected {expected_length}, Actual {actual_length}"

and I have noticed that infer_query_length() doesn't sum S bases. Why?? Is this a bug??

I have added that +10 otherwise the test fails with E assert 80 == 90

jmarshall commented 8 months ago

As you can see from the source code (pysam/libcalignedsegment.pyx's infer_query_length() and calculateQueryLengthWithoutHardClipping()) and the tests (test_infer_query_length()), in fact infer_query_length() does count S CIGAR operations. It counts exactly the operations listed as “consum[ing] query” in the table in §1.4 of the SAM spec, as you would expect.

You don't show us your MockRead() or softclip_positive_read1() functions, but presumably one of them is not setting up the CIGAR-related fields of read in the way you expect.

Manuel-DominguezCBG commented 8 months ago

Uhh you may catch an error in my code (more likely and I have for sure less experience than any of you).

I show the whole script tests

import pytest
from softclip import softclip_positive_read1

# Mock the pysam.AlignedSegment object
class MockRead:
    def __init__(self, reference_start, cigartuples, cigarstring):
        self.reference_start = reference_start
        self.cigartuples = cigartuples
        self.cigarstring = cigarstring
        # Initialize additional attributes as needed
        self.reference_id = None
        self.mapping_quality = None
        self.next_reference_id = None
        self.next_reference_start = None
        self.template_length = None
        self.flag = 0  # Initialize flag

    def infer_query_length(self):
        # This function calculates the total length of the query sequence in the read,
        # considering only certain types of CIGAR operations.

        # It iterates over the CIGAR tuples of the read. Each tuple consists of an operation code (op) and a length.
        # The CIGAR operations are represented by their codes, such as 0 for 'M', 4 for 'S', etc.

        # The function sums the lengths of all operations except for soft clipping (S, op code 4) and hard clipping (H, op code 5).
        # Soft and hard clipping operations indicate portions of the read that are not aligned to the reference.
        # Therefore, these are excluded from the total length of the aligned part of the query sequence.

        # The result is the total aligned length of the query sequence.
        return sum(length for op, length in self.cigartuples if op != 4 and op != 5)

    def update_cigarstring(self):
        # This function updates the CIGAR string of the read based on its CIGAR tuples.

        # A dictionary maps CIGAR operation codes to their string representations:
        # M (code 0), I (code 1), D (code 2), N (code 3), S (code 4), H (code 5), P (code 6), '=' (code 7), X (code 8).
        cigar_ops = {0: 'M', 1: 'I', 2: 'D', 3: 'N', 4: 'S', 5: 'H', 6: 'P', 7: '=', 8: 'X'}

        # The CIGAR string is constructed by iterating over the CIGAR tuples.
        # For each tuple, the length and corresponding string representation of the operation are concatenated.
        # This results in the reconstructed CIGAR string that reflects the current state of the CIGAR tuples.

        # The CIGAR string is then assigned to the cigarstring attribute of the MockRead object.
        self.cigarstring = ''.join([f"{length}{cigar_ops[op]}" for op, length in self.cigartuples])

def get_new_counter():
    return {
        'Total N of reads':0,
        'Number of unmapped reads':0,
        'Reads in chr outside our ROI':0,
        'TOTAL reverse reads' :0,
        'Number of selected reverse reads':0,
        'TOTAL forwards reads' :0,
        'Number of selected forward reads':0,
        'Anmapped due S, X or H': 0,
        'I or D in F + primer': 0,
        'I or D in F - primer': 0,
        'Number of + Reads soft-clipped at first primer': 0,
        'Number of + Reads soft-clipped at second primer': 0,    
    }
# ... some tests here
# and test I shown in the description of the issue

def test_soft_clip_from_bwa():
    """ This test includes an 'S' in the CIGAR string"""
    counter = get_new_counter()
    read = MockRead(100, [(4, 10), (0, 80)], '10S80M')
    primer_F_end = 110
    primer_R_start = 190
    softclip_positive_read1(read, primer_F_end, primer_R_start, counter)
    assert read.reference_id == -1, "Reference ID should be -1 for unmapped reads"
    assert read.reference_start == -1, "Reference start should be -1 for unmapped reads"
    assert read.mapping_quality == 0, "Mapping quality should be 0 for unmapped reads"
    assert read.cigarstring == "*", "CIGAR string should be '*' for unmapped reads"
    assert read.next_reference_id == -1, "Next reference ID should be -1 for unmapped reads"
    assert read.next_reference_start == -1, "Next reference start should be -1 for unmapped reads"
    assert read.template_length == 0, "Template length should be 0 for unmapped reads"
    assert read.flag & 4, "Unmapped flag should be set in the read's flag"
    assert counter['Anmapped due S, X or H'] == 1, "Counter for unmapped reads due to insertion in primer region should be incremented"
    expected_length = read.infer_query_length() + 10 # infer_query_length should include S in the length of the read ??? https://github.com/pysam-developers/pysam/issues/1252
    actual_length = sum(length for op, length in read.cigartuples if op in [0, 1, 2, 4, 7, 8])  # Summing lengths for M, I, D, S, =, X
    assert expected_length == actual_length, f"Length mismatch: Expected {expected_length}, Actual {actual_length}"

and this is the function (not finished yet)

def softclip_positive_read1(read, primer_F_end, primer_R_start, counter):
    """
    Modify the CIGAR string of a forward read to check for unwanted CIGAR operations and indels affecting primer boundary.
    Unmap the read if it contains certain CIGAR operations or if indels affect the boundary.
    Always apply soft-clipping at the start of the read based on the primer length.

    Args:
    read (pysam.AlignedSegment): The read to be modified.
    primer_F_end (int): The position relative to the start of the read where the primer ends.
    counter (dict): A dictionary for counting different types of unmapped reads.

    Things to know
    - op stands for a CIGAR operation code.
        M (code 0): Alignment match (can be a sequence match or mismatch).
        I (code 1): Insertion to the reference.
        D (code 2): Deletion from the reference.
        N (code 3): Skipped region from the reference.
        S (code 4): Soft clipping (clipped sequences present in SEQ).
        H (code 5): Hard clipping (clipped sequences NOT present in SEQ).
        P (code 6): Padding (silent deletion from padded reference).
        = (code 7): Sequence match.
        X (code 8): Sequence mismatch.

    Returns:
    dict: Updated counter dictionary.
    """
    current_position = read.reference_start # Start position first primer e.g. 100
    primer_F_end_position =  primer_F_end  # End position of the first primer e.g. 110
    primer_R_start_position = primer_R_start  # Start position of the second primer e.g. 190
    read_length = read.infer_query_length()  # Total length of the read e.g. 100
    primer_R_end_position = current_position + read_length # e.g. 200
    softclip_start_length = max(primer_F_end_position - current_position, 0) # e.g. 10

    # 1) Unmap the reads due to unwanted CIGAR operations (S, X or H)
    has_unwanted_cigar_ops = any(op in [4, 5, 8] for op, _ in read.cigartuples)
    if has_unwanted_cigar_ops:

        unmap_read(read)
        counter['Anmapped due S, X or H'] += 1
        return counter

    # 2) Unmap reads if I or D found in primer regions
    for op, length in read.cigartuples:
        # Check if an 'I' or 'D' starts within the first primer region
        if op in [1, 2] and current_position < primer_F_end_position:
            unmap_read(read)
            counter['I or D in F + primer'] += 1
            return counter

        # Check if an 'I' starts within the second primer region
        if op == 1 and current_position >= primer_R_start_position:
            unmap_read(read)
            counter['I or D in F - primer'] += 1
            return counter

        # Check if a 'D' spans the second primer region
        if op == 2 and current_position < primer_R_start_position < current_position + length:
            unmap_read(read)
            counter['I or D in F - primer'] += 1
            return counter

        # Update current_position for 'M', 'D', 'N', '=', 'X' operations
        if op in [0, 2, 3, 7, 8]:
            current_position += length

    # 3) Soft-clip the firts primers
    if softclip_start_length > 0:
        counter['Number of + Reads soft-clipped at first primer'] += 1

    # Initialize a new CIGAR list
    new_cigar = []
    #print("Original", read.cigarstring)

    # Length we still need to adjust for soft-clipping at the start
    length_to_adjust_start = softclip_start_length

    # Following the example added above in comments
    # First primer 100 to 110
    # Second primer 190 to 200
    # In a 100 bases read
    # The CIGAR string is 100M
    # and the CIGAR tuple [(0, 100)] This is 0 for M and 100 for legth

    # Iterate through the CIGAR tuples of the read
    for op, length in read.cigartuples:
        # If we still need to adjust the start and the operation is a match (M)
        if length_to_adjust_start > 0 and op == 0:
            # If the operation's length is larger than what needs to be adjusted
            if length > length_to_adjust_start:
                # Adjust the operation's length and add it to the new CIGAR
                new_cigar.append((op, length - length_to_adjust_start))
                # Reset the length to adjust as we've accounted for all soft-clipping
                length_to_adjust_start = 0
            else:
                # Reduce the length to adjust
                length_to_adjust_start -= length
                # Skip adding this operation if it's fully consumed by soft-clipping
                if length_to_adjust_start <= 0:
                    continue
        else:
            # Add the operation to the new CIGAR as is
            new_cigar.append((op, length))

    # Prepend the soft-clipping operation if necessary
    if softclip_start_length > 0:
        new_cigar.insert(0, (4, softclip_start_length))

    # Update the read's CIGAR
    read.cigartuples = new_cigar
    #print("New ", read.cigarstring)

    # 4) Soft-clip the second primer (if neccesary)
    softclip_end_length = max(primer_R_end_position - current_position - read_length, 0)

    if softclip_end_length > 0:
        counter['Number of + Reads soft-clipped at second primer'] += 1

    # Initialize a new list for the updated CIGAR after processing both primers
    updated_cigar = []

    # Length we still need to adjust for soft-clipping at the end
    length_to_adjust_end = softclip_end_length

    # Iterate through the new CIGAR tuples (after processing the first primer)
    for op, length in new_cigar[::-1]:  # Reverse iteration for end soft-clipping
        # If we still need to adjust the end and the operation is a match (M)
        if length_to_adjust_end > 0 and op == 0:
            # If the operation's length is larger than what needs to be adjusted
            if length > length_to_adjust_end:
                # Adjust the operation's length and add it to the updated CIGAR
                updated_cigar.insert(0, (op, length - length_to_adjust_end))
                # Reset the length to adjust as we've accounted for all soft-clipping
                length_to_adjust_end = 0
            else:
                # Reduce the length to adjust
                length_to_adjust_end -= length
                # Skip adding this operation if it's fully consumed by soft-clipping
                if length_to_adjust_end <= 0:
                    continue
        else:
            # Add the operation to the updated CIGAR as is
            updated_cigar.insert(0, (op, length))

    # Prepend the soft-clipping operation for the end if necessary
    if softclip_end_length > 0:
        updated_cigar.append((4, softclip_end_length))

    # Update the read's CIGAR only if soft clipping was applied
    if softclip_start_length > 0 or softclip_end_length > 0:
        read.cigartuples = updated_cigar
    #print("Updated ", read.cigarstring)
    return counter

jmarshall commented 8 months ago

I wasn't asking to see the code — questions involving unknown problems in large amounts of code would be better posted as questions on https://biostars.org or https://bioinformatics.stackexchange.com.

The function being called is your MockRead.infer_query_length(). It handles several CIGAR operations including S differently from how pysam.AlignedSegment.infer_query_length() does.

pysam-developers / pysam

infer_query_length ignores soft-clipped (S) bases #1252