samtools / htslib

C library for high-throughput sequencing data formats
Other
799 stars 445 forks source link

return value of bam_get_seq() when sequence not stored in the bam file #1531

Closed lmjakt closed 1 year ago

lmjakt commented 1 year ago

Apologies if this is covered elsewhere, but I've not been able to find anything at this time.

According to the sam / bam standard, a bam file may or may not contain the query sequence. What does:

bam_get_seq(b)

return when the sequence is not stored in the bam file?

Related to this, what value does bam1_core_t.l_qseq hold when the sequence begins or ends with hard (H) clipping operations? Is it the length of the clipped sequence or the full sequence, and if the latter is it then necessary to parse the cigar string in order to determine the length of the sequence present in the bam file?

thanks,

Martin

jkbonfield commented 1 year ago

I think bam_get_seq will return the memory address holding the sequence, but the sequence will be zero length so you shouldn't dereference it.

As for hard clips, they're data which is totally absent, so l_qseq is unaffected by the presence or size of hard clipping.

In other words, l_qseq is the size of the sequence listed in the BAM structure, which is also the same as the length in SAM (except for "*" which is represents a zero byte string, but is obviously 1 char in SAM).

lmjakt commented 1 year ago

Thanks for the very quick reply.

If I understand you correctly that means that the only way to get the original read length of an entry is to parse the cigar string. Not too much of a problem as it should be the sum of the lengths of the cigar operations for which

(bam_cigar_type(cigar[i]) & 2) == 0

cheers,

Martin

jkbonfield commented 1 year ago

Yes, but there are bam_cigar2qlen and bam_cigar2rlen functions to do this for you. The code is more or less as you state above (noting &1 vs &2):

hts_pos_t bam_cigar2qlen(int n_cigar, const uint32_t *cigar)
{
    int k;
    hts_pos_t l;
    for (k = l = 0; k < n_cigar; ++k)
        if (bam_cigar_type(bam_cigar_op(cigar[k]))&1)
            l += bam_cigar_oplen(cigar[k]);
    return l;
}
lmjakt commented 1 year ago

Hi James,

thanks for the follow-up and the reference to bam_cigar2qlen.

But I might be a little bit confused about what is considered the 'query' and what hard clipping does. As I understand it, if I have a read of 100 bases where the first 10 do not align, but the remaining 90 do, then that can be represented as either:

10S90M
or
10H90M

In the former case, all 100 bases of the query sequence will be given in the SEQ field of the bam file whereas in the latter case only the 90 matching ones will be reported. The choice of which way to report it is made by the aligner.

In the first case bam_cigar2qlen() will return 100, but in the second case it will return 90 (as an H operation does not consume the query sequence). That's why I suggested the:

(bam_cigar_type(cigar[i]) & 2) == 0

inequality, which should return 100 in both cases (which seems more reasonable to me, as the query is the same).

Am I missing something in my reasoning?

cheers,

Martin

jkbonfield commented 1 year ago

No, you're not missing anything. However I don't think that logic will work as some both consume query and reference (eg match operators), and more than one consumes neither (although hard-clip is the most common, very occasionally multiple-alignment tools will also add BAM_CPAD ops). You'd be best to roll your own I think using bam_cigar_type(cigar[i]) & 1) || bam_cigar_op == BAM_CHARD_CLIP.

lmjakt commented 1 year ago

Thanks for the clarifier; and pointing out the error in my conditional (kind of obvious, but easy to miss when rushing through too many things). Not too clear in my head right now (think I'm catching something), but am thinking XOR 2 might also work. Will think about it later.

cheers,

Martin