Closed lmjakt closed 1 year ago
I think bam_get_seq
will return the memory address holding the sequence, but the sequence will be zero length so you shouldn't dereference it.
As for hard clips, they're data which is totally absent, so l_qseq
is unaffected by the presence or size of hard clipping.
In other words, l_qseq
is the size of the sequence listed in the BAM structure, which is also the same as the length in SAM (except for "*" which is represents a zero byte string, but is obviously 1 char in SAM).
Thanks for the very quick reply.
If I understand you correctly that means that the only way to get the original read length of an entry is to parse the cigar string. Not too much of a problem as it should be the sum of the lengths of the cigar operations for which
(bam_cigar_type(cigar[i]) & 2) == 0
cheers,
Martin
Yes, but there are bam_cigar2qlen
and bam_cigar2rlen
functions to do this for you. The code is more or less as you state above (noting &1 vs &2):
hts_pos_t bam_cigar2qlen(int n_cigar, const uint32_t *cigar)
{
int k;
hts_pos_t l;
for (k = l = 0; k < n_cigar; ++k)
if (bam_cigar_type(bam_cigar_op(cigar[k]))&1)
l += bam_cigar_oplen(cigar[k]);
return l;
}
Hi James,
thanks for the follow-up and the reference to bam_cigar2qlen
.
But I might be a little bit confused about what is considered the 'query' and what hard clipping does. As I understand it, if I have a read of 100 bases where the first 10 do not align, but the remaining 90 do, then that can be represented as either:
10S90M
or
10H90M
In the former case, all 100 bases of the query sequence will be given in the SEQ
field of the bam file whereas in the latter case only the 90 matching ones will be reported. The choice of which way to report it is made by the aligner.
In the first case bam_cigar2qlen()
will return 100, but in the second case it will return 90 (as an H
operation does not consume the query sequence). That's why I suggested the:
(bam_cigar_type(cigar[i]) & 2) == 0
inequality, which should return 100 in both cases (which seems more reasonable to me, as the query is the same).
Am I missing something in my reasoning?
cheers,
Martin
No, you're not missing anything. However I don't think that logic will work as some both consume query and reference (eg match operators), and more than one consumes neither (although hard-clip is the most common, very occasionally multiple-alignment tools will also add BAM_CPAD ops). You'd be best to roll your own I think using bam_cigar_type(cigar[i]) & 1) || bam_cigar_op == BAM_CHARD_CLIP
.
Thanks for the clarifier; and pointing out the error in my conditional (kind of obvious, but easy to miss when rushing through too many things). Not too clear in my head right now (think I'm catching something), but am thinking XOR 2 might also work. Will think about it later.
cheers,
Martin
Apologies if this is covered elsewhere, but I've not been able to find anything at this time.
According to the sam / bam standard, a bam file may or may not contain the query sequence. What does:
return when the sequence is not stored in the bam file?
Related to this, what value does
bam1_core_t.l_qseq
hold when the sequence begins or ends with hard (H) clipping operations? Is it the length of the clipped sequence or the full sequence, and if the latter is it then necessary to parse the cigar string in order to determine the length of the sequence present in the bam file?thanks,
Martin