samtools / htslib

C library for high-throughput sequencing data formats
Other
799 stars 445 forks source link

Behaviour of `bam_mods_query_type()` / interface to `hts_base_mod_state` #1550

Closed cjw85 closed 1 year ago

cjw85 commented 1 year ago

The function bam_mods_query_type can be used to ask for the presence of a MM subtag based on the mod base code. It returns the additional information of such a tag if found. However, in the case that say both C+m and C-m are present, the caller will only ever receive knowledge of the first listed tag. So further, querying m and receiving back the details for C+m should not be taken to imply the non-existence of the C-m tag.

I've found it useful to be able to know whether a specific tag exists querying by the additional fields of the hts_base_mod_state type. The type is currently documented as being opaque (and indeed it isn't listed in sam.h) -- I'm wondering if it should be more public (there is already the function bam_mods_recorded which serves as a public interface to ->type). Alternatively the interface to bam_mods_query_type could be amended to take strand (and potentially implicit and canonical).

jkbonfield commented 1 year ago

The state is deliberately opaque as it gives us flexibility to change it without breaking the ABI. Things like the definition of MAX_BASE_MOD are completely arbitrary, and may change, plus the implementation may not be optimal.

Do you have a set of suggested new APIs?

We already have test/test_mod.c which has an extended (-x) option that calls bam_mods_query_type.

Using this trivial example:

r1  0   *   0   0   *   *   0   0   TACGCGAT    *   Mm:Z:C+m,1;C-m,1;   Ml:B:C,90,100

I see that test_mod a.sam gives output while test_mod -x a.sam fails part way. Changing one of the m to h solves this. So you're correct it doesn't like both C+m and C-m together. However looking at the https://github.com/samtools/htslib/blob/develop/test/base_mods/MM-double.sam file we see C+m and G-m. Ie we're recording the base type associated with the strand, as viewed from the original orientation sequenced and not the relative orientation for the listed strand. The specification also states this, explaining that the nucleotide is the base in the originally recorded sequence, and the +/- indicates which strand the modification is on.

Unless we somehow have a non-paired double-stranded sequence (ie not C≡G but C~C forming a non-tightly bound bubble) then C-m shouldn't ever exist. I'm not sure if it's possible to obtain and sequence double-stranded DNA with modified bases that break the usual base-pairing rules. My knowledge of the microbiology isn't sufficient here.

So some example data demonstrating your case would help.

I'll need a bit more time to investigate what's going on in the C+m;C-m case though and see whether it's legitimate for it to simply fail. Naively I would have assumed it'd just regurgitate the data, irrespective of whether it's an impossible case.

jkbonfield commented 1 year ago

Part of the problem here is the specification was changed post-implementation by the addition of the implicit vs explicit counting.

This meant that the one structure we did make public, is no longer up to the job:

typedef struct hts_base_mod {
    int modified_base;
    int canonical_base;
    int strand;
    int qual;
} hts_base_mod;

Hence given a modified base, bam_mods_query_type returns extra data including m_implicit as this was absent in hts_base_mod at the time of implementation. This is a demonstration of why making even more structures public would be a bad idea. I thought I could get away with having a public structure that was purposely created solely to match the specification, but sigh... even that changed! Maybe it's time for hts_base_mod2 including the implicit flag and maybe extra utility such as sequence position so it's not needed as its own argument?

I did wonder whether we should really be iterating over sequence bases and recording yes, no, not-recorded type results, but we know one day someone is going to do something nightmarish such as checking for m on every base and h only on CPG island bases, or something similar, giving us both implicit and explicit coordinates mixed together in the same MM line! That would make a sequence iterator very clunky, so modification iterator (and let the user deal with such bizarreness) feels safer.

Suggestions welcomed from those who are actually consuming this data. Your insights are almost certainly better than me second guessing things.

Edit: an alternative would be bam_mods_query_type2 which changes the returned m_strand, m_implicit and m_canonical to be filters if non-zero. So if all values (after deref) are 0 then it's the same as bam_mods_query_type, but if non-zero then they are used to restrict the returned value to only things matching the input filters. This is akin to your suggestion in your last line above. Strand could be something as simple as >0 and <0 with 0 being "don't care".

It doesn't really "feel right" though. It's basically an interface where the user would have to go through all combinations doing a search "is this here?", "is that here?" rather than just asking what is present. So I do feel like the correct solution is a better iterator.

cjw85 commented 1 year ago

😄 I agree its probably best not to make it public.

You've hit one of my basic problems: the implicit/explicit causes some issues. The rest of your answer is also very prescient --- I'll record my thoughts based on my recent work, but its all basically what you've already predicted!

When using the pileup API to traverse by reference position, and then getting the list of mods on a read the lack of an entry cannot be taken as having a canonical base; it now implies "canonical" or "no call" dependent on the type of subtags present.

One idea I arrived at was that I would be useful if when doing:

n_mods = 256;
hts_base_mod_state *mod_state = p->cd.p;  // p a bam_pileup1_t*
hts_base_mod allmod[n_mods];
int nm = bam_mods_at_qpos(p->b, p->qpos, mod_state, allmod, n_mods);
if (nm < 0 ) continue;
for (int k = 0; k < nm && k < n_mods; ++k) {
    hts_base_mod mod = allmod[k];
    ...

That allmod corresponded to a complete list of all present (perhaps only those relevant to the query base) subtags, with a present/absent flag. This is your idea,

iterating over sequence bases and recording yes, no, not-recorded type results,

To help myself in the short term, I wrote a function similar to your bam_mods_query_type2 simply to tell me if a subtag is present (but then I need to fork htslib to add this):

int query_mod_subtag(hts_base_mod_state *state, int qtype, int qcanonical, char qstrand, int qimplicit) {
    int found = 0;
    for (size_t i=0; i<state->nmods; ++i) {
        if ((state->type[i] == qtype || state->type[i] == -qtype)
                && state->canonical[i] == qcanonical
                // although strand is typed char and documented as + or -, its actually 0/1
                && "+-"[state->strand[i]] == qstrand
                && state->implicit[i] == qimplicit) {
            found = 1;
            break;
        }
    }
    return found;
}

You're correct, it's a bit clunky for a client to be calling that for every subtag they care about (for each read, for each position in the pileup).

On C+m & C-m vs. C+m & G-m I see that you are correct. When we are for instance indicating possibilities of hemimethylation in sequencing that has examined both strands, it is the latter than should be used.

jts commented 1 year ago

I've been looking at implementing the base modification API in rust_htslib and agree that it would be very nice for the details of implicit/explicit mode to be hidden from the caller.

jkbonfield commented 1 year ago

I'm looking at this again now.

it would be very nice for the details of implicit/explicit mode to be hidden from the caller.

What are you actually asking for here when we have a mix of implicit and explicit calls together.

Do you want to treat uncalled implicit calls as if they were explicit calls with a minimal quality value? That could work I guess. It would then make both implicit and explicit calls iterated by the same code and filtering out low quality calls is done by the calling code.

I'm unsure what this buys us though. If your code is doing that already (to cope with the explicit calls where a mod is explicitly recorded as having e.g. confidence 0) then it'd give the same results anyway for most scenarios I think.

  1. Sequence base-by-base iteration:

We could have a separate query function for any given mod type to report whether or not it's explicitly or implicitly encoded. I think this is @jts's base_mod_queryi idea? However it's already there as the existing bam_mods_query_type, unless I'm misunderstanding things.

Note if you have no mods returned, you can still use bam_mods_query_type to get the array of possible types so they can be looked up in bam_mods_query_type.

  1. Sequence base-by-base iteration, try 2:

An alternative for coordinate based queries (either specific coordinates or seq iterators like bam_mods_at_next_pos) that produce fake entries for the implicit mode with a fixed pre-specified low confidence for data that's not included. Eg:

while ((n = bam_next_basemod2(b, m, mods, sizeof(mods)/sizeof(*mods), &pos))) {  
    for (int j = 0; j < n; j++) {                                               
        if (mods[j].qual == HTS_MOD_ABSENT) // implicit no-call                 
            continue;                                                           

        // report mod                                                           
    }                                                                           
}                                                                               

So basically in implicit mode we fill out the uncalled modifications to turn them into explicit "not found" calls, with a predefined standard quality score (eg -1, but controlled via a constant).

That means we'd get many more calls to iterate over, but most of them would fall below are call threshold, and we can either filter them by checking equality to HTS_MOD_ABSENT or we could also filter all unlikely events by doing e.g. if (mods[j].qual < 5). It means both implicit and explicit calls would be returned and rejected by the same condition.

We could also have an API call that sets the implicit quality to be used for a specific modification type (or all types), so the user can control what quality should be used for implicit uncalled mods. That could be stored in the opaque state.

  1. Iterator by next modified base. If we have this with a minimum quality (0 for everything) then again we can hide implicit vs explicit.

With minimum quality 5 and an implicit non-called quality of e.g. 1 specified, then we just skip over the implicit absent calls. However if we set the implicit non-called quality as 10 (approx P(exists)=0.04) and ask to iterate over all observed modified bases with qual >= 5 then we'd be expected to return all those implicit non-called bases, as if they were written in explicit mode.

jkbonfield commented 1 year ago

Argh sorry I'm being dense. It's the reverse...

You want explicit calls that don't cover a base to have a specific unknown likelihood.

Maybe it amounts to the same logic though. Instead of a constant HTS_MOD_ABSENT meaning implicit and uncalled (which is to say it was assigned a low confidence that was below the call threshold), we have a constant HTS_MOD_UNCALLED quality which means we didn't even look at this site.

So again we'd want a single iterator that expands up the explicit mod types and fills out that field so calling software can do whatever it needs to. Likely simply not add it to the "absent" column in the stats array, so we're correctly sampling found / not-found / unknown instead of just found / not-found.

cjw85 commented 1 year ago

I have to admit I've lost the thead here a bit --- it will take me a while to get back in the game. Meanwhile I've passed the baton of handling this information to @ArtRand.

jkbonfield commented 1 year ago

You're not the only one, as is evident! Ignore my first message :)

For what it's worth, I have a modified bam_mods_at_next_pos function (I'll do others too) which fills out the mod list for explicitly defined mods with a HTS_MOD_UNCHECKED quality (-2). I also have HTS_MOD_UNKNOWN as -1 for when ML is absent, but that probably needs to be something else.

An example from the test/test_mod tool:

@CO ATCATCATTCCTACCGCTATAGCCT  r3; mixture
@CO       -  -   m-  -m -     --   implicit, absent => not found
@CO       ?  ?   -h  -- -     ??   explicit, absent => not checked
@CO 
r3  0   *   0   0   *   *   0   0   ATCATCATTCCTACCGCTATAGCCT   *Mm:Z:C+m.,2,2;C+h?,2,0,0,0,0;  Ml:B:C,200,160,10,170,5,20,5
0   A   
1   T   
2   C   C+h. 
3   A   
4   T   
5   C   C+h. 
6   A   
7   T   
8   T   
9   C   C+m200 C+h10 
10  C   C+h170 
11  T   
12  A   
13  C   C+h5 
14  C   C+m160 C+h20 
15  G   
16  C   C+h5 
17  T   
18  A   
19  T   
20  A   
21  G   
22  C   C+h. 
23  C   C+h. 
24  T   
jkbonfield commented 1 year ago

The code producing the above output is:

        for (i = 0; i < b->core.l_qseq; i++) {                                  
            char line[8192], *lp = line, *ep = line + sizeof(line);             
            n = bam_mods_at_next_pos2(b, m, mods, 5);                           
            lp += snprintf(lp, ep - lp, "%d\t%c\t",                             
                           i, seq_nt16_str[bam_seqi(bam_get_seq(b), i)]);       
            for (j = 0; j < n && j < 5; j++) {                                  
                char qstr[10];                                                  
                if (mods[j].qual < 0)                                           
                    qstr[0] = '.', qstr[1] = 0;                                 
                else                                                            
                    snprintf(qstr, 10, "%d", mods[j].qual);                     

                if (extended) {                                                 
                    int m_strand, m_implicit;                                   
                    char m_canonical;                                           
                    int ret = bam_mods_query_type(m, mods[j].modified_base,     
                                                  &m_strand, &m_implicit,       
                                                  &m_canonical);                
                    if (ret < 0 ||                                              
                        m_canonical != mods[j].canonical_base ||                
                        m_strand    != mods[j].strand)                          
                        goto err;                                               
                    lp += snprintf(lp, ep - lp, "%c%c%s%c%s ",                  
                                   mods[j].canonical_base,                      
                                   "+-"[mods[j].strand],                        
                                   code(mods[j].modified_base),                 
                                   "?."[m_implicit],                            
                                   qstr);                                       
                } else {                                                        
                    lp += snprintf(lp, ep - lp, "%c%c%s%s ",                    
                                   mods[j].canonical_base,                      
                                   "+-"[mods[j].strand],                        
                                   code(mods[j].modified_base),                 
                                   qstr);                                       
                }                                                               
            }                                                                   
            *lp++ = '\n';                                                       
            *lp++ = 0;                                                          

            printf("%s", line);                                                                             

This is just a test. I think we need either some hidden state where to modify the behaviour of the functions by setting a flag in state, or we need modified API (bam_mods_at_next_pos2 etc) that takes an extra bit-field flag argument and both bam_mods_at_next_pos is just a redirect to bam_mods_at_next_pos2 is 0 flag.

The former could also work with a parameterised bam_parse_basemod, which sets the internal flag. I think this is probably the cleanest API change.

bam_parse_basemod2(b, m, HTS_MOD_REPORT_UNCHECKED);

jts commented 1 year ago

I think this is @jts's base_mod_queryi idea? However it's already there as the existing bam_mods_query_type, unless I'm misunderstanding things. Note if you have no mods returned, you can still use bam_mods_query_type to get the array of possible types so they can be looked up in bam_mods_query_type.

Just to clarify this point, the problem I have is that when there are multiple modifications with the same code (like in the MM-double test case) base_mods_recorded will give you an array like ['m', 'm', 'o'] but you can only use base_mod_query_type to get the strand/explicit/canonical data for the first m and the o, but not the second m. So my proposal is base_mod_queryi, which would fill in strand/explicit/canonical for the i-th element returned by base_mods_recorded.

(I haven't looked over the other proposals, but will make time to do so)

Jared

jkbonfield commented 1 year ago

Ok I got it. Yes I hadn't spotted that.