ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers

Add multi-context seeds #388

Open Itolstoganov opened 5 months ago

Itolstoganov commented 5 months ago

This replaces randstrobe hashes with multi-context hashes. A multi-context hash is defined as follows:

((hash1 >> digest_size) << digest_size) ^ (hash2 >> (64 - digest_size))

Where hash1 and hash2 are the syncmer hashes of the two strobes, with hash1 being the smaller of the two. The (64 - digest_size)-bit prefix of the multi-context hash, derived from hash1, is denoted the main hash, and the digest_size-bit suffix, derived from hash2, is denoted the auxiliary hash.
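As a minimal C++ sketch of this construction (names are illustrative, not the PR's actual API; aux_len stands for digest_size above and is assumed to satisfy 0 < aux_len < 64):

    #include <algorithm>
    #include <cstdint>

    // Build a multi-context hash from two syncmer hashes: the high
    // (64 - aux_len) bits come from the smaller hash (the main hash), the
    // low aux_len bits from the top bits of the larger hash (the auxiliary hash).
    uint64_t multi_context_hash(uint64_t a, uint64_t b, unsigned aux_len) {
        uint64_t hash1 = std::min(a, b);  // smaller hash first keeps the result symmetric
        uint64_t hash2 = std::max(a, b);
        return ((hash1 >> aux_len) << aux_len) ^ (hash2 >> (64 - aux_len));
    }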

The (64 - digest_size)-bit prefix corresponding to the smaller syncmer is used for partial search: if a given full 64-bit query hash is not found in the index, seeds with the same main hash are looked up instead and added as matches. These partial matches are used for NAM construction in the same way as full matches.
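A partial match then amounts to comparing only the main-hash prefix; a hedged sketch of that comparison (the helper name is hypothetical):

    // Two multi-context hashes match partially if their main hashes agree,
    // i.e. the top (64 - aux_len) bits; the auxiliary bits are ignored.
    bool main_hash_matches(uint64_t query_hash, uint64_t ref_hash, unsigned aux_len) {
        return (query_hash >> aux_len) == (ref_hash >> aux_len);
    }

Because the main hash occupies the most significant bits, all entries sharing a main hash are adjacent when the index is sorted by hash, so a partial lookup can return a contiguous range.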

In order to calculate the reference range of a partial match correctly, we need to know which strobe was used as the main one. This information is stored in the m_packed field of RefRandstrobe, taking over one of the bits previously reserved for the reference index.
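For illustration only (the field widths and names below are made up, not the PR's exact RefRandstrobe layout), the idea is to spend one m_packed bit on a flag:

    // Hypothetical seed record: one bit of m_packed marks whether the first
    // strobe carries the main hash; the remaining bits hold the reference
    // index (one bit fewer than before) and the strobe2 offset.
    struct PackedSeed {
        uint64_t hash;       // multi-context hash
        uint32_t position;   // start of strobe1 on the reference
        uint32_t m_packed;   // [main-is-first:1 | ref index | strobe2 offset]

        bool first_strobe_is_main() const { return m_packed >> 31; }
    };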

List of changes:

Itolstoganov commented 3 months ago

> For practical reasons, I’d like to argue that we still have randstrobes, we just compute their hash differently. If we didn’t have randstrobes anymore, we’d have to rename a lot of functions, files and variables. So I see it as "replaces randstrobe hashes with multi-context hashes" or something like that.

Sure, edited the PR description.

> Where hash1 and hash2 are the syncmer hashes, with hash1 being the smaller hash. Is this so that randstrobe hashes are symmetric as before? That is, would that be different when we switch to asymmetric hashes?

Yes, the only reason for selecting the minimal hash is to keep the hash symmetric.

> This PR introduces quite a bit of terminology that is sometimes overlapping and sometimes used inconsistently, and I wonder whether that could be simplified a bit.

I tried to alleviate that by removing all mentions of "digest" (by which I meant the same thing as the "auxiliary" part of the multi-context hash) and "subhash".

ksahlin commented 3 months ago

Is this PR ready for larger benchmarking?

ksahlin commented 2 months ago

The randstrobe iterator for queries stops while there are still w_min syncmers left in the read, by checking

    bool has_next() {
        return strobe1_index + w_min < syncmers.size();
    }

This is expected behaviour for our current seeds. For example, w_min=1 for read lengths 50 and 75, and w_min=4 for read length 100.

However, I think mcs can be boosted further by adding the remaining syncmers at the ends of reads as seeds. This means 2*w_min more seeds per read (fw and rc ends).

I haven't thought carefully about the best way to change this in the code, but perhaps change bool has_next() to return strobe1_index < syncmers.size(), together with adding a case in RandstrobeIterator::get that returns randstrobes with the base/main hash set to strobe1.hash and the auxiliary hash set to 0 when i < w_start?

ksahlin commented 2 months ago

I implemented and tested this briefly:

  1. Changed to

    bool has_next() {
        return strobe1_index < syncmers.size();
    }
  2. Added in RandstrobeIterator::get

    if (syncmers.size() < w_start) {
        return Randstrobe{
                randstrobe_hash(strobe1.hash, strobe2.hash, aux_len),
                static_cast<uint32_t>(strobe1.position),
                static_cast<uint32_t>(strobe2.position), true
        };
    }

The results are only very slightly improved/nearly unchanged on a 'no variation' Drosophila genome in SE mapping mode, but they improve substantially for a high-variation simulation (see below for read lengths 100 and 150). This also suggests that we may be overfitting our optimization to high-quality simulations without much variation.

    ref,read_length,tool,mode,%-aligned,accuracy,time(sec),Mem(MB)
    droso_above10k_variation,100,strobealign-mcs_SE,align,48.815,44.2065,704,1.45,778.4
    droso_above10k_variation,100,strobealign-mcs-more-seeds_SE,align,48.9885,44.2975,704,1.55,778.36

    droso_above10k_variation,150,strobealign-mcs_SE,align,49.0685,44.9805,839,2.47,781.87
    droso_above10k_variation,150,strobealign-mcs-more-seeds_SE,align,49.258,45.1925,839,2.8,787.6

Needs to be tested on larger genome(s) obv.

marcelm commented 2 months ago

> This also suggests that we may be overfitting our optimization to high-quality simulations without much variation.

I am a bit worried about this. Doesn’t this fit the pattern we often see, where "optimized" mapping parameters give worse variant detection rates? Shall I perhaps run the parameter optimization on data with higher variation?

ksahlin commented 2 months ago

> Shall I perhaps run the parameter optimization on data with higher variation?

Sounds like a good suggestion to me! From what you told me, it should come with relatively little extra computation time since the same index can be used for read sets with different levels of variation w.r.t. the reference genome.

Below are the variant rates I used for SIM1-4. But I think SIM4 is the only setting that really tests aligning around variants properly. I would maybe even set --sv-indel-rate 0.00002 --snp-rate 0.005 --small-indel-rate 0.001 --max-small-indel-size 100 to make it really challenging.

rule mason_simulate_variants:
    input:  ref = config["HG38"]
    output: sim_vcf =  config["ROOT_OUT"] + "/reads/{dataset}/variations.vcf",
    run:
        if wildcards.dataset == "SIM1":
            shell("mason_variator -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM2":
            shell("mason_variator --sv-indel-rate 0.000001 --snp-rate 0.001 --small-indel-rate 0.00001 --max-small-indel-size 20  -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM3":
            shell("mason_variator --sv-indel-rate 0.000005 --snp-rate 0.001 --small-indel-rate 0.0001 --max-small-indel-size 50   -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM4":
            shell("mason_variator --sv-indel-rate 0.00001 --snp-rate 0.005 --small-indel-rate 0.0005 --max-small-indel-size 50   -ir {input.ref} -ov {output.sim_vcf}")

marcelm commented 2 months ago

> From what you told me, it should come with relatively little extra computation time since the same index can be used for read sets with different levels of variation w.r.t. the reference genome.

Sorry if you got the wrong impression, but since I don’t store the index on disk, everything has to be recomputed. I’ll run this on a cluster somewhere to get the results faster this time.

> Below are the variant rates I used for SIM1-4. But I think SIM4 is the only setting that really tests aligning around variants properly. I would maybe even set --sv-indel-rate 0.00002 --snp-rate 0.005 --small-indel-rate 0.001 --max-small-indel-size 100 to make it really challenging.

Good, I’ll use that and would like to call it SIM5; is that ok?

ksahlin commented 1 month ago

> Good, I’ll use that and would like to call it SIM5; is that ok?

Yes, sounds good!