ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers

Add multi-context seeds #388

Open Itolstoganov opened 5 months ago

Itolstoganov commented 5 months ago

This replaces randstrobe hashes with multi-context hashes. A multi-context hash is defined as follows:

((hash1 >> digest_size) << digest_size) ^ (hash2 >> (64 - digest_size))

Where hash1 and hash2 are the syncmer hashes of the two strobes, with hash1 being the smaller of the two. The (64 - digest_size)-bit prefix of the multi-context hash, derived from hash1, is denoted the main hash, and the digest_size-bit suffix, derived from hash2, is denoted the auxiliary hash.
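As a minimal C++ sketch of this construction (names are illustrative, not the PR's actual API; aux_len stands for digest_size above and is assumed to satisfy 0 < aux_len < 64):

    #include <algorithm>
    #include <cstdint>

    // Build a multi-context hash from two syncmer hashes: the high
    // (64 - aux_len) bits come from the smaller hash (the main hash), the
    // low aux_len bits from the top bits of the larger hash (the auxiliary hash).
    uint64_t multi_context_hash(uint64_t a, uint64_t b, unsigned aux_len) {
        uint64_t hash1 = std::min(a, b);  // smaller hash first keeps the result symmetric
        uint64_t hash2 = std::max(a, b);
        return ((hash1 >> aux_len) << aux_len) ^ (hash2 >> (64 - aux_len));
    }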

The (64 - digest_size)-bit prefix corresponding to the smaller syncmer is used for partial search: if a given full 64-bit query hash is not found in the index, seeds with the same main hash are looked up instead and added as matches. These partial matches are used for NAM construction in the same way as full matches.
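A partial match then amounts to comparing only the main-hash prefix; a hedged sketch of that comparison (the helper name is hypothetical):

    // Two multi-context hashes match partially if their main hashes agree,
    // i.e. the top (64 - aux_len) bits; the auxiliary bits are ignored.
    bool main_hash_matches(uint64_t query_hash, uint64_t ref_hash, unsigned aux_len) {
        return (query_hash >> aux_len) == (ref_hash >> aux_len);
    }

Because the main hash occupies the most significant bits, all entries sharing a main hash are adjacent when the index is sorted by hash, so a partial lookup can return a contiguous range.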

In order to calculate the reference range of a partial match correctly, we need to know which strobe was used as the main one. This information is stored in the m_packed field of RefRandstrobe, taking over one of the bits previously reserved for the reference index.
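For illustration only (the field widths and names below are made up, not the PR's exact RefRandstrobe layout), the idea is to spend one m_packed bit on a flag:

    // Hypothetical seed record: one bit of m_packed marks whether the first
    // strobe carries the main hash; the remaining bits hold the reference
    // index (one bit fewer than before) and the strobe2 offset.
    struct PackedSeed {
        uint64_t hash;       // multi-context hash
        uint32_t position;   // start of strobe1 on the reference
        uint32_t m_packed;   // [main-is-first:1 | ref index | strobe2 offset]

        bool first_strobe_is_main() const { return m_packed >> 31; }
    };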

List of changes:

Itolstoganov commented 3 months ago

> For practical reasons, I’d like to argue that we still have randstrobes, we just compute their hash differently. If we didn’t have randstrobes anymore, we’d have to rename a lot of functions, files and variables. So I see it as "replaces randstrobe hashes with multi-context hashes" or something like that.

Sure, edited the PR description.

> Where hash1 and hash2 are the syncmer hashes, with hash1 being the smaller hash. Is this so that randstrobe hashes are symmetric as before? That is, would that be different when we switch to asymmetric hashes?

Yes, the only reason for selecting the minimal hash is to keep the hash symmetric.

> This PR introduces quite a bit of terminology that is sometimes overlapping and sometimes used inconsistently, and I wonder whether that could be simplified a bit.

I tried to alleviate that by removing all mentions of "digest" (by which I meant the same thing as the "auxiliary" part of the multi-context hash) and "subhash".

ksahlin commented 3 months ago

Is this PR ready for larger benchmarking?

ksahlin commented 2 months ago

The randstrobe iterator for queries stops while there are still w_min syncmers left in the read, by checking

    bool has_next() {
        return strobe1_index + w_min < syncmers.size();
    }

This is expected behaviour for our current seeds. For example, w_min=1 for read lengths 50 and 75, and w_min=4 for read length 100.

However, I think mcs can be boosted further by adding the remaining syncmers at the ends of reads as seeds. This means 2*w_min more seeds per read (fw and rc ends).

I haven't thought carefully about the best way to change this in the code, but perhaps change bool has_next() to return strobe1_index < syncmers.size(), together with adding a case in RandstrobeIterator::get that returns randstrobes with the base/main hash set to strobe1.hash and the auxiliary hash set to 0 when i < w_start?

ksahlin commented 2 months ago

I implemented and tested this briefly:

  1. Changed to

    bool has_next() {
        return strobe1_index < syncmers.size();
    }
  2. Added in RandstrobeIterator::get

    if (syncmers.size() < w_start) {
        return Randstrobe{
                randstrobe_hash(strobe1.hash, strobe2.hash, aux_len),
                static_cast<uint32_t>(strobe1.position),
                static_cast<uint32_t>(strobe2.position), true
        };
    }

The results are only very slightly improved/nearly unchanged on a 'no variation' Drosophila genome in SE mapping mode, but they improve substantially for a high-variation simulation (see below for read lengths 100 and 150). This also suggests that we may be overfitting our optimization to high-quality simulations without much variation.

    ref,read_length,tool,mode,%-aligned,accuracy,time(sec),Mem(MB)
    droso_above10k_variation,100,strobealign-mcs_SE,align,48.815,44.2065,704,1.45,778.4
    droso_above10k_variation,100,strobealign-mcs-more-seeds_SE,align,48.9885,44.2975,704,1.55,778.36

    droso_above10k_variation,150,strobealign-mcs_SE,align,49.0685,44.9805,839,2.47,781.87
    droso_above10k_variation,150,strobealign-mcs-more-seeds_SE,align,49.258,45.1925,839,2.8,787.6

Needs to be tested on larger genome(s) obv.

marcelm commented 2 months ago

> This also suggests that we may be overfitting our optimization to high-quality simulations without much variation.

I am a bit worried about this. Doesn’t this fit the pattern we often see, where "optimized" mapping parameters give worse variant detection rates? Shall I perhaps run the parameter optimization on data with higher variation?

ksahlin commented 2 months ago

> Shall I perhaps run the parameter optimization on data with higher variation?

Sounds like a good suggestion to me! From what you told me, it should come with relatively little extra computation time since the same index can be used for read sets with different levels of variation w.r.t. the reference genome.

Below are the variant rates I used for SIM1-4. But I think SIM4 is the only setting that really tests aligning around variants properly. I would maybe even set --sv-indel-rate 0.00002 --snp-rate 0.005 --small-indel-rate 0.001 --max-small-indel-size 100 to make it really challenging.

rule mason_simulate_variants:
    input:  ref = config["HG38"]
    output: sim_vcf =  config["ROOT_OUT"] + "/reads/{dataset}/variations.vcf",
    run:
        if wildcards.dataset == "SIM1":
            shell("mason_variator -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM2":
            shell("mason_variator --sv-indel-rate 0.000001 --snp-rate 0.001 --small-indel-rate 0.00001 --max-small-indel-size 20  -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM3":
            shell("mason_variator --sv-indel-rate 0.000005 --snp-rate 0.001 --small-indel-rate 0.0001 --max-small-indel-size 50   -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM4":
            shell("mason_variator --sv-indel-rate 0.00001 --snp-rate 0.005 --small-indel-rate 0.0005 --max-small-indel-size 50   -ir {input.ref} -ov {output.sim_vcf}")

marcelm commented 2 months ago

> From what you told me, it should come with relatively little extra computation time since the same index can be used for read sets with different levels of variation w.r.t. the reference genome.

Sorry if you got the wrong impression, but since I don’t store the index on disk, everything has to be recomputed. I’ll run this on a cluster somewhere to get the results faster this time.

> Below are the variant rates I used for SIM1-4. But I think SIM4 is the only setting that really tests aligning around variants properly. I would maybe even set --sv-indel-rate 0.00002 --snp-rate 0.005 --small-indel-rate 0.001 --max-small-indel-size 100 to make it really challenging.

Good, I’ll use that and would like to call it SIM5; is that ok?

ksahlin commented 1 month ago

> Good, I’ll use that and would like to call it SIM5; is that ok?

Yes, sounds good!