Itolstoganov opened 5 months ago
For practical reasons, I’d like to argue that we still have randstrobes, we just compute their hash differently. If we didn’t have randstrobes anymore, we’d have to rename a lot of functions, files and variables. So I see it as "replaces randstrobe hashes with multi-context hashes" or something like that.
Sure, edited the PR description.
Where hash1 and hash2 are the syncmer hashes, with hash1 being the smaller hash. Is this so that randstrobe hashes are symmetric as before? That is, would that be different when we switch to asymmetric hashes?
Yes, the only reason for selecting minimal hash is to keep the hash symmetric.
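That symmetry can be illustrated with a small standalone re-implementation of the formula from the PR description (the function name is made up for this sketch; it is not the actual strobealign code):

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the multi-context hash from the PR description. Selecting the
// smaller syncmer hash as the main part makes the result symmetric in its
// two arguments, matching the behaviour of the old randstrobe hash.
uint64_t multi_context_hash(uint64_t a, uint64_t b, unsigned digest_size) {
    uint64_t hash1 = std::min(a, b);  // smaller hash -> main part (high bits)
    uint64_t hash2 = std::max(a, b);  // larger hash -> auxiliary part (low bits)
    return ((hash1 >> digest_size) << digest_size) ^ (hash2 >> (64 - digest_size));
}
```

With an asymmetric hash, `hash1` would instead be fixed to a particular strobe, and swapping the arguments would generally change the result.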
This PR introduces quite a bit of terminology that is sometimes overlapping and sometimes used inconsistently, and I wonder whether that could be simplified a bit.
I tried to alleviate that by removing all mentions of "digest" (by which I meant the same thing as the "auxiliary" part of the multi-context hash) and "subhash".
Is this PR ready for a larger benchmarking?
The randstrobe iterator for queries stops when there are still w_min syncmers left in the read, by checking

bool has_next() {
    return strobe1_index + w_min < syncmers.size();
}

This is the expected behaviour for our current seeds. For example, w_min=1 for read lengths 50 and 75, and w_min=4 for 100.
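As a sanity check of that stopping rule, here is a toy model (names are illustrative; this is not the actual strobealign iterator) counting how many seeds it produces:

```cpp
#include <cstddef>

// Toy model of the query iterator's stopping rule: has_next() becomes
// false once only w_min syncmers remain, so n syncmers yield n - w_min
// randstrobes per strand.
size_t num_seeds(size_t n_syncmers, size_t w_min) {
    size_t strobe1_index = 0;
    size_t count = 0;
    while (strobe1_index + w_min < n_syncmers) {  // mirrors has_next()
        ++count;
        ++strobe1_index;
    }
    return count;
}
```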
However, I think mcs can be boosted further by adding the remaining syncmers at the ends of reads as seeds. This means 2*w_min more seeds per read (forward and reverse-complement ends).
I haven't thought carefully about the best way to change this in the code, but perhaps has_next() could be changed to return strobe1_index < syncmers.size(), together with adding a case to RandstrobeIterator that returns randstrobes with strobe1.hash as the base/main hash and an auxiliary hash of 0 when i < w_start?
I implemented and tested this briefly:

Changed has_next() to

bool has_next() {
    return strobe1_index < syncmers.size();
}

Added in RandstrobeIterator::get:

if (syncmers.size() < w_start) {
    return Randstrobe{
        randstrobe_hash(strobe1.hash, strobe2.hash, aux_len),
        static_cast<uint32_t>(strobe1.position),
        static_cast<uint32_t>(strobe2.position), true
    };
}
The results were only very slightly improved/nearly unchanged on a 'no variation' Drosophila genome in SE mapping mode, but they improve substantially for a high-variation simulation (see below for read lengths 100 and 150). This also suggests that we may be overfitting our optimization to high-quality simulations without much variation.
ref,read_length,tool,mode,%-aligned,accuracy,time(sec),Mem(MB)
droso_above10k_variation,100,strobealign-mcs_SE,align,48.815,44.2065,704,1.45,778.4
droso_above10k_variation,100,strobealign-mcs-more-seeds_SE,align,48.9885,44.2975,704,1.55,778.36
droso_above10k_variation,150,strobealign-mcs_SE,align,49.0685,44.9805,839,2.47,781.87
droso_above10k_variation,150,strobealign-mcs-more-seeds_SE,align,49.258,45.1925,839,2.8,787.6
Needs to be tested on larger genome(s) obv.
This also suggests that we may be overfitting our optimization to high-quality simulations without much variation.
I am a bit worried about this. Doesn’t this fit the pattern that we often see worse variant detection rates for "optimized" mapping parameters? Shall I perhaps run the parameter optimization on data with higher variation?
Shall I perhaps run the parameter optimization on data with higher variation?
Sounds like a good suggestion to me! From what you told me, it should come with relatively little extra computation time since the same index can be used for read sets with different levels of variation w.r.t. the reference genome.
Below are the variant rates I used for SIM1-4. But I think SIM4 is the only setting that really tests aligning around variants properly. I would maybe even set --sv-indel-rate 0.00002 --snp-rate 0.005 --small-indel-rate 0.001 --max-small-indel-size 100 to make it really challenging.
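To get a feel for how challenging that setting is, a back-of-the-envelope count of expected variants, under the assumption (not verified here) that mason_variator's rates act as per-base probabilities:

```cpp
#include <cstdint>

// Rough estimate only; assumes the mason_variator rates are per-base
// probabilities, which is an assumption about the tool's semantics.
uint64_t expected_variants(double per_base_rate, uint64_t genome_size) {
    return static_cast<uint64_t>(per_base_rate * genome_size);
}
```

Under that assumption, --snp-rate 0.005 would mean on the order of 5000 SNPs per Mbp, plus roughly 1000 small indels per Mbp at --small-indel-rate 0.001.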
rule mason_simulate_variants:
    input: ref = config["HG38"]
    output: sim_vcf = config["ROOT_OUT"] + "/reads/{dataset}/variations.vcf",
    run:
        if wildcards.dataset == "SIM1":
            shell("mason_variator -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM2":
            shell("mason_variator --sv-indel-rate 0.000001 --snp-rate 0.001 --small-indel-rate 0.00001 --max-small-indel-size 20 -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM3":
            shell("mason_variator --sv-indel-rate 0.000005 --snp-rate 0.001 --small-indel-rate 0.0001 --max-small-indel-size 50 -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM4":
            shell("mason_variator --sv-indel-rate 0.00001 --snp-rate 0.005 --small-indel-rate 0.0005 --max-small-indel-size 50 -ir {input.ref} -ov {output.sim_vcf}")
From what you told me, it should come with relatively little extra computation time since the same index can be used for read sets with different levels of variation w.r.t. the reference genome.
Sorry if you got the wrong impression, but since I don’t store the index on disk, everything has to be recomputed. I’ll run this on a cluster somewhere to get the results faster this time.
Below were the variant rates I used for SIM1-4. But I think SIM4 is the only setting that really tests aligning around variants properly. I would maybe even set
--sv-indel-rate 0.00002 --snp-rate 0.005 --small-indel-rate 0.001 --max-small-indel-size 100
to make it really challenging.
Good, I’ll use that and would like to call it SIM5; is that ok?
Good, I’ll use that and would like to call it SIM5; is that ok?
Yes, sounds good!
This replaces randstrobe hashes with multi-context hashes. Multi-context hashes are defined as follows:

((hash1 >> digest_size) << digest_size) ^ (hash2 >> (64 - digest_size))

where `hash1` and `hash2` are the syncmer hashes, with `hash1` being the smaller hash. The `64 - digest_size`-bit prefix of the multi-context hash, corresponding to `hash1`, is denoted as the main hash, and the `digest_size`-bit suffix, corresponding to `hash2`, is denoted as the auxiliary hash.

The prefix of size `64 - digest_size` corresponding to the smaller syncmer is used for partial search. If a given full 64-bit query hash is not found in the index, the seeds with the same main hash are looked up instead and added as matches. These partial matches are used for NAM construction in the same way as the full matches.

In order to calculate the reference range of a partial match correctly, we need to know which strobe was used as the main one. This information is stored in the `m_packed` field of `RefRandstrobe`, replacing one of the bits previously reserved for the reference index.

List of changes:
- Partial matches are looked up in the `find_nams` function if full matches were not found. Partial matches have a larger abundance threshold than full matches due to higher seed repetitiveness. They are added per reference in `add_to_hits_per_ref`.
- A `partial_find` function that looks for the seeds with the same main hash as the query was added to the `StrobemerIndex`.
- Further supporting changes were made to the `StrobemerIndex`.
- The `m_packed` field of the `RefRandstrobe` is now set iff the first strobe was used as the main part of the multi-context seed.
- Multi-context hash information was added to the `QueryRandstrobe` class.
- A `--digest` parameter was added to regulate the auxiliary part of the multi-context hash, with a default value of 24.
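The full-then-partial lookup described above can be sketched with a toy index (illustrative only; the container, names, and lookup strategy here are made up and do not reflect the real `StrobemerIndex` layout):

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Toy index keyed by the 64-bit multi-context hash. find() first tries a
// full-hash match; if none exists, it falls back to a partial search that
// compares only the main part, i.e. the top 64 - digest_size bits.
struct ToyIndex {
    unsigned digest_size;
    std::multimap<uint64_t, int> seeds;  // hash -> reference position

    std::vector<int> find(uint64_t query) const {
        std::vector<int> hits;
        auto range = seeds.equal_range(query);
        for (auto it = range.first; it != range.second; ++it)
            hits.push_back(it->second);
        if (!hits.empty())
            return hits;  // full 64-bit match found
        // Partial search: accept any seed whose main hash matches.
        for (const auto& [hash, pos] : seeds)
            if ((hash >> digest_size) == (query >> digest_size))
                hits.push_back(pos);
        return hits;
    }
};
```

A real implementation would locate the partial matches via the sorted order of the index rather than a linear scan; the sketch only shows the fallback logic.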