Closed marcelm closed 1 month ago
I need to check whether the "Number of rescue NAMs" figure is actually correct as it is so high.
One problem appears to be that we have two different types of "hits".
The first type of hit is a randstrobe successfully looked up in the index, represented by the start position in the StrobemerIndex::randstrobes
vector (first entry with the same hash as the randstrobe).
In find_nams
, for example, nr_good_hits
and total_hits
are used with that meaning:
https://github.com/ksahlin/strobealign/blob/a5bd1fd8f1b450109849d8e1910c1d8e250fafd7/src/nam.cpp#L165-L173
Also, the RescueHit
(in the same file) is such a hit.
The second type is an interval on the query paired with an interval on the reference. This is, in particular, the Hit
struct:
https://github.com/ksahlin/strobealign/blob/a5bd1fd8f1b450109849d8e1910c1d8e250fafd7/src/nam.cpp#L5-L10
You get a list of such hits when one takes a hit according to the first definition and iterates through all index positions at which the hash matches.
This meaning of hit is used in, for example, merge_hits_into_nams
, add_to_hits_per_ref
, Nam::n_hits
.
The new statistics output in this PR uses the first meaning of hit, but I thought it was the second. With the second meaning, we would expect to see fewer NAMs then hits (because hits are merged into NAMs), which made me suspicious, but since it is the first meaning, this does not have to be the case.
Any suggestions for a better name for one of the two variants of hits?
For printing diagnostics, maybe we could use distinct hash matches
for the first type and seed/hit locations
for the second type? These are not ideal variable names if you refer to changing variable names. I think 'position/location' or 'pos/loc' should be included in the second as it is the characteristic that separates them.
I thought about this some more and have a suggestion that would feel quite natural to me and requires only minimal changes: We simply rename the Hit
struct to Match
.
That is, we continue using the term "hit" for strobemer successfully looked up in the index and then use the term "match" for a locus on the query paired with a locus on the reference.
I think this would be quite nice because we merge the latter entities into NAMs, and then we could say that we merge multiple overlapping matches into non-overlapping, approximate matches (=NAMs).
I think the logging output could actually stay as it is (it is only shown with -v
anyway and is mostly useful for us). But if we want, we could extend it print the number of matches:
Number of reads: 1000000
Number of non-rescue hits: 25160733 total. Per read: 25.16
Number of non-rescue matches: xyz total. Per read: xx.yy
Number of non-rescue NAMs: 27321159 total. Per read: 27.32
Number of rescue hits: 583028 total. Per rescue attempt: 23.07
Number of rescue matches: xyz total. Per rescue attempt: xx.yy
Number of rescue NAMs: 11457402 total. Per rescue attempt: 453.27
I tried to use "distinct hash matches" and "query randstrobes found in index", but it feels a bit contorted. And these messages are not self-explanatory anyway, so they might as well be concise and use clearly defined terms.
I took the liberty to push a commit that makes the above change so that you can see for yourself (if you want).
Ok, I like the suggestion.
Some Qs:
Number of non-rescue hits
incremented with 1 or with count(hit)
for each seed found in the index? Number of non-rescue NAMs
be larger than Number of non-rescue hits
in the numbers above? Assuming you haven't made up the numbers in the example. (related to Q2?)
- Does this mean that a hit does not become a match only if the hit was too repetitive (filtered out)?
Yes. (The filtering criteria on which this depends are of course different for rescue and non-rescue.)
- is
Number of non-rescue hits
incremented with 1 or withcount(hit)
for each seed found in the index?
By 1.
- How can
Number of non-rescue NAMs
be larger thanNumber of non-rescue hits
in the numbers above? Assuming you haven't made up the numbers in the example. (related to Q2?)
This is what confused me as well and made me wonder whether the numbers are correct. But with what I answered for Q2, I think it’s clear:
And there’s no reason $k$ cannot be larger than $n$.
I think if we also report the number of matches (as an additional line between the number of of hits and NAMs), it may be a bit more obvious what is going on: The numbers go up from hit to match and down again because matches are merged.
- [...] for each seed found in the index?
By the way, does a "seed" correspond to a "hit" or a "match"? Or do you use the term loosely and it be either?
It would also be totally fine with me to use "seed" instead of hit or match if you think that fits better for one of the terms.
Ok, now everything is clear. I approve a merge.
By the way, does a "seed" correspond to a "hit" or a "match"? Or do you use the term loosely and it be either?
Sorry for writing seed. I am not a 100% sure there is an agreed definition of a seed
in the literature. If there is, my guess is that a seed implies it's a match (to extend from..). Let us stick with your terminology of hit and match.
And there’s no reason k cannot be larger than n.
Agreed. Also, we could theoretically have up to nk
NAMs (i.e. we are not able to merge any match).
I’ve removed the last commit that does the renaming because that will add some conflicts with the multi-context seeds branch. I’ll open a separate PR that only does the renaming. We can then merge that one after mcs.
With this, some extra diagnostics are printed when verbose output is requested with
-v
:I’m using this to figure out why the multi-context seeds branch makes strobealign slower.