ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
150 stars 17 forks source link

Add some diagnostics #438

Closed marcelm closed 1 month ago

marcelm commented 2 months ago

With this, some extra diagnostics are printed when verbose output is requested with -v:

Number of reads: 1000000
Number of non-rescue hits: 25160733 total. Per read: 25.16
Number of non-rescue NAMs: 27321159 total. Per read: 27.32
Number of rescue hits: 583028 total. Per rescue attempt: 23.07
Number of rescue NAMs: 11457402 total. Per rescue attempt: 453.27

I’m using this to figure out why the multi-context seeds branch makes strobealign slower.

marcelm commented 2 months ago

I need to check whether the "Number of rescue NAMs" figure is actually correct as it is so high.

marcelm commented 2 months ago

One problem appears to be that we have two different types of "hits".

The first type of hit is a randstrobe successfully looked up in the index, represented by the start position in the StrobemerIndex::randstrobes vector (first entry with the same hash as the randstrobe).

In find_nams, for example, nr_good_hits and total_hits are used with that meaning: https://github.com/ksahlin/strobealign/blob/a5bd1fd8f1b450109849d8e1910c1d8e250fafd7/src/nam.cpp#L165-L173

Also, the RescueHit (in the same file) is such a hit.

The second type is an interval on the query paired with an interval on the reference. This is, in particular, the Hit struct: https://github.com/ksahlin/strobealign/blob/a5bd1fd8f1b450109849d8e1910c1d8e250fafd7/src/nam.cpp#L5-L10

You get a list of such hits when one takes a hit according to the first definition and iterates through all index positions at which the hash matches.

This meaning of hit is used in, for example, merge_hits_into_nams, add_to_hits_per_ref, Nam::n_hits.

The new statistics output in this PR uses the first meaning of hit, but I thought it was the second. With the second meaning, we would expect to see fewer NAMs then hits (because hits are merged into NAMs), which made me suspicious, but since it is the first meaning, this does not have to be the case.

Any suggestions for a better name for one of the two variants of hits?

ksahlin commented 2 months ago

For printing diagnostics, maybe we could use distinct hash matches for the first type and seed/hit locations for the second type? These are not ideal variable names if you refer to changing variable names. I think 'position/location' or 'pos/loc' should be included in the second as it is the characteristic that separates them.

marcelm commented 1 month ago

I thought about this some more and have a suggestion that would feel quite natural to me and requires only minimal changes: We simply rename the Hit struct to Match.

That is, we continue using the term "hit" for strobemer successfully looked up in the index and then use the term "match" for a locus on the query paired with a locus on the reference.

I think this would be quite nice because we merge the latter entities into NAMs, and then we could say that we merge multiple overlapping matches into non-overlapping, approximate matches (=NAMs).

I think the logging output could actually stay as it is (it is only shown with -v anyway and is mostly useful for us). But if we want, we could extend it print the number of matches:

Number of reads: 1000000
Number of non-rescue hits: 25160733 total. Per read: 25.16
Number of non-rescue matches: xyz total. Per read: xx.yy
Number of non-rescue NAMs: 27321159 total. Per read: 27.32
Number of rescue hits: 583028 total. Per rescue attempt: 23.07
Number of rescue matches: xyz total. Per rescue attempt: xx.yy
Number of rescue NAMs: 11457402 total. Per rescue attempt: 453.27

I tried to use "distinct hash matches" and "query randstrobes found in index", but it feels a bit contorted. And these messages are not self-explanatory anyway, so they might as well be concise and use clearly defined terms.

I took the liberty to push a commit that makes the above change so that you can see for yourself (if you want).

ksahlin commented 1 month ago

Ok, I like the suggestion.

Some Qs:

  1. Does this mean that a hit does not become a match only if the hit was too repetitive (filtered out)?
  2. is Number of non-rescue hits incremented with 1 or with count(hit) for each seed found in the index?
  3. How can Number of non-rescue NAMs be larger than Number of non-rescue hits in the numbers above? Assuming you haven't made up the numbers in the example. (related to Q2?)
marcelm commented 1 month ago
  1. Does this mean that a hit does not become a match only if the hit was too repetitive (filtered out)?

Yes. (The filtering criteria on which this depends are of course different for rescue and non-rescue.)

  1. is Number of non-rescue hits incremented with 1 or with count(hit) for each seed found in the index?

By 1.

  1. How can Number of non-rescue NAMs be larger than Number of non-rescue hits in the numbers above? Assuming you haven't made up the numbers in the example. (related to Q2?)

This is what confused me as well and made me wonder whether the numbers are correct. But with what I answered for Q2, I think it’s clear:

And there’s no reason $k$ cannot be larger than $n$.

I think if we also report the number of matches (as an additional line between the number of of hits and NAMs), it may be a bit more obvious what is going on: The numbers go up from hit to match and down again because matches are merged.

marcelm commented 1 month ago
  1. [...] for each seed found in the index?

By the way, does a "seed" correspond to a "hit" or a "match"? Or do you use the term loosely and it be either?

It would also be totally fine with me to use "seed" instead of hit or match if you think that fits better for one of the terms.

ksahlin commented 1 month ago

Ok, now everything is clear. I approve a merge.

By the way, does a "seed" correspond to a "hit" or a "match"? Or do you use the term loosely and it be either?

Sorry for writing seed. I am not a 100% sure there is an agreed definition of a seed in the literature. If there is, my guess is that a seed implies it's a match (to extend from..). Let us stick with your terminology of hit and match.

And there’s no reason k cannot be larger than n.

Agreed. Also, we could theoretically have up to nk NAMs (i.e. we are not able to merge any match).

marcelm commented 1 month ago

I’ve removed the last commit that does the renaming because that will add some conflicts with the multi-context seeds branch. I’ll open a separate PR that only does the renaming. We can then merge that one after mcs.