Fix index construction statistics

Itolstoganov commented 2 months ago

Fixed several issues in the index statistics:

get_count(size_t position) counts the abundance of a seed starting from the given position, so it is now called with the position of the seed's first occurrence
Fixed median seed length
Clarified statistics text
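
To illustrate the first point, here is a minimal sketch of why the starting position matters. It assumes the seeds are stored as a vector of hashes sorted in ascending order; the free functions get_count and find_first below are hypothetical stand-ins for illustration and are not the actual strobealign index API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the index: seed hashes sorted in ascending order.
using Hashes = std::vector<std::uint64_t>;

// Counts entries equal to hashes[position], scanning forward from position.
// Entries before position are never seen, so the result is the full
// abundance only if position is the first occurrence of that hash.
std::size_t get_count(const Hashes& hashes, std::size_t position) {
    std::size_t count = 0;
    for (std::size_t i = position;
         i < hashes.size() && hashes[i] == hashes[position]; ++i) {
        ++count;
    }
    return count;
}

// Returns the position of the first entry that is >= hash
// (hashes.size() if there is none). For a hash that is present,
// this is its first occurrence.
std::size_t find_first(const Hashes& hashes, std::uint64_t hash) {
    return static_cast<std::size_t>(
        std::lower_bound(hashes.begin(), hashes.end(), hash) - hashes.begin());
}
```

With the hashes {3, 7, 7, 7, 9}, calling get_count at position 2 (the second 7) returns 2, whereas calling it at find_first(hashes, 7) returns the full abundance of 3.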

ksahlin commented 2 months ago

Hi Ivan,

Thanks - and great that you found this bug!

I approve the PR, but I will let Marcel make the final call.

@marcel, note that auto count = get_count(find(get_hash(it))); is a somewhat redundant call, since it involves two searches. A faster way would be to skip over all seeds with the same hash and increment the counters differently:

            tot_seed_count += count;
            tot_seed_count_sq += count * count;

However, this part of the code only prints index statistics, so optimising it is not crucial.

Hence I approve.
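
For reference, the single-pass idea from this comment could look roughly like the sketch below. It again assumes the seed hashes are available as a sorted vector; the struct SeedCountStats and the function count_runs are hypothetical names, and only tot_seed_count and tot_seed_count_sq are taken from the snippet above. Each run of equal hashes is visited once, so no search is needed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the accumulated statistics.
struct SeedCountStats {
    std::uint64_t distinct_seeds = 0;     // number of distinct hashes
    std::uint64_t tot_seed_count = 0;     // sum of abundances
    std::uint64_t tot_seed_count_sq = 0;  // sum of squared abundances
};

// Walks a sorted hash vector once, treating each run of equal hashes
// as one seed whose abundance is the length of the run.
SeedCountStats count_runs(const std::vector<std::uint64_t>& sorted_hashes) {
    SeedCountStats stats;
    std::size_t i = 0;
    while (i < sorted_hashes.size()) {
        // Advance j to the first entry whose hash differs from the one at i.
        std::size_t j = i + 1;
        while (j < sorted_hashes.size() && sorted_hashes[j] == sorted_hashes[i]) {
            ++j;
        }
        const std::uint64_t count = j - i;
        stats.distinct_seeds += 1;
        stats.tot_seed_count += count;
        stats.tot_seed_count_sq += count * count;  // a product, since ^ is XOR in C++
        i = j;  // skip over all seeds with the same hash
    }
    return stats;
}
```

For the hashes {3, 7, 7, 7, 9}, this yields 3 distinct seeds, a total count of 5 and a sum of squares of 11.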

marcelm commented 1 month ago

I’ll merge this so that it can be part of the next release. To be honest, I’ve never used print_diagnostics, so it being inefficient doesn’t affect me that much, and I guess it’s rarely used in practice anyway.

ksahlin / strobealign

Fix index construction statistics #434