sageserpent-open opened this issue 2 months ago
Another idea - further parallelization of fingerprinting. At the time of writing (main at f3b353220f1fad04064b4dc99b2d16210e72b71e), fingerprinting is executed in parallel across files.
This could be extended so that for large files, fingerprinting takes place in parallel swathes within each file. Each swathe would need to be primed with the bytes preceding it (a cost that is optimised away when the entire file is fingerprinted in one swathe), but for large files one hopes that the gain from parallelism outweighs this overhead.
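To make the swathe idea concrete, here is a minimal sketch. The fingerprinting function and the partitioning scheme are stand-ins, not the real implementation; the point is that each swathe owns a contiguous range of window start positions, so the only cross-swathe cost is re-reading the `windowSize - 1` bytes that prime each boundary.

```scala
object SwatheFingerprinting {
  // A simple polynomial hash over one window; this stands in for the real
  // fingerprinting and is purely illustrative.
  def windowFingerprint(bytes: Array[Byte], start: Int, windowSize: Int): Long =
    (start until start + windowSize).foldLeft(0L)((acc, index) =>
      acc * 31L + (bytes(index) & 0xff))

  // Partition the window *start positions* into swathes. Each swathe can be
  // fingerprinted independently (each group could be handed to its own
  // thread), because every window it owns lies entirely within bytes it can
  // read; the priming overhead is the overlap re-read at each boundary.
  def fingerprintInSwathes(bytes: Array[Byte],
                           windowSize: Int,
                           numberOfSwathes: Int): Seq[Long] = {
    val starts = 0 to bytes.length - windowSize
    starts
      .grouped(math.max(1, starts.size / numberOfSwathes))
      .toSeq
      // A parallel traversal (e.g. `Future.traverse`) would go here; kept
      // sequential in this sketch so the ordering guarantee is obvious.
      .flatMap(swathe => swathe.map(windowFingerprint(bytes, _, windowSize)))
  }
}
```

Because the swathes cover disjoint, contiguous runs of start positions, concatenating their results in swathe order reproduces exactly the sequential fingerprint stream.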
Another idea - can the fingerprinting done by `RollingHash` use plain `Long` arithmetic as well as `BigInt`? This could be used either when looking for fingerprints for a small match window size, or perhaps as a sloppy approximation when the target files being compared are all large. We have to make sure that fingerprints compared across files are all computed in the same way to infer potential matches - well actually, we don't, but we may end up with a lot of spurious potential matches. Hmm...
Worth doing some spikes on this idea...
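As a starting point for such a spike, here is one way the `Long`-only arithmetic could work: a Rabin–Karp style rolling hash whose modulus is a prime below 2^31, so that every intermediate product fits comfortably in a `Long` with no `BigInt` needed. The class name, constants, and method shapes are illustrative assumptions, not `RollingHash`'s actual API.

```scala
// Illustrative sketch, not the real RollingHash: with modulus < 2^31 and
// base, byte values < 2^9, every product below stays under 2^40, so plain
// Long arithmetic never overflows.
final class LongRollingHash(windowSize: Int) {
  private val modulus: Long = 2147483629L // a prime just below 2^31
  private val base: Long    = 257L
  // base^(windowSize - 1) mod modulus: the weight of the outgoing byte.
  private val leadingWeight: Long =
    (1 until windowSize).foldLeft(1L)((acc, _) => acc * base % modulus)

  // Hash an initial full window from scratch.
  def initial(window: Seq[Byte]): Long =
    window.foldLeft(0L)((acc, b) => (acc * base + (b & 0xff)) % modulus)

  // Slide the window one byte: drop `outgoing`, append `incoming`.
  def roll(hash: Long, outgoing: Byte, incoming: Byte): Long = {
    val withoutOutgoing =
      (hash - (outgoing & 0xff) * leadingWeight % modulus + modulus) % modulus
    (withoutOutgoing * base + (incoming & 0xff)) % modulus
  }
}
```

The caveat from above applies: a `Long`-based hash and a `BigInt`-based hash yield different fingerprints for the same content, so comparing them across files would only surface (many) spurious potential matches unless both sides use the same scheme.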
This follows on from #34, trying to win some more performance improvement.
The first thing to investigate is whether the subsumption and/or overlap tests for sections that form part of a potential match can be carried out more quickly using an alternative implementation to the existing fingertree one - there are other data structures suited to this, such as interval trees and segment trees.
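For reference, these are the two predicates any replacement structure has to answer; `Section` here is a hypothetical stand-in for the project's own section type, modelled as a half-open range. An interval tree keyed on start position with max-end annotations could answer "does anything overlap?" or "does anything subsume this?" in O(log n) per query.

```scala
// Hypothetical stand-in for the project's section type: a half-open
// [start, onePastEnd) range of positions within a file.
final case class Section(start: Int, onePastEnd: Int) {
  // This section entirely contains `that`.
  def subsumes(that: Section): Boolean =
    start <= that.start && that.onePastEnd <= onePastEnd

  // The two half-open ranges share at least one position.
  def overlapsWith(that: Section): Boolean =
    start < that.onePastEnd && that.start < onePastEnd
}
```

Note that with half-open ranges, two sections that merely abut (one ending exactly where the other starts) neither overlap nor subsume each other, which is the behaviour the tests presumably want.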