sageserpent-open opened this issue 2 months ago
Another idea - further parallelization of fingerprinting. At the time of writing (main at f3b353220f1fad04064b4dc99b2d16210e72b71e), fingerprinting is executed in parallel across files.
This could be extended so that for large files, fingerprinting takes place in parallel swathes within each file. Each swathe would need to be primed with the bytes preceding it (a cost that is optimised away when the entire file is fingerprinted in one swathe), but for large files one hopes that the gain from parallelism outweighs this overhead.
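To make the swathe idea concrete, here is a minimal sketch. The fingerprinting function and the partitioning scheme are stand-ins, not the real implementation; the point is that each swathe owns a contiguous range of window start positions, so the only cross-swathe cost is re-reading the `windowSize - 1` bytes that prime each boundary.

```scala
object SwatheFingerprinting {
  // A simple polynomial hash over one window; this stands in for the real
  // fingerprinting and is purely illustrative.
  def windowFingerprint(bytes: Array[Byte], start: Int, windowSize: Int): Long =
    (start until start + windowSize).foldLeft(0L)((acc, index) =>
      acc * 31L + (bytes(index) & 0xff))

  // Partition the window *start positions* into swathes. Each swathe can be
  // fingerprinted independently (each group could be handed to its own
  // thread), because every window it owns lies entirely within bytes it can
  // read; the priming overhead is the overlap re-read at each boundary.
  def fingerprintInSwathes(bytes: Array[Byte],
                           windowSize: Int,
                           numberOfSwathes: Int): Seq[Long] = {
    val starts = 0 to bytes.length - windowSize
    starts
      .grouped(math.max(1, starts.size / numberOfSwathes))
      .toSeq
      // A parallel traversal (e.g. `Future.traverse`) would go here; kept
      // sequential in this sketch so the ordering guarantee is obvious.
      .flatMap(swathe => swathe.map(windowFingerprint(bytes, _, windowSize)))
  }
}
```

Because the swathes cover disjoint, contiguous runs of start positions, concatenating their results in swathe order reproduces exactly the sequential fingerprint stream.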
Another idea - can the fingerprinting done by `RollingHash` use plain `Long` arithmetic as well as `BigInt`? This could be used either when looking for fingerprints for a small match window size, or perhaps as a sloppy approximation when the target files being compared are all large. We have to make sure that fingerprints compared across files are all computed in the same way to infer potential matches - well actually, we don't, but we may end up with a lot of spurious potential matches. Hmm...
Worth doing some spikes on this idea...
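As a starting point for such a spike, here is one way the `Long`-only arithmetic could work: a Rabin–Karp style rolling hash whose modulus is a prime below 2^31, so that every intermediate product fits comfortably in a `Long` with no `BigInt` needed. The class name, constants, and method shapes are illustrative assumptions, not `RollingHash`'s actual API.

```scala
// Illustrative sketch, not the real RollingHash: with modulus < 2^31 and
// base, byte values < 2^9, every product below stays under 2^40, so plain
// Long arithmetic never overflows.
final class LongRollingHash(windowSize: Int) {
  private val modulus: Long = 2147483629L // a prime just below 2^31
  private val base: Long    = 257L
  // base^(windowSize - 1) mod modulus: the weight of the outgoing byte.
  private val leadingWeight: Long =
    (1 until windowSize).foldLeft(1L)((acc, _) => acc * base % modulus)

  // Hash an initial full window from scratch.
  def initial(window: Seq[Byte]): Long =
    window.foldLeft(0L)((acc, b) => (acc * base + (b & 0xff)) % modulus)

  // Slide the window one byte: drop `outgoing`, append `incoming`.
  def roll(hash: Long, outgoing: Byte, incoming: Byte): Long = {
    val withoutOutgoing =
      (hash - (outgoing & 0xff) * leadingWeight % modulus + modulus) % modulus
    (withoutOutgoing * base + (incoming & 0xff)) % modulus
  }
}
```

The caveat from above applies: a `Long`-based hash and a `BigInt`-based hash yield different fingerprints for the same content, so comparing them across files would only surface (many) spurious potential matches unless both sides use the same scheme.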
This follows on from #34, trying to win some more performance improvement.
The first thing to investigate is whether the subsumption and/or overlap tests for sections that form part of a potential match can be carried out more quickly using an alternative implementation to the existing fingertree one - there are other data structures suited to this, such as interval trees and segment trees.
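For reference, these are the two predicates any replacement structure has to answer; `Section` here is a hypothetical stand-in for the project's own section type, modelled as a half-open range. An interval tree keyed on start position with max-end annotations could answer "does anything overlap?" or "does anything subsume this?" in O(log n) per query.

```scala
// Hypothetical stand-in for the project's section type: a half-open
// [start, onePastEnd) range of positions within a file.
final case class Section(start: Int, onePastEnd: Int) {
  // This section entirely contains `that`.
  def subsumes(that: Section): Boolean =
    start <= that.start && that.onePastEnd <= onePastEnd

  // The two half-open ranges share at least one position.
  def overlapsWith(that: Section): Boolean =
    start < that.onePastEnd && that.start < onePastEnd
}
```

Note that with half-open ranges, two sections that merely abut (one ending exactly where the other starts) neither overlap nor subsume each other, which is the behaviour the tests presumably want.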