nishihatapalmer / byteseek

A Java library for byte pattern matching and searching
BSD 3-Clause "New" or "Revised" License
40 stars 11 forks source link

Implement HashChain search #29

Open nishihatapalmer opened 2 years ago

nishihatapalmer commented 2 years ago

HashChain is pretty much the fastest search for longer patterns. Implement it for byteseek.

It works well for low alphabets and high alphabets, and is tunable for different memory requirements. It's a factor search, most similar to WFR in that it uses a rolling hash construction. It creates chains of hashes in the hash table with a fingerprint that allows verification that the next hash is correct before doing another index lookup.

Need to figure out where its worst cases get triggered. Search time rises dramatically (like QF, and WFR). In practice it seems quite hard to trigger if the hash table is sufficiently empty or we aren't dealing with pathological low alphabet or repetitive data. It still does better than WFR or QF in these situations though.

Nevertheless, byteseek needs to know if it is appropriate to select this algorithm or whether another slower but without the same extreme worst case behaviour needs to be used instead. Or if the pattern can be modified (use a substring) that would avoid the pathological case? A pattern composed only of zeros might not do well with hash chain.

nishihatapalmer commented 2 years ago

NB. HashChain is not yet finished or published. I'm still working on fully understanding and optimising the hash table construction, and defining what tuning parameters are required (automate as far as possible, use should only select memory really, although knowledge of low/high alphabet or types of data to search (shannon entropy?) could influence selection of qgram length and bitshift parameters.