nishihatapalmer / byteseek

A Java library for byte pattern matching and searching
BSD 3-Clause "New" or "Revised" License
40 stars 11 forks source link
pattern-matching regular-expression search-algorithm searching-algorithms

byteseek

byteseek is a Java library for efficiently matching patterns of bytes and searching for those patterns. Source code can be found at Github in the byteseek repository. Published releases of byteseek are also available on maven central. The main well-tested packages are:

Matcher

A package which contains various types of matcher for individual bytes or sequences of them.

Searcher

A package which contains implementations of various search algorithms. Most of them are sub-linear, which means they don't have to examine every position in an input source to find all possible matches. All the search algorithms have been extended to work with sequences which can match more than one byte at a given position. Any sequence search algorithm can work with any sequence matcher, no matter how it is composed. All the search implementations are stream-friendly - the length of an input source is not required unless you explicitly want to work at the end of an input source.

IO

Matchers and searchers can all work over byte arrays directly. In order to read efficiently from any other input source, readers provide a consistent random-access interface over files, input streams, strings and byte arrays. Pluggable caching strategies allow tailoring the memory and performance for different use cases.

This package may be generally useful, independent of byteseek matching and searching.

Parser

A byte-oriented regular expression language is given to allow the easy construction of byte matchers, sequence matchers, and (eventually) finite state automata. An abstract syntax tree is defined, so other regular expression syntaxes could be used if required.

Compiler

A package which contains compilers for all of the matchers from an abstract syntax tree.

Untested

Various other packages exist which are not currently tested, but will become so eventually. These include:

Matcher

Searcher

Compiler

Regular expressions are constructed as Glushkov finite state automata, rather than the more common Thompson construction. Glushkov automata are generally more compact and have no empty transitions, which can improve performance and makes them simpler to work with.

The normal construction for Glushkov automata involves a somewhat complex and recursive analysis stage. In byteseek, we construct a Glushkov automata directly from the abstract syntax tree, similarly to the Thompson construction but avoiding the need for any empty transitions. It has been adapted and extended from the method given in the paper below:

"A reexamination of the Glushkov and Thompson Constructions", by Dora Giammarresi, Jean-Luc Ponty, Derick Wood, 1998.

Automata