Open UnixJunkie opened 10 years ago
There is an implementation of the Knuth-Morris-Pratt string searching algorithm there in OCaml: http://gallium.inria.fr/blog/kmp/
Some interesting literature for Christmas: "Experimental Results on String Matching Algorithms" www-igm.univ-mlv.fr/~lecroq/articles/spe95.pdf
It's Boyer-Moore by the way, I always misspell it.
There is a rotting pull request here: https://github.com/ocaml-batteries-team/batteries-included/pull/377 For a different algorithm.
I'm interested in implementing faster string searching. Is this still useful?
I can implement different algorithms and compare their performance, but every algorithm has its advantages. I don't think we can afford too much preprocessing in String.find
, because this will degrade performance on short patterns. Two-way search or a simple hash (e.g. the sum of the characters) seems promising.
For more optimized algorithms, another function will be necessary (e.g. String.fast_find pattern text
), such that the preprocessing can be done separately (let finder = String.fast_find pattern
) from the searching (finder text
). Possible candidates are Knuth-Morris-Pratt, Boyer-Moore, Horspool, and all other variants, but I would prefer to keep it simple.
Another possibility is to choose an algorithm based on the size of the pattern and the text.
I think that this could be a useful thing to have, but also suspect that it does not need to be inside Batteries: you could probably have more users for the library without a Batteries dependency, and you don't really need the support of the existing Batteries functionality. You could have a library that works on string
, bytes
and maybe char Bigarray
, and provide an interruptible interface (if you reach the end of the string and need more input, return the current search state reified as a value to let users restart search after they get more input) to let people devise their own streaming/buffering strategies.
A new library for fast string searching would be nice too. Maybe I will do that if I find the time :stuck_out_tongue:
But there already is a function String.find
in Batteries, and it's slow. So would you consider replacing it if there is a simple alternative that is faster on every input?
I think there already are libraries on opam for string searching (e.g. "xstr", probably "stringext" — not sure —, "astring"…). You can also try to adapt my code from containers that uses KMP in general, and String.find
for patterns of length one.
All those libraries (except containers) use the naive algorithm.
I think that a library dedicated to string searching algorithms could be an occasion to have this specialized feature in a central place, independently from a "big library" (Batteries, Containers, Core, you name it), and compare algorithms.
@c-cube: do you have any idea of what is the performance difference between using your KMP implementation and ocaml-re
's string
combinator? (For example, on a program that reads a file line by line, uses the search algorithm to tell if the string " let "
appears in the string, and report the number of matching lines.)
I have no idea, though it would be interesting. I have a few benchmarks inside containers for KMP vs naive algorithm, though. A separate library would make sense (there is also a functorized KMP in containers, btw — for searching in bigarrays, arrays of scalars, etc.).
@314eter If you make a separate library for string searching on github, under a reasonably permissive license (not GPL; LGPL, MIT, BSD, etc. ok) I'd be happy to contribute my code.
I think for batteries we need two functions:
val compile: string -> t (* to prepare the string for fast search *)
val find: ~offset:int -> t -> string -> int (* to find it in another string *)
Boyer-Moore looks good enough for most people I think.
@314eter let's keep it simple ;)
@UnixJunkie the interface you propose is okay, but I think it would also be important to have a function that makes search incremental/interruptible: if at the end of the string there is no match, but there is a potential match in progress, and then I start searching in a new string (morally the continuation of the previous input), how can I make the algorithm start exactly in the state it was in?
This would call for an additional interface about as follows:
type search_state
val init : search_state
type find_result = Found of int | Eof of search_state
val find_incremental : ~offset:int -> t -> string -> search_state -> find_result
@UnixJunkie Boyer-Moore is not simple ;)
My Horspool implementation is faster than KMP on every input I tested and faster than naive on everything except for very short patterns. Boyer-Moore has a better worst-case complexity, but I suspect that it will be slower in most cases.
If you are interested to get your implementation in Batteries, feel free to open a pull request. We have a benchsuite
folder in which we include performance comparisons of various implementations, for others to reproduce and future reference, it would be nice to have a file for string searching in there. See for example benchsuite/bench_nreplace
for inspiration.
If there is a quick test you can perform in the String.{,r}find{,_from}
implementation to dynamically switch to a less naive implementation (I'd guess testing both the pattern length and the string length), it is good -- users get faster code without upgrading. You should also explicitly expose the algorithm, I think, for people that want to access the precomputation interface.
I implemented and benchmarked a few string searching algorithms. Unfortunately, there is no clear winner.
There are ocaml implementations of Boyer-Moore in there: https://github.com/TWal/ENS_boyermoore/tree/master/ocaml
I have a decent implementation of KMP in containers (with special case for 1-char patterns), and would be interested in benchmarking it against other implementations. It definitely goes faster than the naive string search in my benchmarks.
I compared the Boyer-Moore implementation (moore
) and the KMP of containers (containers
) with my implementations (boyermoore
and kmp
).
They perform very comparable. Containers is slower, but that's probably because it's more flexible (searching in two directions).
What is the vertical axis measuring (is a higher position better or worse?). What are the length of the searched texts? We are interested in the behavior on large texts (are those functions usable for real-world data processing?), but also in the behavior on fairly small texts (length 10 or so), as it naturally occurs when looking for a word or (unicode?) symbol in, say, URLs or filesystem paths. Which of those benchmarks are indicative of the performance on small texts?
An explanation, the code, and more algorithms can be found here.
The vertical axis measures the time to count the number of occurrences of the pattern in the text. Lower is better.
The benchmarks are large texts, but all the algorithms are linear in the length of the text. For very small texts, the relative overhead of the pattern compilation will be too high (but not if you're looking for the same pattern in a lot of different small texts), but I didn't benchmark it. The problem is that it's difficult to find good benchmarks that cover all possible cases. Suggestions for more benchmarks are very welcome.
Thanks for the benchmarks! There is also the one-char case which can be optimized by falling back to String.index
, btw.
I think we still miss a Bayer-Moore kind of string search in batteries.