ocaml-batteries-team / batteries-included

Batteries Included project
http://ocaml-batteries-team.github.com/batteries-included/hdoc2/
Other
521 stars 106 forks source link

fast string searching #484

Open UnixJunkie opened 10 years ago

UnixJunkie commented 10 years ago

I think we still miss a Bayer-Moore kind of string search in batteries.

UnixJunkie commented 10 years ago

There is an implementation of the Knuth-Morris-Pratt string searching algorithm there in OCaml: http://gallium.inria.fr/blog/kmp/

UnixJunkie commented 10 years ago

Some interesting literature for Christmas: "Experimental Results on String Matching Algorithms" www-igm.univ-mlv.fr/~lecroq/articles/spe95.pdf‎

UnixJunkie commented 10 years ago

It's Boyer-Moore by the way, I always misspell it.

UnixJunkie commented 10 years ago

There is a rotting pull request here: https://github.com/ocaml-batteries-team/batteries-included/pull/377 For a different algorithm.

314eter commented 8 years ago

I'm interested in implementing faster string searching. Is this still useful?

I can implement different algorithms and compare their performance, but every algorithm has its advantages. I don't think we can afford too much preprocessing in String.find, because this will degrade performance on short patterns. Two-way search or a simple hash (e.g. the sum of the characters) seems promising.

For more optimized algorithms, another function will be necessary (e.g. String.fast_find pattern text), such that the preprocessing can be done separately (let finder = String.fast_find pattern) from the searching (finder text). Possible candidates are Knuth-Morris-Pratt, Boyer-Moore, Horspool, and all other variants, but I would prefer to keep it simple.

Another possibility is to choose an algorithm based on the size of the pattern and the text.

gasche commented 8 years ago

I think that this could be a useful thing to have, but also suspect that it does not need to be inside Batteries: you could probably have more users for the library without a Batteries dependency, and you don't really need the support of the existing Batteries functionality. You could have a library that works on string, bytes and maybe char Bigarray, and provide an interruptible interface (if you reach the end of the string and need more input, return the current search state reified as a value to let users restart search after they get more input) to let people devise their own streaming/buffering strategies.

314eter commented 8 years ago

A new library for fast string searching would be nice too. Maybe I will do that if I find the time :stuck_out_tongue:

But there already is a function String.find in Batteries, and it's slow. So would you consider replacing it if there is a simple alternative that is faster on every input?

c-cube commented 8 years ago

I think there already are libraries on opam for string searching (e.g. "xstr", probably "stringext" — not sure —, "astring"…). You can also try to adapt my code from containers that uses KMP in general, and String.find for patterns of length one.

314eter commented 8 years ago

All those libraries (except containers) use the naive algorithm.

gasche commented 8 years ago

I think that a library dedicated to string searching algorithms could be an occasion to have this specialized feature in a central place, independently from a "big library" (Batteries, Containers, Core, you name it), and compare algorithms.

@c-cube: do you have any idea of what is the performance difference between using your KMP implementation and ocaml-re's string combinator? (For example, on a program that reads a file line by line, uses the search algorithm to tell if the string " let " appears in the string, and report the number of matching lines.)

c-cube commented 8 years ago

I have no idea, though it would be interesting. I have a few benchmarks inside containers for KMP vs naive algorithm, though. A separate library would make sense (there is also a functorized KMP in containers, btw — for searching in bigarrays, arrays of scalars, etc.).

@314eter If you make a separate library for string searching on github, under a reasonably permissive license (not GPL; LGPL, MIT, BSD, etc. ok) I'd be happy to contribute my code.

UnixJunkie commented 8 years ago

I think for batteries we need two functions:

val compile: string -> t (* to prepare the string for fast search *)
val find: ~offset:int -> t -> string -> int (* to find it in another string *)

Boyer-Moore looks good enough for most people I think.

UnixJunkie commented 8 years ago

@314eter let's keep it simple ;)

gasche commented 8 years ago

@UnixJunkie the interface you propose is okay, but I think it would also be important to have a function that makes search incremental/interruptible: if at the end of the string there is no match, but there is a potential match in progress, and then I start searching in a new string (morally the continuation of the previous input), how can I make the algorithm start exactly in the state it was in?

This would call for an additional interface about as follows:

type search_state
val init : search_state
type find_result = Found of int | Eof of search_state
val find_incremental : ~offset:int -> t -> string -> search_state -> find_result
314eter commented 8 years ago

@UnixJunkie Boyer-Moore is not simple ;)

My Horspool implementation is faster than KMP on every input I tested and faster than naive on everything except for very short patterns. Boyer-Moore has a better worst-case complexity, but I suspect that it will be slower in most cases.

gasche commented 8 years ago

If you are interested to get your implementation in Batteries, feel free to open a pull request. We have a benchsuite folder in which we include performance comparisons of various implementations, for others to reproduce and future reference, it would be nice to have a file for string searching in there. See for example benchsuite/bench_nreplace for inspiration.

If there is a quick test you can perform in the String.{,r}find{,_from} implementation to dynamically switch to a less naive implementation (I'd guess testing both the pattern length and the string length), it is good -- users get faster code without upgrading. You should also explicitly expose the algorithm, I think, for people that want to access the precomputation interface.

314eter commented 8 years ago

I implemented and benchmarked a few string searching algorithms. Unfortunately, there is no clear winner.

UnixJunkie commented 8 years ago

There are ocaml implementations of Boyer-Moore in there: https://github.com/TWal/ENS_boyermoore/tree/master/ocaml

c-cube commented 8 years ago

I have a decent implementation of KMP in containers (with special case for 1-char patterns), and would be interested in benchmarking it against other implementations. It definitely goes faster than the naive string search in my benchmarks.

314eter commented 8 years ago

I compared the Boyer-Moore implementation (moore) and the KMP of containers (containers) with my implementations (boyermoore and kmp). benchmark

They perform very comparable. Containers is slower, but that's probably because it's more flexible (searching in two directions).

gasche commented 8 years ago

What is the vertical axis measuring (is a higher position better or worse?). What are the length of the searched texts? We are interested in the behavior on large texts (are those functions usable for real-world data processing?), but also in the behavior on fairly small texts (length 10 or so), as it naturally occurs when looking for a word or (unicode?) symbol in, say, URLs or filesystem paths. Which of those benchmarks are indicative of the performance on small texts?

314eter commented 8 years ago

An explanation, the code, and more algorithms can be found here.

The vertical axis measures the time to count the number of occurrences of the pattern in the text. Lower is better.

The benchmarks are large texts, but all the algorithms are linear in the length of the text. For very small texts, the relative overhead of the pattern compilation will be too high (but not if you're looking for the same pattern in a lot of different small texts), but I didn't benchmark it. The problem is that it's difficult to find good benchmarks that cover all possible cases. Suggestions for more benchmarks are very welcome.

c-cube commented 8 years ago

Thanks for the benchmarks! There is also the one-char case which can be optimized by falling back to String.index, btw.