Closed mfronczyk closed 10 years ago
cc @nham
Pretty weird! This also returns true: "1234567ah012345678901ah".contains("hah")
. Also, if you delete any of the numbers it gives the correct result. Also gives the correct result if you change the first 'a' or 'h'.
Is there value in switching to a simpler algorithm?
@huonw That was my thought as well. What's the fastest alternative to Two-Way thats simple to implement? Boyer-Moore-Horspool?
I don't know about fastest; would be worth researching.
In any case, we currently have an algorithm that takes infinite time to get the right answer for some inputs, so even temporarily switching back to a naive search would be a speed-up (correctness trumps performance :) ).
In addition to being buggy, the current implementation of find_str seems to be very inefficient. The following Rust code:
for a in range(0i, 5) {
let mut file = BufferedWriter::new(File::create(&Path::new("matches_rust.txt")));
for auction in auctions.iter() {
let mut i = -1;
for ts in terms.iter() {
i += 1;
for t in ts.iter() {
if auction.as_slice().contains(t.as_slice()) { // uses find_str
file.write_uint(i);
file.write(b" ");
break;
}
}
}
file.write(b"\n");
}
}
needs about x1.6 time more than it's C++ (Apple LLVM version 5.1 (clang-503.0.40) x86_64-apple-darwin13.3.0) equivalent:
for (int a = 0; a < 5; ++a) {
ofstream matches_file("matches_cpp.txt", ios::out | ios::binary);
for (const string& auction : auctions) {
int i = -1;
for (const auto& ts : terms) {
++i;
for (const auto& s : ts) {
if (auction.find(s) != string::npos) {
matches_file << i << " ";
break;
}
}
}
matches_file << endl;
}
}
I was thinking that it was because Rust is just much slower than C++, but what makes me think that it's find_str's fault is the fact that when I change the "contains" condition to something like this in both programs:
if (auction.as_slice().len() + t.as_slice().len()) % 10 == 0 {
the Rust version (surprisingly?) becomes faster - the C++ one needs x2.5 more time to execute. The average length of the "auction" string is 70 and the "t" string is 7 on average.
I'm far from an expert on string matching algorithms, but after a bit of googling I found this paper which benchmarks 85 different algorithms. I'm currently attempting to implement and benchmark three of them since they seem both generally fast and simple to implement:
The algorithm python uses for string search is pretty simple and pretty fast: http://effbot.org/zone/stringlib.htm
It seems to me that the naive algorithm with some optimizations (loop unrolling, code inlining, etc) gives the best results. Here's the libc++ implementation that turned out to be much faster than the Rust's one in my tests:
http://llvm.org/svn/llvm-project/libcxx/trunk/include/algorithm (look for __search)
@nham, the paper you referred to seems to compare algorithms that find ALL occurrences, but we most of the time need to find only one. I'm not sure if implementing any of the algorithms you proposed in the standard library makes sense. I'd suggest using the naive algorithm, but optimize the code (unroll loops, inline code, write some unsafe code to disable bounds checks if necessary, etc).
However, the algorithm used in Python might be applicable.
Please don't switch a programming environment's string matcher to anything with a quadratic worst-case. It is better to debug your implementation of two-way matching until it works.
If you really, really want to switch your two-way matcher for something "simpler", the simplest thing I know that has a linear worst-case time complexity and constant storage overhead works like this:
Step 1 is nontrivial, but it's slightly easier than two-way matching. It is Algorithm MS in Crochemore and Rytter's free "Text Algorithms" book. Steps 2 and 3 look exactly the same as two-way matching.
Again, two-way matching is the way to go. Please use two-way matching for your stock string matcher, not a hacky implementation of a naive matcher or something that requires an auxiliary table!
Good idea. Maybe it's slow because it's broken.
Ideally, the performance of the implementation should be comparable to the C++ one for searching for short patterns.
I can perform my tests again when you fix it.
@tmyklebu Thanks for the pointer to the Text Algorithms book, this seems much more readable than the paper. I'm going to give debugging Two-Way a shot.
"00abc01234567890123456789abc".contains("bcabc")
is another example that currently comes up true, for future reference.
maybe randomized testing using quick check could help here. If it can easily come up with examples for the this bug.
It seems that for UTF-8 strings the find_str method can return the byte index of the first matching substring, but there is no matching substring in fact. Here's the code:
Compiled with "rustc parser.rs"
And it returns:
Which indicates that the string "hah" was found at the end of the m string, but it isn't there - the end is "jah".
Rust version: rustc 0.12.0-pre-nightly (2e92c67dc 2014-08-28 23:56:20 +0000) 64bit OS: OS X 10.9.4