rust-lang / rust

find_str returns false positives #16878

Closed mfronczyk closed 10 years ago

mfronczyk commented 10 years ago

It seems that for UTF-8 strings the find_str method can return a byte index for a supposed match even though the substring does not actually occur in the string. Here's the code:

let m = "śśdóahżłjnckpyzjah";
let n = "hah";
println!("({})", m.as_bytes());
println!("({})", n.as_bytes());
let pos = m.find_str(n).unwrap();
let substring = m.slice(pos, pos + 3).as_bytes();
println!("{}", substring);

Compiled with "rustc parser.rs"

And it returns:

([197, 155, 197, 155, 100, 195, 179, 97, 104, 197, 188, 197, 130, 106, 110, 99, 107, 112, 121, 122, 106, 97, 104])
([104, 97, 104])
[106, 97, 104]

This indicates that the string "hah" was found at the end of m, but it isn't there - the string ends with "jah".

Rust version: rustc 0.12.0-pre-nightly (2e92c67dc 2014-08-28 23:56:20 +0000), 64-bit. OS: OS X 10.9.4
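
For reference, here is what correct behavior would look like, sketched against a str::find/contains API that takes a &str pattern (assumed here in place of 0.12's find_str):

fn main() {
    let m = "śśdóahżłjnckpyzjah";
    // "hah" never occurs in m, so a correct search must find nothing.
    assert_eq!(m.find("hah"), None);
    assert!(!m.contains("hah"));
}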

alexcrichton commented 10 years ago

cc @nham

nham commented 10 years ago

Pretty weird! "1234567ah012345678901ah".contains("hah") also returns true. Deleting any one of the digits gives the correct result, and so does changing the first 'a' or 'h'.

huonw commented 10 years ago

Is there value in switching to a simpler algorithm?

nham commented 10 years ago

@huonw That was my thought as well. What's the fastest alternative to Two-Way that's simple to implement? Boyer-Moore-Horspool?
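
For concreteness, a rough sketch of Boyer-Moore-Horspool over bytes (my own illustration in current syntax, not a drop-in for libstd):

// Boyer-Moore-Horspool: slide the search window using the bad-character rule
// keyed on the last byte of the current window.
fn horspool_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let (n, m) = (haystack.len(), needle.len());
    if m == 0 {
        return Some(0);
    }
    if m > n {
        return None;
    }
    // shift[b] = how far the window may move when its last byte is b.
    let mut shift = [m; 256];
    for (i, &b) in needle[..m - 1].iter().enumerate() {
        shift[b as usize] = m - 1 - i;
    }
    let mut pos = 0;
    while pos <= n - m {
        if &haystack[pos..pos + m] == needle {
            return Some(pos);
        }
        pos += shift[haystack[pos + m - 1] as usize];
    }
    None
}

It's typically sublinear in practice, but the worst case is still O(n*m), which is the objection raised further down the thread.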

huonw commented 10 years ago

I don't know about the fastest; it would be worth researching.

In any case, we currently have an algorithm that takes infinite time to get the right answer for some inputs, so even temporarily switching back to a naive search would be a speed-up (correctness trumps performance :) ).
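
A trivially correct byte-wise scan along those lines might look something like this (just a sketch in current syntax, not the libstd code):

fn naive_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    if needle.len() > haystack.len() {
        return None;
    }
    // Check every window; O(n*m) in the worst case, but it can never
    // report a match that isn't there.
    (0..=haystack.len() - needle.len())
        .find(|&i| &haystack[i..i + needle.len()] == needle)
}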

mfronczyk commented 10 years ago

In addition to being buggy, the current implementation of find_str seems to be very inefficient. The following Rust code:

for a in range(0i, 5) {
    let mut file = BufferedWriter::new(File::create(&Path::new("matches_rust.txt")));
    for auction in auctions.iter() {
        let mut i = -1;
        for ts in terms.iter() {
            i += 1;
            for t in ts.iter() {
                if auction.as_slice().contains(t.as_slice()) { // uses find_str
                    file.write_uint(i);
                    file.write(b" ");
                    break;
                }
            }
        }
        file.write(b"\n");
    }
}

needs about 1.6x more time than its C++ (Apple LLVM version 5.1 (clang-503.0.40), x86_64-apple-darwin13.3.0) equivalent:

for (int a = 0; a < 5; ++a) {
  ofstream matches_file("matches_cpp.txt", ios::out | ios::binary);

  for (const string& auction : auctions) {
    int i = -1;
    for (const auto& ts : terms) {
      ++i;
      for (const auto& s : ts) {
        if (auction.find(s) != string::npos) {
          matches_file << i << " ";
          break;
        }
      }
    }
    matches_file << endl;
  }
}

At first I thought it was just because Rust is much slower than C++, but what makes me think it's find_str's fault is that when I change the "contains" condition to something like this in both programs:

if (auction.as_slice().len() + t.as_slice().len()) % 10 == 0 {

the Rust version (surprisingly?) becomes faster - the C++ one needs 2.5x more time to execute. The "auction" strings are about 70 characters long on average and the "t" strings about 7.

nham commented 10 years ago

I'm far from an expert on string matching algorithms, but after a bit of googling I found this paper, which benchmarks 85 different algorithms. I'm currently attempting to implement and benchmark three of them, since they seem both generally fast and simple to implement.

alex commented 10 years ago

The algorithm Python uses for string search is pretty simple and pretty fast: http://effbot.org/zone/stringlib.htm

mfronczyk commented 10 years ago

It seems to me that the naive algorithm with some optimizations (loop unrolling, code inlining, etc.) gives the best results. Here's the libc++ implementation, which turned out to be much faster than Rust's in my tests:

http://llvm.org/svn/llvm-project/libcxx/trunk/include/algorithm (look for __search)

mfronczyk commented 10 years ago

@nham, the paper you referred to seems to compare algorithms that find ALL occurrences, but most of the time we only need to find one. I'm not sure implementing any of the algorithms you proposed in the standard library makes sense. I'd suggest using the naive algorithm but optimizing the code (unrolling loops, inlining, writing some unsafe code to disable bounds checks if necessary, etc.).

However, the algorithm used in Python might be applicable.

tmyklebu commented 10 years ago

Please don't switch a programming environment's string matcher to anything with a quadratic worst-case. It is better to debug your implementation of two-way matching until it works.

If you really, really want to switch your two-way matcher for something "simpler", the simplest thing I know that has a linear worst-case time complexity and constant storage overhead works like this:

  1. In linear time and constant space, find the lexicographically largest suffix of the pattern. Write the pattern as uv, where v is the lex-largest suffix and u is the rest of the string.
  2. Find occurrences of v, again in linear time and constant space. (This is possible because v is lex-largest.)
  3. If you find a v and it's at least len(u) characters after the last occurrence of v, check, using memcmp, that u precedes it. If it does, return a match; if not, do nothing.

Step 1 is nontrivial, but it's slightly easier than two-way matching. It is Algorithm MS in Crochemore and Rytter's free "Text Algorithms" book. Steps 2 and 3 look exactly the same as two-way matching.
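
Here's a rough sketch of the Step 1 scan (constant extra space, one pass; the code and variable names are my own illustration, not taken from the book):

// Returns the start index of the lexicographically largest suffix of p.
fn maximal_suffix_start(p: &[u8]) -> usize {
    let n = p.len();
    let mut start = 0;   // start of the best (lex-largest) suffix so far
    let mut cand = 1;    // start of the competing candidate suffix
    let mut k = 0;       // length of the common prefix matched so far
    let mut period = 1;  // current period, used to skip repeated blocks
    while cand + k < n {
        if p[cand + k] < p[start + k] {
            // candidate loses: skip past everything just compared
            cand += k + 1;
            k = 0;
            period = cand - start;
        } else if p[cand + k] == p[start + k] {
            if k + 1 == period {
                cand += period;
                k = 0;
            } else {
                k += 1;
            }
        } else {
            // candidate wins: it becomes the new maximal suffix
            start = cand;
            cand = start + 1;
            k = 0;
            period = 1;
        }
    }
    start
}

With that index in hand, u is p[..start] and v is p[start..], and Steps 2 and 3 proceed as described above.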

Again, two-way matching is the way to go. Please use two-way matching for your stock string matcher, not a hacky implementation of a naive matcher or something that requires an auxiliary table!

mfronczyk commented 10 years ago

Good idea. Maybe it's slow because it's broken.

Ideally, the performance of the implementation should be comparable to the C++ one for searching for short patterns.

I can perform my tests again when you fix it.

nham commented 10 years ago

@tmyklebu Thanks for the pointer to the Text Algorithms book; it seems much more readable than the paper. I'm going to give debugging Two-Way a shot.

nham commented 10 years ago

"00abc01234567890123456789abc".contains("bcabc") is another example that currently comes up true, for future reference.

bluss commented 10 years ago

Maybe randomized testing using quickcheck could help here, if it can easily come up with examples for this bug.
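
For example, a property along these lines (sketched against the external quickcheck crate; the exact API here is an assumption) would compare contains with a plain byte-window scan and shrink any counterexample it finds:

use quickcheck::quickcheck;

fn main() {
    fn prop(haystack: String, needle: String) -> bool {
        // Oracle: does any byte window of the haystack equal the needle?
        let expected = needle.is_empty()
            || haystack
                .as_bytes()
                .windows(needle.len())
                .any(|w| w == needle.as_bytes());
        haystack.contains(needle.as_str()) == expected
    }
    quickcheck(prop as fn(String, String) -> bool);
}

Random strings rarely contain one another, so a custom generator that draws needles out of the haystack would hit the interesting code path far more often.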