<functional>: Investigate other algorithms for Boyer-Moore delta2

microsoft / STL

MSVC's implementation of the C++ Standard Library.

Other

10.23k stars 1.51k forks source link

<functional>: Investigate other algorithms for Boyer-Moore delta2 #727

Open StephanTLavavej opened 4 years ago

StephanTLavavej commented 4 years ago

The Rytter correction implemented in #724 may not be the end of the saga. As mentioned in https://github.com/microsoft/STL/pull/724#issuecomment-615047283 , Mehlhorn developed a simpler correction (according to Knuth), published by Smit in 1982 and (apparently) further corrected (!!!) by Mzali and Thiel in 1983. Then, https://github.com/microsoft/STL/pull/724#issuecomment-615246820 mentioned a paper by Wang, proposing a potentially superior algorithm (a brief scan indicates that it might be more code than what we currently have, but with possibly different performance characteristics), and also provides Aho's improved version of KMP77.

We may want to investigate these other algorithms, especially if they result in faster table construction for certain patterns. Although #724 enhances our test coverage, extreme caution is needed, given how paper after paper has gotten this wrong.

mozjag commented 4 years ago

I don't think Yang claims the algorithm is superior in terms of performance, but it should be on par. It's rather a different way to end up with the same shift table, but in a way that's easier to understand. Quoting the abstract from the Yang paper:

The Boyer-Moore’s string matching algorithm uses two pre-computed tables skip, which utilizes the occurrence heuristic of symbols in a pattern, and shift, which utilizes the match heuristic of the pattern, for searching a string. Some experts have pointed out that the difficulty of understanding the computation of the shift table has hindered utilization of the algorithm in a wider range of applications. This paper describes an alternative way to compute the shift table. We believe that the new method is more intuitive and straightforward both conceptually and logically than the original method, and thus, easier to understand and to implement. Also, the new method has a O(m) complexity in both required space and time for a pattern of length m. Therefore, it preserves the high performance of the Boyer-Moore algorithm.

nimish commented 4 years ago

There are a bunch of other modifications to boyer-moore in the literature e.g. Apostolico-Giancarlo ; for short patterns Knuth-Morris-Pratt can be faster: https://filipjaniszewski.wordpress.com/2016/03/03/string-searching-algorithms-comparison/

shreevatsa commented 4 years ago

Just to elaborate on https://github.com/microsoft/STL/pull/724#issuecomment-615047283 where I referred to the addendum from the book: there I gave only the first and last of the relevant paragraphs but it occurs to me that the book may not be easy to find (especially at this time), so here is the full addendum (part of page 133, and pages 134 and 135), in case it helps. Apologies for including the text only as images (though maybe it's better that I don't type it in and potentially introduce further errors!). Disclaimer: My only knowledge here is of the fact that Knuth's papers have been collected into books; I haven't even read the paper carefully :-)

I've also checked that there are no relevant errata for these pages on the book's webpage, which, given the scrutiny that Knuth's books receive (by himself and others), might give some confidence. Going by this addendum it looks like Ole-Johan Dahl's program (which as far as I can tell is only published here on page 134) is both simpler than the original, and correct.

StephanTLavavej commented 4 years ago

@nimish, Knuth-Morris-Pratt and other string searching algorithms are indeed important for application programmers to be aware of so they can choose wisely. However, in this C++ Standard Library implementation, we are concerned only with implementing Boyer-Moore and Boyer-Moore-Horspool specifically, because only they were Standardized (in addition to the naive default search). Improved table construction algorithms for Boyer-Moore are therefore relevant, but entirely different searching algorithms are not (unless and until they are added to the C++ Standard). :smile_cat:

nimish commented 4 years ago

@nimish, Knuth-Morris-Pratt and other string searching algorithms are indeed important for application programmers to be aware of so they can choose wisely. However, in this C++ Standard Library implementation, we are concerned only with implementing Boyer-Moore and Boyer-Moore-Horspool specifically, because only they were Standardized (in addition to the naive default search). Improved table construction algorithms for Boyer-Moore are therefore relevant, but entirely different searching algorithms are not (unless and until they are added to the C++ Standard). 😸

Good to know -- what an odd choice by the committee to standardize a specific algorithm instead of its effects.

Given the other std requirements for first match only, you just need the Galil's rule to avoid the pathological quadratic case. A-G is an extension for multiple results.

no-identd commented 4 years ago

@nimish led to( the below) by your comment referencing the blog post by @fjanisze; let me drop a few additional text books y'all might wish to check this whole matter against:

https://www.worldscientific.com/worldscibooks/10.1142/4838 Crochemore, Maxime; Rytter, Wojciech — Jewels of Stringology: Text Algorithms (2002)

http://www.cambridge.org/us/catalogue/catalogue.asp?isbn=9780521848992 Crochemore, Maxime; Hancart, Christophe; Lecroq, Thierry — Algorithms on Strings (2007)

https://www.cambridge.org/core/books/125-problems-in-text-algorithms/47603814E26E3F28C4BF01AF2B7978DC Crochemore, Maxime; Lecroq, Thierry; Rytter, Wojciech — 125 Problems in Text Algorithms (with Solutions) (2021)

Perhaps there lurk more badly (cross)referenced wonders & sights of algorithmic bewilderment buried within their reference/bibliography sections. Or perhaps nothing new.

I'd—try to, and most likely fail miserably—check for you, but alas, that'd entail significantly exceeding both my monetary & my time budget. 🤡