shepmaster / jetscii

A tiny library to efficiently search strings for sets of ASCII characters and byte slices for sets of bytes.
Apache License 2.0
113 stars 20 forks source link

expose more primitive operations #13

Open BurntSushi opened 8 years ago

BurntSushi commented 8 years ago

As I understand it, the public API currently requires that haystacks be &str and that all needles be ASCII. Might you consider exposing a way to run a search on a &[u8]? And possibly also exposing a way to use arbitrary bytes instead of limiting it to ASCII? (I feel less strongly about the latter point than the former point.)

The specific reason why I'd want this is for use in regex. In particular, the internal matching engines can work on either &str or &[u8], so the prefix literal scanning operates on &[u8] so that it can be used with any of the matching engines. Since all inputs originate with &str (currently), I could unsafely transmute to &[u8] before calling out to jetscii, but I don't think that assumption will always hold in the future and the use of unsafe there seems a little unfortunate.

Other ideas? Thoughts?

shepmaster commented 8 years ago

All of this absolutely makes sense. I actually thought there was an issue for this already, as I'm sure someone else wanted it.

run a search on a &[u8] a way to use arbitrary bytes instead of limiting it to ASCII

Yeah, I'd probably just set up parallel structures that allow you to do whatever you want with bytes. This would allow folk to use it for binary files where you need to find 255 or somesuch.

For my own education, will you actually be searching for non-ASCII characters? If so, are you just going to search for the first byte of a multiple-byte UTF8 character and then do a sanity check to make sure it was the right one?

BurntSushi commented 8 years ago

@shepmaster regex today does indeed search on bytes, which may be any value in 0-255. The invariants are little more complex than I'd like, but:

  1. For the DFA, it matches on raw bytes. UTF-8 decoding is handled implicitly in the compiled program. Since we guarantee (by construction) that the automata can only enter a match state via valid UTF-8, we can do whatever kind of prefix scanning we like---even if it's only part of a single encoded codepoint.
  2. For the NFAs, they generally do UTF-8 decoding explicitly since the automata is itself defined in terms of codepoints. (Yup, this means we have two distinct flavors of automata in regex.) When prefix literals are extract from such a program, then the literals are themselves just sequences of codepoints. When we do the actual prefix scanning, if the prefix starts with a non-ASCII codepoint, then it will just pass the first byte to memchr (for example). The matching engine then picks up at the beginning of the literal match, so it never needs to deal with partial decoding or any of that.

With that said...

  1. Using memchr on a non-ASCII byte probably isn't wise for performance reasons, and would probably be better to run memchr on, say, the ending byte of a UTF-8 encoding of a single codepoint. The performance picture isn't exactly clear here though.
  2. In the future, I would like for Regex to be capable of matching on &[u8], which may contain arbitrary bytes. Today, search text must be valid UTF-8.

Hope that makes sense. Happy to answer more questions!

shepmaster commented 8 years ago

@BurntSushi what kind of API would you like to see? Strings are nice in that they have the Matcher API to bolt into. Byte slices don't seem to have the same thing. There's a bit of setup involved, so there probably wants to be a struct to hold it. My natural inclination is to have something like

let needle = some_macro_name!(needle_byte_1, needle_byte_2, ...);
needle.position(haystack);

But it would probably feel more natural to have haystack.position(needle) which would involve a trait. I'm pretty flexible here, so as an interested API consumer, I think your voice gets a lot of weight! :palm_tree:

BurntSushi commented 8 years ago

@shepmaster Hmm. The absolute simplest API I can think of is: fn pcmpestri(needles: &[u8], haystack: &[u8]) -> Option<usize> (excuse me if pcmpestri is not the appropriate name here). It could panic if needles.len() > 16.

I suspect what you're going to say is that this adds some kind of overhead to every search since the bytes in needles need to be packed into something before being passed to asm!. If that's true, then perhaps:

struct PcmpSearcher {
    // some packed representation?
}

impl PcmpSearcher {
    // Create a new PcmpNeedles with the bytes given.
    // This panics if needles.len() > 16.
    // Duplicate bytes are ignored. (Seems suspicious.)
    fn new(needles: &[u8]) -> Self { ... }
    fn position(&self, haystack: &[u8]) -> Option<usize> { ... }
}

(Naming isn't my specialty. :P)

You could even impl FromIterator so that let pcmp: PcmpSearcher = vec![b'a', b'b', b'c'].into_iter().collect() works, but that might be too much fluff. :-)

Other thoughts:

  1. I could live with a trait, but there's really nothing in this space for &[u8], so my natural inclination is to not add any.
  2. I think a macro would be regrettable.
BurntSushi commented 8 years ago

@shepmaster Thanks for working with me on this btw. I actually think this might help regex catch up to PCRE on a few benchmarks (because they are precisely the benchmarks with a small number of prefix bytes). At least on Rust nightly, anyway. :-)

shepmaster commented 8 years ago

I think a macro would be regrettable.

The macro is to allow for fallback cases where the intrinsic operations aren't available. Technically, I could make a few small changes and release Jetscii right now and it would work on stable Rust. It just isn't worth it because the fallback would be the only supported case!

Right now, when you call something like:

part_number.split(ascii_chars!('-', ':')).collect();

That macro expands into the packed structures needed for the intrinsic as well as an optimized closure that checks for those two bytes "manually" (part_number.as_bytes().position(|b| b == '-' || b == ':')). Conditional compilation means only one path is actually used.

This all corresponds to AsciiChars and AsciiCharsWithFallback.

If you compile with the right features (which the docs don't), then there are methods on AsciiChars like find, which returns the byte index.

It's possible that you'd prefer to have your own fallback code for when the intrinsics aren't available. If so, then you'd be able to use the raw-bytes equivalent of find (which I agree should be called position in the context of bytes).


I did think about the one-shot function, and it's something I'd be open to if enough people wanted it, but I look at it as a bit of antipattern - you probably don't want to call it in a tight loop, much like Regex::new.

The FromIterator implementation is indeed cute, and I'll have to give some thought to that. :smile_cat:

shepmaster commented 8 years ago

Thanks for working with me on this btw.

Absolutely! Having people use the library is the only way it will get better! And the faster Rust is, the more likely I am to get paid to write it for a living! :innocent:

BurntSushi commented 8 years ago

Hmm, right, so then I definitely agree that the one-shot case should just be completely thrown out the window. Encouraging folks to write slow code is bad (and is one reason why, e.g., I want to remove the free function is_match from regex when it hits 1.0).

With that said... you're right about the fallback. I definitely want to be able to control, up front, which type of prefix searching is used. Sometimes it's just memchr, sometimes it's Boyer Moore and other times it's Aho-Corasick (and of course, the latter two make use of memchr internally, and hopefully jetscii too!). If for some reason the prefix isn't amenable to pcmpstri, then I want to choose a different strategy at compile time.

I think the fallback code is nice from a usability perspective, but I do also believe there should be a way to execute pcmpstri with as few assumptions as possible. I think that's what I'd like most.

Now, if for some reason it's impossible to know whether a fallback is needed up front, then that could be a problem! (I feel like there shouldn't be, but my understanding of the pcmpstri instruction is extremely limited.)

With all that said, if you feel strongly about tying it to a fallback in the public API, then I can cope and write my own fallback routine. :-) (For example, I'd use memchr2 in the memchr crate before the naive iterator approach if there are two bytes.)

shepmaster commented 8 years ago

I don't think I am doing a great job explaining, but hopefully code will help. Grab the bytes branch. You should be able to use jetscii::bytes::Bytes:

extern crate jetscii;

use jetscii::bytes::Bytes;

fn main() {
    let words = include_bytes!("/tmp/words.gz");
    let mut needle = Bytes::new();
    needle.push(0x80);
    println!("{:?}", needle.position(words));
}

Make sure you compile Jetscii with features=["unstable"] to enable everything (and a nightly, of course).

The big thing is that Bytes::position does not exist unless you

  1. Have the feature enabled
  2. Are on x86_64

The fallback code I'm talking about makes it so a consumer of the library doesn't need to know about those concerns. You want to care about that, and this is the current way I expose it.

BurntSushi commented 8 years ago

@shepmaster Code is good, thanks for that. And it has made me realize that I had a major brain fart. For whatever reason (I blame lack of sleep), I was thinking that the fallback was used when a caller added more than 16 bytes to the needle, but now of course I realize that's silly. I forgot about the pcmp instructions not being supported everywhere!

So with that said... Yes, I would use with_fallback and do my special memchr stuff in there. :-) Sorry for the confusion! Up to you whether you should still have position on the Bytes type.

shepmaster commented 8 years ago

I would use with_fallback and do my special memchr stuff in there

Hmm. That makes sense, but could be a bit tricky. Right now, the fallback takes a Fn(u8) -> bool which is usually provided with a closure like |b| b == b'A' || b == b'Z'. It's really impressive how well the compiler optimizes and inlines those!

In fact, a lot of my existing usage has centered around a set of characters fixed at compile time. For example, XML parsing can benefit from looking for the first of ['&', '<', '>']. You can see the benchmarks to see that an inlined closure is about 1 GB/s faster than a slice!

In your case, you'd rather a fallback that operates more like Fn(&[u8]) -> Option<usize>, basically a complete substitute implementation.

I can certainly add something like that, but it starts to feel a bit circular. What would you actually be using for your fallback? I'd assume something like memchr. If that's the case, then perhaps that's what Jetscii should actually be using under the hood any way.

BurntSushi commented 8 years ago

What would you actually be using for your fallback?

Yes, absolutely. :-)

The memchr functions are here: http://burntsushi.net/rustdoc/memchr/ Note the existence of memchr2 and memchr3, which I cobbled together specifically in cases where something like jetscii can't be used. They should be faster than the naive |b| b == b'A' || b == b'Z'.

@huonw - I wonder if you might have any special knowledge in this space, since I know you've been working on simd stuff. (I've looked at it, but can't quite grok how to use it for substring/byte matching!)

shepmaster commented 8 years ago

One thing I notice is that your performance numbers for memchr blow Jetscii out of the water:

which speed
Iterator::position 1893 MB/s
memchr 49504 MB/s
Jetscii 5719 MB/s

I'm not even sure why you'd want to switch!

BurntSushi commented 8 years ago

@shepmaster Hehe, well that's memchr, which only works on a single needle and is defined inside libc which is probably doing all sorts of cool things. It would be better to compare jetscii with memchr2 and memchr3, which I wrote based on memchr's fallback code written by @bluss. Even then, memchr2 and memchr3 only support 2 and 3 needles respectively, whereas jetscii will go up to 16.

To be clear, when there's a single leading byte, I won't use jetscii---I'll just use memchr. But if there's more than that.... :-)

BurntSushi commented 8 years ago

Here are the latest benchmarks in memchr from Travis: https://travis-ci.org/BurntSushi/rust-memchr/jobs/109233136#L251-L258 --- I'm not sure if those numbers are directly comparable to what you have in jetscii's README (but you could run the benchmarks on the same machine easily enough).

BurntSushi commented 8 years ago

@shepmaster Anything I can do to help move this along? :-)

shepmaster commented 8 years ago

@BurntSushi Let's try to get this started again! Maybe one thing that would help me would be if you could point (or hand-wave) to where ever some jetscii code would fit in.

I do see there's some SIMD stuff in regex already, so don't feel bad telling me that I'm just too late to catch the train with the cool kids 😊

BurntSushi commented 8 years ago

@shepmaster The SIMD stuff did relieve some of the pressure, but I bet jetscii could still be useful, for example, for matching an alternation of single bytes. e.g., the regex (a|b|c|d|e)\w+ seems like it would benefit from jetscii. The specific place in the code is probably as a different literal searcher: https://github.com/rust-lang-nursery/regex/blob/master/src/literals.rs#L47

shepmaster commented 8 years ago

Thank you! The code is pretty easy to understand; very nice!

I was on a plane and didn't have the benchmark suite downloaded, but the test suite takes 94% of the previous time with the Jetscii stuff added; an encouraging sign.

shepmaster commented 8 years ago

@BurntSushi care to give this an appraisal?

 name                                     bench-master ns/iter   bench-more-assembly ns/iter  diff ns/iter   diff % 
 misc::anchored_literal_long_match        24 (16250 MB/s)        26 (15000 MB/s)                         2    8.33% 
 misc::anchored_literal_long_non_match    27 (14444 MB/s)        25 (15600 MB/s)                        -2   -7.41% 
 misc::anchored_literal_short_match       23 (1130 MB/s)         24 (1083 MB/s)                          1    4.35% 
 misc::anchored_literal_short_non_match   27 (962 MB/s)          26 (1000 MB/s)                         -1   -3.70% 
 misc::easy0_1K                           69 (15231 MB/s)        69 (15231 MB/s)                         0    0.00% 
 misc::easy0_1MB                          87 (12052908 MB/s)     89 (11782056 MB/s)                      2    2.30% 
 misc::easy0_32                           69 (855 MB/s)          69 (855 MB/s)                           0    0.00% 
 misc::easy0_32K                          69 (475289 MB/s)       69 (475289 MB/s)                        0    0.00% 
 misc::easy1_1K                           54 (19333 MB/s)        55 (18981 MB/s)                         1    1.85% 
 misc::easy1_1MB                          57 (18396421 MB/s)     57 (18396421 MB/s)                      0    0.00% 
 misc::easy1_32                           55 (945 MB/s)          56 (928 MB/s)                           1    1.82% 
 misc::easy1_32K                          54 (607185 MB/s)       56 (585500 MB/s)                        2    3.70% 
 misc::hard_1K                            69 (15231 MB/s)        68 (15455 MB/s)                        -1   -1.45% 
 misc::hard_1MB                           85 (12336505 MB/s)     86 (12193058 MB/s)                      1    1.18% 
 misc::hard_32                            68 (867 MB/s)          70 (842 MB/s)                           2    2.94% 
 misc::hard_32K                           69 (475289 MB/s)       69 (475289 MB/s)                        0    0.00% 
 misc::literal                            20 (2550 MB/s)         18 (2833 MB/s)                         -2  -10.00% 
 misc::long_needle1                       3,230 (30960 MB/s)     3,221 (31046 MB/s)                     -9   -0.28% 
 misc::long_needle2                       812,507 (123 MB/s)     822,673 (121 MB/s)                 10,166    1.25% 
 misc::match_class                        99 (818 MB/s)          32 (2531 MB/s)                        -67  -67.68% 
 misc::match_class_in_range               36 (2250 MB/s)         31 (2612 MB/s)                         -5  -13.89% 
 misc::medium_1K                          71 (14816 MB/s)        72 (14611 MB/s)                         1    1.41% 
 misc::medium_1MB                         88 (11915954 MB/s)     89 (11782067 MB/s)                      1    1.14% 
 misc::medium_32                          72 (833 MB/s)          72 (833 MB/s)                           0    0.00% 
 misc::medium_32K                         72 (455500 MB/s)       72 (455500 MB/s)                        0    0.00% 
 misc::no_exponential                     500 (200 MB/s)         512 (195 MB/s)                         12    2.40% 
 misc::not_literal                        129 (395 MB/s)         135 (377 MB/s)                          6    4.65% 
 misc::one_pass_long_prefix               74 (351 MB/s)          74 (351 MB/s)                           0    0.00% 
 misc::one_pass_long_prefix_not           73 (356 MB/s)          74 (351 MB/s)                           1    1.37% 
 misc::one_pass_short                     54 (314 MB/s)          55 (309 MB/s)                           1    1.85% 
 misc::one_pass_short_not                 56 (303 MB/s)          55 (309 MB/s)                          -1   -1.79% 
 misc::reallyhard2_1K                     96 (10833 MB/s)        100 (10400 MB/s)                        4    4.17% 
 misc::reallyhard_1K                      2,328 (451 MB/s)       2,332 (450 MB/s)                        4    0.17% 
 misc::reallyhard_1MB                     2,349,958 (446 MB/s)   2,448,527 (428 MB/s)               98,569    4.19% 
 misc::reallyhard_32                      147 (401 MB/s)         158 (373 MB/s)                         11    7.48% 
 misc::reallyhard_32K                     72,215 (454 MB/s)      70,968 (462 MB/s)                  -1,247   -1.73% 
 misc::reverse_suffix_no_quadratic        7,563 (1057 MB/s)      2,088 (3831 MB/s)                  -5,475  -72.39% 
 regexdna::find_new_lines                 18,613,370 (273 MB/s)  18,270,181 (278 MB/s)            -343,189   -1.84% 
 regexdna::subst1                         1,095,902 (4638 MB/s)  1,149,494 (4422 MB/s)              53,592    4.89% 
 regexdna::subst10                        1,159,534 (4384 MB/s)  1,182,921 (4297 MB/s)              23,387    2.02% 
 regexdna::subst11                        1,103,694 (4605 MB/s)  1,137,528 (4468 MB/s)              33,834    3.07% 
 regexdna::subst2                         1,119,467 (4540 MB/s)  1,146,736 (4432 MB/s)              27,269    2.44% 
 regexdna::subst3                         1,099,312 (4624 MB/s)  1,190,286 (4270 MB/s)              90,974    8.28% 
 regexdna::subst4                         1,123,742 (4523 MB/s)  1,131,335 (4493 MB/s)               7,593    0.68% 
 regexdna::subst5                         1,098,962 (4625 MB/s)  1,134,847 (4479 MB/s)              35,885    3.27% 
 regexdna::subst6                         1,099,016 (4625 MB/s)  1,133,407 (4485 MB/s)              34,391    3.13% 
 regexdna::subst7                         1,101,411 (4615 MB/s)  1,132,839 (4487 MB/s)              31,428    2.85% 
 regexdna::subst8                         1,118,263 (4545 MB/s)  1,191,274 (4267 MB/s)              73,011    6.53% 
 regexdna::subst9                         1,127,738 (4507 MB/s)  1,113,407 (4565 MB/s)             -14,331   -1.27% 
 regexdna::variant1                       4,895,873 (1038 MB/s)  4,997,135 (1017 MB/s)             101,262    2.07% 
 regexdna::variant2                       8,394,295 (605 MB/s)   8,543,739 (594 MB/s)              149,444    1.78% 
 regexdna::variant3                       9,993,058 (508 MB/s)   10,457,008 (486 MB/s)             463,950    4.64% 
 regexdna::variant4                       10,092,960 (503 MB/s)  10,644,563 (477 MB/s)             551,603    5.47% 
 regexdna::variant5                       8,498,861 (598 MB/s)   8,773,161 (579 MB/s)              274,300    3.23% 
 regexdna::variant6                       8,309,121 (611 MB/s)   8,289,828 (613 MB/s)              -19,293   -0.23% 
 regexdna::variant7                       8,707,494 (583 MB/s)   8,932,302 (569 MB/s)              224,808    2.58% 
 regexdna::variant8                       8,883,837 (572 MB/s)   9,022,304 (563 MB/s)              138,467    1.56% 
 regexdna::variant9                       8,800,123 (577 MB/s)   8,819,907 (576 MB/s)               19,784    0.22% 
 sherlock::before_after_holmes            1,318,164 (451 MB/s)   1,334,614 (445 MB/s)               16,450    1.25% 
 sherlock::before_holmes                  86,066 (6912 MB/s)     87,794 (6776 MB/s)                  1,728    2.01% 
 sherlock::everything_greedy              2,955,550 (201 MB/s)   2,963,474 (200 MB/s)                7,924    0.27% 
 sherlock::everything_greedy_nl           1,280,764 (464 MB/s)   1,264,705 (470 MB/s)              -16,059   -1.25% 
 sherlock::holmes_cochar_watson           212,473 (2800 MB/s)    215,042 (2766 MB/s)                 2,569    1.21% 
 sherlock::holmes_coword_watson           710,506 (837 MB/s)     713,156 (834 MB/s)                  2,650    0.37% 
 sherlock::ing_suffix                     478,865 (1242 MB/s)    506,858 (1173 MB/s)                27,993    5.85% 
 sherlock::ing_suffix_limited_space       1,537,909 (386 MB/s)   1,690,583 (351 MB/s)              152,674    9.93% 
 sherlock::letters                        28,868,458 (20 MB/s)   31,051,139 (19 MB/s)            2,182,681    7.56% 
 sherlock::letters_lower                  28,242,104 (21 MB/s)   30,509,948 (19 MB/s)            2,267,844    8.03% 
 sherlock::letters_upper                  2,372,619 (250 MB/s)   2,548,747 (233 MB/s)              176,128    7.42% 
 sherlock::line_boundary_sherlock_holmes  1,356,437 (438 MB/s)   1,379,389 (431 MB/s)               22,952    1.69% 
 sherlock::name_alt1                      42,433 (14020 MB/s)    43,734 (13603 MB/s)                 1,301    3.07% 
 sherlock::name_alt2                      183,175 (3247 MB/s)    191,617 (3104 MB/s)                 8,442    4.61% 
 sherlock::name_alt3                      195,035 (3050 MB/s)    204,909 (2903 MB/s)                 9,874    5.06% 
 sherlock::name_alt3_nocase               1,476,059 (403 MB/s)   1,547,693 (384 MB/s)               71,634    4.85% 
 sherlock::name_alt4                      227,746 (2612 MB/s)    256,287 (2321 MB/s)                28,541   12.53% 
 sherlock::name_alt4_nocase               310,494 (1916 MB/s)    337,413 (1763 MB/s)                26,919    8.67% 
 sherlock::name_alt5                      188,038 (3163 MB/s)    209,919 (2834 MB/s)                21,881   11.64% 
 sherlock::name_alt5_nocase               822,826 (723 MB/s)     894,400 (665 MB/s)                 71,574    8.70% 
 sherlock::name_holmes                    57,112 (10416 MB/s)    54,114 (10994 MB/s)                -2,998   -5.25% 
 sherlock::name_holmes_nocase             274,654 (2166 MB/s)    275,834 (2156 MB/s)                 1,180    0.43% 
 sherlock::name_sherlock                  76,547 (7772 MB/s)     79,323 (7500 MB/s)                  2,776    3.63% 
 sherlock::name_sherlock_holmes           42,817 (13894 MB/s)    44,528 (13360 MB/s)                 1,711    4.00% 
 sherlock::name_sherlock_holmes_nocase    241,296 (2465 MB/s)    244,435 (2433 MB/s)                 3,139    1.30% 
 sherlock::name_sherlock_nocase           236,539 (2515 MB/s)    247,517 (2403 MB/s)                10,978    4.64% 
 sherlock::name_whitespace                89,045 (6681 MB/s)     88,568 (6717 MB/s)                   -477   -0.54% 
 sherlock::no_match_common                28,122 (21155 MB/s)    28,633 (20777 MB/s)                   511    1.82% 
 sherlock::no_match_really_common         410,134 (1450 MB/s)    459,290 (1295 MB/s)                49,156   11.99% 
 sherlock::no_match_uncommon              27,861 (21353 MB/s)    30,621 (19428 MB/s)                 2,760    9.91% 
 sherlock::quotes                         644,258 (923 MB/s)     633,001 (939 MB/s)                -11,257   -1.75% 
 sherlock::repeated_class_negation        108,826,573 (5 MB/s)   112,777,709 (5 MB/s)            3,951,136    3.63% 
 sherlock::the_lower                      736,800 (807 MB/s)     776,598 (766 MB/s)                 39,798    5.40% 
 sherlock::the_nocase                     603,604 (985 MB/s)     625,625 (950 MB/s)                 22,021    3.65% 
 sherlock::the_upper                      55,682 (10684 MB/s)    61,550 (9665 MB/s)                  5,868   10.54% 
 sherlock::the_whitespace                 1,363,712 (436 MB/s)   1,373,879 (433 MB/s)               10,167    0.75% 
 sherlock::word_ending_n                  2,476,486 (240 MB/s)   2,509,844 (237 MB/s)               33,358    1.35% 
 sherlock::words                          11,806,797 (50 MB/s)   11,842,803 (50 MB/s)               36,006    0.30% 
shepmaster commented 7 years ago

poke @BurntSushi

BurntSushi commented 7 years ago

@shepmaster Can you show the code you used to generate those benchmarks? It looks like the improvement to match_class is because of jetscii, right? I was surprised to see the improvement to reverse_suffix_no_quadratic as well, since that should be using a suffix literal, maybe you changed it to use the beginning character class?

shepmaster commented 7 years ago

Can you show the code you used to generate those benchmarks?

Oh, yes, that was silly of me to omit. Here's my regex branch. You'll see that it's basically a direct clone of the existing types. I also have changes to jetscii, which basically boil down to copy-paste-remove checks.

maybe you changed it to use the beginning character class?

Nope! That's one of many reasons why your expertise is requested 😜.