Investigation into ASCII ctype inherent methods performance with lookup table

CDirkx commented 4 years ago

Prompted by https://github.com/rust-lang/rust/issues/68983#issuecomment-613753557, I started looking into whether the performance of methods like is_ascii_alphabetic could be improved by using a lookup table.

The full list of methods:

is_ascii_alphabetic
is_ascii_alphanumeric
is_ascii_control
is_ascii_digit
is_ascii_graphic
is_ascii_hexdigit
is_ascii_lowercase
is_ascii_punctuation
is_ascii_uppercase
is_ascii_whitespace

Implementations

Currently, all of these methods are implemented by matching on characters/ranges:

#[inline]
pub const fn is_ascii_alphabetic(&self) -> bool {
    matches!(*self, b'A'..=b'Z' | b'a'..=b'z')
}

I investigated two ways of implementing these functions with a lookup table:

A lookup table for the entire u8 range:

#[inline]
pub const fn is_ascii_alphabetic(&self) -> bool {
    const LOOKUP_TABLE : [bool; 258] = ...;

    LOOKUP_TABLE[*self as usize]
}

A hybrid approach by checking if the byte is ascii first, reducing table size by half:

#[inline]
pub const fn is_ascii_alphabetic(&self) -> bool {
    const LOOKUP_TABLE : [bool; 128] = ...;

    self.is_ascii() && LOOKUP_TABLE[*self as usize]
}

I will be calling these implementations branch, lookup and hybrid respectively throughout the rest of this report.

Using the features const_if_match and const_loop, the lookup tables for lookup and hybrid can be easily generated:

```rust macro lookup_table($condition:expr, $n:literal) { { const LOOKUP_TABLE: [bool; $n] = { let mut lookup = [false; $n]; let mut i = 0; while i < $n { lookup[i] = $condition(&(i as u8)); i += 1; } lookup }; LOOKUP_TABLE } } #[inline] pub const fn is_ascii_alphabetic_lookup(byte: &u8) -> bool { lookup_table!(u8::is_ascii_alphabetic, 256)[*self as usize] } #[inline] pub const fn is_ascii_alphabetic_hybrid(byte: &u8) -> bool { byte.is_ascii() && lookup_table!(u8::is_ascii_alphabetic, 128)[*self as usize] } ```

The instructions these different implementations compile down to can be compared on godbolt.

Benchmark

Note: I do not have enough experience with benchmarking to know if this approach is fully representative, let me know if you know any way the benchmark can be improved.

The task I used for benchmarking is iterating through the bytes of a source text and counting how many of them satisfy a condition, e.g. is_ascii_alphabetic:

const SOURCE_TEXT : &'static str = ...;

#[bench]
fn is_ascii_alphabetic_branch {
    $bench.iter(|| {
        let mut total = 0;
        for byte in SOURCE_TEXT.bytes() {
            if byte.is_ascii_alphabetic { total += 1; }
        }

        assert_eq!(total, ...);
    });
}

This results in a tight loop, which due to caching might be favorable to the lookup tables. However, some quick searching for the use of the ascii methods in open source projects reveals at least a few instances of them being used in a loop or iterator filter, so the benchmark is representative of at least some real-world usage.

As a source text I used "Hamlet, Prince of Denmark", adapted from https://www.gutenberg.org/files/27761/27761-0.txt, a primarily ascii text of 4k+ lines:

Benchmark

```rust #![feature(test)] #![feature(decl_macro)] #![feature(const_if_match)] #![feature(const_loop)] #![feature(const_ascii_ctype_on_intrinsics)] extern crate test; macro lookup_table($condition:expr, $n:literal) { { const LOOKUP_TABLE: [bool; $ n] = { let mut lookup_table = [false; $ n]; let mut i = 0; while i < $ n { lookup_table[i] = $ condition(&(i as u8)); i += 1; } lookup_table }; LOOKUP_TABLE } } macro bench($condition:ident, $branch_ident:ident, $hybrid_ident:ident, $lookup_ident:ident, $expected:literal) { #[bench] fn $branch_ident(bench: &mut test::Bencher) { bench_impl!(bench, u8::$condition, $expected); } #[bench] fn $hybrid_ident(bench: &mut test::Bencher) { #[inline] fn condition(byte: &u8) -> bool { byte.is_ascii() && lookup_table!(u8::$condition, 128)[*byte as usize] } bench_impl!(bench, condition, $expected); } #[bench] fn $lookup_ident(bench: &mut test::Bencher) { #[inline] fn condition(byte: &u8) -> bool { lookup_table!(u8::$condition, 256)[*byte as usize] } bench_impl!(bench, condition, $expected); } } macro bench_impl($bench:expr, $condition:expr, $expected:literal) { $bench.iter(|| { let mut total = 0; for byte in SOURCE_TEXT.bytes() { if $condition(&byte) { total += 1; } } assert_eq!(total, $expected); }); } // "Hamlet, Prince of Denmark", adapted from https://www.gutenberg.org/files/27761/27761-0.txt const SOURCE_TEXT: &'static str = include_str!("hamlet.txt"); bench!(is_ascii_alphabetic, is_ascii_alphabetic_branch, is_ascii_alphabetic_hybrid, is_ascii_alphabetic_lookup, 83663); bench!(is_ascii_alphanumeric, is_ascii_alphanumeric_branch, is_ascii_alphanumeric_hybrid, is_ascii_alphanumeric_lookup, 83718); bench!(is_ascii_control, is_ascii_control_branch, is_ascii_control_hybrid, is_ascii_control_lookup, 8216); bench!(is_ascii_digit, is_ascii_digit_branch, is_ascii_digit_hybrid, is_ascii_digit_lookup, 55); bench!(is_ascii_graphic, is_ascii_graphic_branch, is_ascii_graphic_hybrid, is_ascii_graphic_lookup, 94284); bench!(is_ascii_hexdigit, is_ascii_hexdigit_branch, is_ascii_hexdigit_hybrid, is_ascii_hexdigit_lookup, 23825); bench!(is_ascii_lowercase, is_ascii_lowercase_branch, is_ascii_lowercase_hybrid, is_ascii_lowercase_lookup, 76787); bench!(is_ascii_punctuation, is_ascii_punctuation_branch, is_ascii_punctuation_hybrid, is_ascii_punctuation_lookup, 10566); bench!(is_ascii_uppercase, is_ascii_uppercase_branch, is_ascii_uppercase_hybrid, is_ascii_uppercase_lookup, 6876); bench!(is_ascii_whitespace, is_ascii_whitespace_branch, is_ascii_whitespace_hybrid, is_ascii_whitespace_lookup, 33893); ```

Files: benches.zip

Results

method	`branch`	`hybrid`	`lookup`
`is_ascii_alphabetic`	364,130 ±15,302	349,235 ±32,393	402,965 ±22,624
`is_ascii_alphanumeric`	310,435 ±12,975	346,025 ±18,873	404,880 ±23,582
`is_ascii_control`	154,802 ±17,761	105,160 ±10,883	197,493 ±73,893
`is_ascii_digit`	96,971 ± 23,020	98,597 ± 9,754	97,051 ± 4,354
`is_ascii_graphic`	274,238 ± 12,480	296,010 ± 24,268	340,445 ± 32,739
`is_ascii_hexdigit`	358,390 ± 21,165	280,170 ± 17,197	309,484 ±9,983
`is_ascii_lowercase`	329,840 ± 45,032	334,270 ±21,637	398,315 ±18,173
`is_ascii_punctuation`	372,600 ± 22,126	157,147 ±7,353	173,555 ±19,918
`is_ascii_uppercase`	131,011 ±26,059	135,658 ±19,618	142,463 ±19,191
`is_ascii_whitespace`	232,675 ±40,710	292,750 ±61,629	350,440 ±56,609

Adapted from the output of cargo bench, results are in ns.

Analysis

It seems the hybrid approach with the smaller lookup table was overall as fast or faster than lookup, even though it has to do an extra check (smaller table fits better in cache?).

The hybrid approach was significantly faster than the current branch implementation for:

is_ascii_control (105,160 ±10,883 vs. 154,802 ±17,761)
is_ascii_hexdigit (280,170 ± 17,197 vs. 358,390 ± 21,165)
is_ascii_punctuation (157,147 ±7,353 vs. 372,600 ± 22,126)

Looking at the current implementation for is_ascii_hexdigit and is_ascii_punctuation and compiler output, these two methods have the most complex branches of all the ascii methods, indicating that there is indeed potential for a faster implementation using a lookup table.

Why the hybrid version of is_ascii_control is faster I can not say, maybe caching or an artifact of the way I benchmarked?

Conclusion

From this preliminary investigation it seems that at least some of the ascii methods can be implemented faster, prompting further investigation.

As further steps I propose first evaluating the merits of this benchmark, and then conduct more. There are still a number of questions: are the results of this benchmark correct, or is another factor interfering with the results? Is the benchmark representative of real-world usage? Is the source text influencing the benchmark?

Once we are sure our benchmarks are measuring what we want, we can proceed with the investigation: quantify the trade-off of speed vs. memory, and compare alternate implementations. This information will inform the discussion on whether it makes sense to change the current implementations.

Mark-Simulacrum commented 4 years ago

Please note (I haven't had time to read your entire comment, sorry) that in the case of the methods on char, being in libcore, it is generally pretty important that we keep an eye on the size of the generated code as well as the performance.

CDirkx commented 4 years ago

Yes, that is why I'm also looking into other implementations (without a lookup table). Specifically is_ascii_punctuation, which is currently implemented as:

#[inline]
pub const fn is_ascii_punctuation(&self) -> bool {
   matches!(*self, b'!'..=b'/' | b':'..=b'@' | b'['..=b'`' | b'{'..=b'~')
}

which results in complex branching.

An alternative implementation would be:

#[inline]
pub const fn is_ascii_punctuation(&self) -> bool {
   matches!(*self, b'!'..=b'@' | b'['..=b'`' | b'{'..=b'~') && !matches!(*self, b'0'..=b'9')
}

Which seems to generate a bit more efficient code and in my benchmark results in a slight speedup (although using a lookup table is still twice as fast). There might exist other, even more efficient bit-twiddling implementations.

However, measuring these results still relies upon the quality of my benchmark, of which I'm not a 100% sure. That is why I posted this report, to get some feedback on the best way to measure these things.

Any feedback or help is appreciated, but take the time you need.

leonardo-m commented 4 years ago

What is the effect on the L1 cache pressure?

estebank commented 4 years ago

@CDirkx could you add another approach of a sorted positive char list? The algo complexity of binary searching or scanning the array might not matter given that for is_ascii_punctuation we are talking about 32 bytes.

Edit: I just realized that [bool; 128] should also be 32 bytes.

CDirkx commented 4 years ago

I compared some other implementations, again specifically for is_ascii_punctuation:

branch: The current implementation, branches do not get optimized away.

pub const fn is_ascii_punctuation_branch(byte: &u8) -> bool {
    matches!(*byte, b'!'..=b'/' | b':'..=b'@' | b'['..=b'`' | b'{'..=b'~')
}

branch_adapted: Adapted version of the current implementation. I tried different ranges and order, this was the fastest variant, branches get optimized away.

pub const fn is_ascii_punctuation_branch_adapted(byte: &u8) -> bool {
    // !byte.is_alphanumeric() && byte.is_graphic()
    !matches!(*byte, b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z') && matches!(*byte, b'!'..=b'~')
}

lookup: A lookup table for the entire u8 range.

pub const fn is_ascii_punctuation_lookup(byte: &u8) -> bool {
    const LOOKUP : [bool; 256] = ...;

    LOOKUP[*byte as usize]
}

hybrid: Checking is_ascii and a lookup table for the entire ascii range.

pub const fn is_ascii_punctuation_hybrid(byte: &u8) -> bool {
    const LOOKUP : [bool; 128] = ...;

    byte.is_ascii() && LOOKUP[*byte as usize]
}

linear_search: Linear search through a list of all 32 punctuation characters. Expected to be slow but added for completeness. Even tested with cheating by sorting the characters by frequency in the source text, but performed the worst. I read about faster linear search using SIMD instructions, but such an implementation would not be portable and thus not a great fit for core.

pub fn is_ascii_punctuation_linear_search(byte: &u8) -> bool {
    const LOOKUP : [u8; 32] = ...;

    LOOKUP.contains(byte)
}

binary_search: Binary search through a sorted list of all 32 punctuation characters. Gets optimized to an unrolled, branchless binary search.

pub fn is_ascii_punctuation_binary_search(byte: &u8) -> bool {
    const LOOKUP : [u8; 32] = ...;

    LOOKUP.binary_search(byte).is_ok()
}

Re: @estebank, actually [bool; 128] is still 128 bytes, but this got me thinking about using bitsets to reduce the size of the lookup table:

lookup_bitset: A lookup table for the entire u8 range using a bitset.

pub const fn is_ascii_punctuation_lookup_bitset(byte: &u8) -> bool {
    const LOOKUP : [u8; 32] = ...;

    LOOKUP[(*byte / 8) as usize] >> (*byte % 8) & 1u8 == 1
}

hybrid_bitset: Checking is_ascii and a lookup table for the entire ascii range using a bitset. I tried using a single u128 as a bitset instead of [u8; 16], but that resulted in worse code generation and slower results.

pub const fn is_ascii_punctuation_hybrid_bitset(byte: &u8) -> bool {
    const LOOKUP : [u8; 16] = ...;

    byte.is_ascii() && LOOKUP[(*byte / 8) as usize] >> (*byte % 8) & 1u8 == 1
}

Generated code for comparison: https://rust.godbolt.org/z/mwf4B4.

Benchmark

Same setup as before:

```rust #![feature(test)] #![feature(decl_macro)] extern crate test; // "Hamlet, Prince of Denmark", adapted from https://www.gutenberg.org/files/27761/27761-0.txt const SOURCE_TEXT: &'static str = include_str!("hamlet.txt"); macro bench_impl($bench:expr, $condition:expr) { $bench.iter(|| { let mut total = 0; for byte in SOURCE_TEXT.bytes() { if $condition(&byte) { total += 1; } } assert_eq!(total, 10566); }); } #[bench] fn is_ascii_punctuation_branch(bench: &mut test::Bencher) { bench_impl!(bench, |byte: &u8| { byte.is_ascii_punctuation() }); } #[bench] fn is_ascii_punctuation_branch_adapted(bench: &mut test::Bencher) { bench_impl!(bench, |byte: &u8| { !byte.is_ascii_alphanumeric() && byte.is_ascii_graphic() }); } #[bench] fn is_ascii_punctuation_lookup(bench: &mut test::Bencher) { const LOOKUP : [bool; 256] = [false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false]; bench_impl!(bench, |byte: &u8| { LOOKUP[*byte as usize] }); } #[bench] fn is_ascii_punctuation_hybrid(bench: &mut test::Bencher) { const LOOKUP : [bool; 128] = [false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, false]; bench_impl!(bench, |byte: &u8| { byte.is_ascii() && LOOKUP[*byte as usize] }); } #[bench] fn is_ascii_punctuation_linear_search(bench: &mut test::Bencher) { const LOOKUP : [u8; 32] = [46, 95, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 47, 58, 59, 60, 61, 62, 63, 64, 91, 92, 93, 94, 96, 123, 124, 125, 126]; bench_impl!(bench, |byte: &u8| { LOOKUP.contains(byte) }); } #[bench] fn is_ascii_punctuation_binary_search(bench: &mut test::Bencher) { const LOOKUP : [u8; 32] = [33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 58, 59, 60, 61, 62, 63, 64, 91, 92, 93, 94, 95, 96, 123, 124, 125, 126]; bench_impl!(bench, |byte: &u8| { LOOKUP.binary_search(byte).is_ok() }); } #[bench] fn is_ascii_punctuation_lookup_bitset(bench: &mut test::Bencher) { const LOOKUP : [u8; 32] = [0, 0, 0, 0, 254, 255, 0, 252, 1, 0, 0, 248, 1, 0, 0, 120, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]; bench_impl!(bench, |byte: &u8| { LOOKUP[(*byte / 8) as usize] >> (*byte % 8) & 1u8 == 1 }); } #[bench] fn is_ascii_punctuation_hybrid_bitset(bench: &mut test::Bencher) { const LOOKUP : [u8; 16] = [0, 0, 0, 0, 254, 255, 0, 252, 1, 0, 0, 248, 1, 0, 0, 120]; bench_impl!(bench, |byte: &u8| { byte.is_ascii() && LOOKUP[(*byte / 8) as usize] >> (*byte % 8) & 1u8 == 1 }); } ```

Results

method	time (ns)	memory (bytes)
`branch`	379,555 ±26,571	0
`branch_adapted`	295,031 ±33,882	0
`lookup`	174,720 ±15,657	256
`hybrid`	152,943 ±7,762	128
`linear_search`	732,040 ±55,886	32
`binary_search`	539,240 ±30,553	32
`lookup_bitset`	166,508 ±30,856	32
`hybrid_bitset`	226,147 ±54,628	16

Adapted from the output of cargo bench.

Analysis

branch_adapted is faster than the current implementation branch, without any memory trade-off, since it allows all branches to be optimized away. I can do some more benchmarks using other source texts/other setup to confirm the speedup, but this is currently looking like a good candidate for replacing the current core implementation.

linear_search and binary_search do not result in any performance benefit, too many comparisons and array look-ups compared to the optimized versions of the other implementations I think.

lookup_bitset and hybrid_bitset are interesting, as they are only a little bit slower than lookup and hybrid, but have a 8x reduction in table size, making the speed vs. memory trade-off more favourable.

Conclusion

I think these tests uncovered two valuable techniques:

The current implementation could be rewritten to a version that optimizes better, without needing extra memory.
Reducing lookup table size by using bitsets results in similar better performance.

Now we can test if these techniques are also effective with the other methods.

As before, the results above might be influenced by the design of the benchmark, so I was planning on also doing some further investigation into usage of these methods in the wild, other source texts and other benchmarking tasks.

alex-700 commented 4 years ago

@CDirkx

There is another revision of the hybrid_bitset approach.

pub fn is_ascii_punctuation_hybrid_bitset(byte: &u8) -> bool {
    const LOOKUP: u128 = 79753679825085174867510150554292584448;

    LOOKUP >> *byte & 1 == 1
}

I dropped is_ascii() check, because we can consider that LOOKUP has an infinite number of leading zeros in the binary notation.

LingMan commented 3 years ago

@alex-700 You can't drop the is_ascii check because >> with a rhs that's larger than or equal to the bit size of the lhs is an integer overflow. Your code will panic in debug mode for any byte > 127, and compute the wrong results in release mode: https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=8ff9931dd1f05332c73c5c452d2c18bb

There's also an off-by-one as Z, z, 9, ... surely aren't punctuation.

Edit: Corrected lookup table for that off-by-one is 159507359650170349735020301108585168896

alex-700 commented 3 years ago

@LingMan thanks! Shame on me :(

Adding is_ascii() check or checked_shr() ruin good generated code.

leonardo-m commented 3 years ago

Shame on me :(

Nothing to be ashamed of. It's a mistake. Everyone makes mistakes all the time. Recognize it, learn from it, do better and carry on :-)

gilescope commented 3 years ago

is_ascii_hexdigit - I looked at optimising that by fliping the bit to make it uppercase and using a wrapping_sub but it came out to be the same number of instructions.

gilescope commented 3 years ago

(Remember for core embedded stuff memory is tight. My keyboard for example only has 20kb ram.)

rust-lang / rust