lifthrasiir / rust-encoding

Character encoding support for Rust
MIT License
284 stars 59 forks source link

Performance: Consider replacing lookup tables with match statements or binary search in single byte index #125

Open john-parton opened 4 years ago

john-parton commented 4 years ago

The current technique for building the single byte "forward" and "backward" function is to generate lookup tables using gen_index.py

Here's an example generated file: https://github.com/lifthrasiir/rust-encoding/blob/master/src/index/singlebyte/windows_1252.rs

There are some benchmarks that are generated, but they're micro-benchmarks with synthetic data, and I'm not sure they adequately capture how the library would be used in the wild.

So I wrote a few tiny benchmarks that exercise the encoder and decoder at the level they're typically used.

/// Some Latin-1 text to test
//
// the first few sentences of the article "An Ghaeilge" from Irish Wikipedia.
// https://ga.wikipedia.org/wiki/An_Ghaeilge
pub static IRISH_TEXT: &'static str =
    "Is ceann de na teangacha Ceilteacha í an Ghaeilge (nó Gaeilge na hÉireann mar a thugtar \
     uirthi corruair), agus ceann den dtrí cinn de theangacha Ceilteacha ar a dtugtar na \
     teangacha Gaelacha (.i. an Ghaeilge, Gaeilge na hAlban agus Gaeilge Mhanann) go háirithe. \
     Labhraítear in Éirinn go príomha í, ach tá cainteoirí Gaeilge ina gcónaí in áiteanna eile ar \
     fud an domhain. Is í an teanga náisiúnta nó dhúchais agus an phríomhtheanga oifigiúil i \
     bPoblacht na hÉireann í an Ghaeilge. Tá an Béarla luaite sa Bhunreacht mar theanga oifigiúil \
     eile. Tá aitheantas oifigiúil aici chomh maith i dTuaisceart Éireann, atá mar chuid den \
     Ríocht Aontaithe. Ar an 13 Meitheamh 2005 d'aontaigh airí gnóthaí eachtracha an Aontais \
     Eorpaigh glacadh leis an nGaeilge mar theanga oifigiúil oibre san AE";

pub static RUSSIAN_TEXT: &'static str =
    "Ру?сский язы?к Информация о файле слушать)[~ 3] один из восточнославянских языков, \
     национальный язык русского народа. Является одним из наиболее распространённых языков мира \
     шестым среди всех языков мира по общей численности говорящих и восьмым по численности \
     владеющих им как родным[9]. Русский является также самым распространённым славянским \
     языком[10] и самым распространённым языком в Европе ? географически и по числу носителей \
     языка как родного[7]. Русский язык ? государственный язык Российской Федерации, один из \
     двух государственных языков Белоруссии, один из официальных языков Казахстана, Киргизии и \
     некоторых других стран, основной язык международного общения в Центральной Евразии, в \
     Восточной Европе, в странах бывшего Советского Союза, один из шести рабочих языков ООН, \
     ЮНЕСКО и других международных организаций[11][12][13].";

#[bench]
fn bench_encode_irish(bencher: &mut test::Bencher) {
    bencher.bytes = IRISH_TEXT.len() as u64;
    bencher.iter(|| {
        test::black_box(
            WINDOWS_1252.encode(&ASCII_TEXT, EncoderTrap::Strict)
        )
    })
}

#[bench]
fn bench_decode_irish(bencher: &mut test::Bencher) {
    let bytes = WINDOWS_1252.encode(IRISH_TEXT, EncoderTrap::Strict).unwrap();

    bencher.bytes = bytes.len() as u64;
    bencher.iter(|| {
        test::black_box(
            WINDOWS_1252.decode(&bytes, DecoderTrap::Strict)
        )
    })
}

#[bench]
fn bench_encode_russian(bencher: &mut test::Bencher) {
    bencher.bytes = RUSSIAN_TEXT.len() as u64;
    bencher.iter(|| {
        test::black_box(
            ISO_8859_5.encode(&RUSSIAN_TEXT, EncoderTrap::Strict)
        )
    })
}

#[bench]
fn bench_decode_russian(bencher: &mut test::Bencher) {
    let bytes = ISO_8859_5.encode(RUSSIAN_TEXT, EncoderTrap::Strict).unwrap();

    bencher.bytes = bytes.len() as u64;
    bencher.iter(|| {
        test::black_box(
            ISO_8859_5.decode(&bytes, DecoderTrap::Strict)
        )
    })
}

I picked the windows-1252 encoding because it's similar to the old latin-1 standard and can encode the special characters in the Irish text I grabbed, and iso-8859-5 for similar reasons for the Russian test.

I rewrote gen_index.py to create match statements instead of building a lookup table. You get something like this:

// AUTOGENERATED FROM index-windows-1252.txt, ORIGINAL COMMENT FOLLOWS:
//
// For details on index index-windows-1252.txt see the Encoding Standard
// https://encoding.spec.whatwg.org/
//
// Identifier: e56d49d9176e9a412283cf29ac9bd613f5620462f2a080a84eceaf974cfa18b7
// Date: 2018-01-06
#[inline]
pub fn forward(code: u8) -> Option<u16> {
    match code {
        128 => Some(8364),
        129 => Some(129),
        130 => Some(8218),
        131 => Some(402),
        132 => Some(8222),
        133 => Some(8230),
        134 => Some(8224),
        135 => Some(8225),
        136 => Some(710),
        137 => Some(8240),
        //  a bunch more items
        250 => Some(250),
        251 => Some(251),
        252 => Some(252),
        253 => Some(253),
        254 => Some(254),
        255 => Some(255),
        _ => None
    }
}

#[inline]
pub fn backward(code: u32) -> Option<u8> {
    match code {
        8364 => Some(128),
        129 => Some(129),
        8218 => Some(130),
        402 => Some(131),
        8222 => Some(132),
        8230 => Some(133),
        8224 => Some(134),
        8225 => Some(135),
        710 => Some(136),
        8240 => Some(137),
        352 => Some(138),
        8249 => Some(139),
        338 => Some(140),
        141 => Some(141),
        381 => Some(142),
        //  a bunch more items
        251 => Some(251),
        252 => Some(252),
        253 => Some(253),
        254 => Some(254),
        255 => Some(255),
        _ => None
    }
}

Note that I changed the function signature to return an Option instead of a sentinel value. That wasn't strictly required, and didn't have a large effect on performance, but makes the code more idiomatic, I think.

I also generated a version that uses a binary search. It's pretty simple.

const BACKWARD_KEYS: &'static [u32] = &[
    128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
    147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166,
    167, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 187,
    188, 189, 190, 215, 247, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499,
    1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1513, 1514, 8206,
    8207, 8215
];

const BACKWARD_VALUES: &'static [u8] = &[
    128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
    147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166,
    167, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 187,
    188, 189, 190, 170, 186, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237,
    238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 253, 254, 223
];

#[inline]
pub fn backward(code: u32) -> u8 {
    if let Ok(index) = BACKWARD_KEYS.binary_search(&code) {
        BACKWARD_VALUES[index]
    } else {
        0
    }
}

Here's a table comparing the three techniques (scroll to see entire table):

test master       match         binary search        
codec::singlebyte::tests::bench_decode_irish 3246 ns/iter 240 MB/s 3171 ns/iter 245 MB/s 2.08%          
codec::singlebyte::tests::bench_decode_russian 8508 ns/iter 98 MB/s 8890 ns/iter 94 MB/s -4.08%          
codec::singlebyte::tests::bench_encode_irish 2622 ns/iter 310 MB/s 1688 ns/iter 482 MB/s 55.48% 2243 ns/iter 363 MB/s 17.10%
codec::singlebyte::tests::bench_encode_russian 6692 ns/iter 228 MB/s 10578 ns/iter 144 MB/s -36.84% 10019 ns/iter 152 MB/s -33.33%

Obviously the Irish / Windows-1252 case is improved with both alternative techniques, but the Russian case is degraded.

It looks like the decode method isn't changed much, and that makes sense, because the match expressions are contiguous integers, I bet that LLVM is optimizing that down to a lookup table anyways.

I'll try running some more tests.