Performance: Consider replacing lookup tables with match statements or binary search in single byte index

The current technique for building the single byte "forward" and "backward" function is to generate lookup tables using gen_index.py

Here's an example generated file: https://github.com/lifthrasiir/rust-encoding/blob/master/src/index/singlebyte/windows_1252.rs

There are some benchmarks that are generated, but they're micro-benchmarks with synthetic data, and I'm not sure they adequately capture how the library would be used in the wild.

So I wrote a few tiny benchmarks that exercise the encoder and decoder at the level they're typically used.

/// Some Latin-1 text to test
//
// the first few sentences of the article "An Ghaeilge" from Irish Wikipedia.
// https://ga.wikipedia.org/wiki/An_Ghaeilge
pub static IRISH_TEXT: &'static str =
    "Is ceann de na teangacha Ceilteacha í an Ghaeilge (nó Gaeilge na hÉireann mar a thugtar \
     uirthi corruair), agus ceann den dtrí cinn de theangacha Ceilteacha ar a dtugtar na \
     teangacha Gaelacha (.i. an Ghaeilge, Gaeilge na hAlban agus Gaeilge Mhanann) go háirithe. \
     Labhraítear in Éirinn go príomha í, ach tá cainteoirí Gaeilge ina gcónaí in áiteanna eile ar \
     fud an domhain. Is í an teanga náisiúnta nó dhúchais agus an phríomhtheanga oifigiúil i \
     bPoblacht na hÉireann í an Ghaeilge. Tá an Béarla luaite sa Bhunreacht mar theanga oifigiúil \
     eile. Tá aitheantas oifigiúil aici chomh maith i dTuaisceart Éireann, atá mar chuid den \
     Ríocht Aontaithe. Ar an 13 Meitheamh 2005 d'aontaigh airí gnóthaí eachtracha an Aontais \
     Eorpaigh glacadh leis an nGaeilge mar theanga oifigiúil oibre san AE";

pub static RUSSIAN_TEXT: &'static str =
    "Ру?сский язы?к Информация о файле слушать)[~ 3] один из восточнославянских языков, \
     национальный язык русского народа. Является одним из наиболее распространённых языков мира \
     шестым среди всех языков мира по общей численности говорящих и восьмым по численности \
     владеющих им как родным[9]. Русский является также самым распространённым славянским \
     языком[10] и самым распространённым языком в Европе ? географически и по числу носителей \
     языка как родного[7]. Русский язык ? государственный язык Российской Федерации, один из \
     двух государственных языков Белоруссии, один из официальных языков Казахстана, Киргизии и \
     некоторых других стран, основной язык международного общения в Центральной Евразии, в \
     Восточной Европе, в странах бывшего Советского Союза, один из шести рабочих языков ООН, \
     ЮНЕСКО и других международных организаций[11][12][13].";

#[bench]
fn bench_encode_irish(bencher: &mut test::Bencher) {
    bencher.bytes = IRISH_TEXT.len() as u64;
    bencher.iter(|| {
        test::black_box(
            WINDOWS_1252.encode(&ASCII_TEXT, EncoderTrap::Strict)
        )
    })
}

#[bench]
fn bench_decode_irish(bencher: &mut test::Bencher) {
    let bytes = WINDOWS_1252.encode(IRISH_TEXT, EncoderTrap::Strict).unwrap();

    bencher.bytes = bytes.len() as u64;
    bencher.iter(|| {
        test::black_box(
            WINDOWS_1252.decode(&bytes, DecoderTrap::Strict)
        )
    })
}

#[bench]
fn bench_encode_russian(bencher: &mut test::Bencher) {
    bencher.bytes = RUSSIAN_TEXT.len() as u64;
    bencher.iter(|| {
        test::black_box(
            ISO_8859_5.encode(&RUSSIAN_TEXT, EncoderTrap::Strict)
        )
    })
}

#[bench]
fn bench_decode_russian(bencher: &mut test::Bencher) {
    let bytes = ISO_8859_5.encode(RUSSIAN_TEXT, EncoderTrap::Strict).unwrap();

    bencher.bytes = bytes.len() as u64;
    bencher.iter(|| {
        test::black_box(
            ISO_8859_5.decode(&bytes, DecoderTrap::Strict)
        )
    })
}

I picked the windows-1252 encoding because it's similar to the old latin-1 standard and can encode the special characters in the Irish text I grabbed, and iso-8859-5 for similar reasons for the Russian test.

I rewrote gen_index.py to create match statements instead of building a lookup table. You get something like this:

// AUTOGENERATED FROM index-windows-1252.txt, ORIGINAL COMMENT FOLLOWS:
//
// For details on index index-windows-1252.txt see the Encoding Standard
// https://encoding.spec.whatwg.org/
//
// Identifier: e56d49d9176e9a412283cf29ac9bd613f5620462f2a080a84eceaf974cfa18b7
// Date: 2018-01-06
#[inline]
pub fn forward(code: u8) -> Option<u16> {
    match code {
        128 => Some(8364),
        129 => Some(129),
        130 => Some(8218),
        131 => Some(402),
        132 => Some(8222),
        133 => Some(8230),
        134 => Some(8224),
        135 => Some(8225),
        136 => Some(710),
        137 => Some(8240),
        //  a bunch more items
        250 => Some(250),
        251 => Some(251),
        252 => Some(252),
        253 => Some(253),
        254 => Some(254),
        255 => Some(255),
        _ => None
    }
}

#[inline]
pub fn backward(code: u32) -> Option<u8> {
    match code {
        8364 => Some(128),
        129 => Some(129),
        8218 => Some(130),
        402 => Some(131),
        8222 => Some(132),
        8230 => Some(133),
        8224 => Some(134),
        8225 => Some(135),
        710 => Some(136),
        8240 => Some(137),
        352 => Some(138),
        8249 => Some(139),
        338 => Some(140),
        141 => Some(141),
        381 => Some(142),
        //  a bunch more items
        251 => Some(251),
        252 => Some(252),
        253 => Some(253),
        254 => Some(254),
        255 => Some(255),
        _ => None
    }
}

Note that I changed the function signature to return an Option instead of a sentinel value. That wasn't strictly required, and didn't have a large effect on performance, but makes the code more idiomatic, I think.

I also generated a version that uses a binary search. It's pretty simple.

const BACKWARD_KEYS: &'static [u32] = &[
    128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
    147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166,
    167, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 187,
    188, 189, 190, 215, 247, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499,
    1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1513, 1514, 8206,
    8207, 8215
];

const BACKWARD_VALUES: &'static [u8] = &[
    128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
    147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166,
    167, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 187,
    188, 189, 190, 170, 186, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237,
    238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 253, 254, 223
];

#[inline]
pub fn backward(code: u32) -> u8 {
    if let Ok(index) = BACKWARD_KEYS.binary_search(&code) {
        BACKWARD_VALUES[index]
    } else {
        0
    }
}

Here's a table comparing the three techniques (scroll to see entire table):

test	master				match					binary search
codec::singlebyte::tests::bench_decode_irish	3246	ns/iter	240	MB/s	3171	ns/iter	245	MB/s	2.08%
codec::singlebyte::tests::bench_decode_russian	8508	ns/iter	98	MB/s	8890	ns/iter	94	MB/s	-4.08%
codec::singlebyte::tests::bench_encode_irish	2622	ns/iter	310	MB/s	1688	ns/iter	482	MB/s	55.48%	2243	ns/iter	363	MB/s	17.10%
codec::singlebyte::tests::bench_encode_russian	6692	ns/iter	228	MB/s	10578	ns/iter	144	MB/s	-36.84%	10019	ns/iter	152	MB/s	-33.33%

Obviously the Irish / Windows-1252 case is improved with both alternative techniques, but the Russian case is degraded.

It looks like the decode method isn't changed much, and that makes sense, because the match expressions are contiguous integers, I bet that LLVM is optimizing that down to a lookup table anyways.

I'll try running some more tests.

lifthrasiir / rust-encoding

Performance: Consider replacing lookup tables with match statements or binary search in single byte index #125