There are some benchmarks that are generated, but they're micro-benchmarks with synthetic data, and I'm not sure they adequately capture how the library would be used in the wild.
So I wrote a few tiny benchmarks that exercise the encoder and decoder at the level they're typically used.
/// Some Latin-1 text to test
//
// the first few sentences of the article "An Ghaeilge" from Irish Wikipedia.
// https://ga.wikipedia.org/wiki/An_Ghaeilge
pub static IRISH_TEXT: &'static str =
"Is ceann de na teangacha Ceilteacha í an Ghaeilge (nó Gaeilge na hÉireann mar a thugtar \
uirthi corruair), agus ceann den dtrí cinn de theangacha Ceilteacha ar a dtugtar na \
teangacha Gaelacha (.i. an Ghaeilge, Gaeilge na hAlban agus Gaeilge Mhanann) go háirithe. \
Labhraítear in Éirinn go príomha í, ach tá cainteoirí Gaeilge ina gcónaí in áiteanna eile ar \
fud an domhain. Is í an teanga náisiúnta nó dhúchais agus an phríomhtheanga oifigiúil i \
bPoblacht na hÉireann í an Ghaeilge. Tá an Béarla luaite sa Bhunreacht mar theanga oifigiúil \
eile. Tá aitheantas oifigiúil aici chomh maith i dTuaisceart Éireann, atá mar chuid den \
Ríocht Aontaithe. Ar an 13 Meitheamh 2005 d'aontaigh airí gnóthaí eachtracha an Aontais \
Eorpaigh glacadh leis an nGaeilge mar theanga oifigiúil oibre san AE";
pub static RUSSIAN_TEXT: &'static str =
"Ру?сский язы?к Информация о файле слушать)[~ 3] один из восточнославянских языков, \
национальный язык русского народа. Является одним из наиболее распространённых языков мира \
шестым среди всех языков мира по общей численности говорящих и восьмым по численности \
владеющих им как родным[9]. Русский является также самым распространённым славянским \
языком[10] и самым распространённым языком в Европе ? географически и по числу носителей \
языка как родного[7]. Русский язык ? государственный язык Российской Федерации, один из \
двух государственных языков Белоруссии, один из официальных языков Казахстана, Киргизии и \
некоторых других стран, основной язык международного общения в Центральной Евразии, в \
Восточной Европе, в странах бывшего Советского Союза, один из шести рабочих языков ООН, \
ЮНЕСКО и других международных организаций[11][12][13].";
#[bench]
fn bench_encode_irish(bencher: &mut test::Bencher) {
    bencher.bytes = IRISH_TEXT.len() as u64;
    bencher.iter(|| {
        // Encode the Irish sample, matching the byte count reported above.
        test::black_box(
            WINDOWS_1252.encode(&IRISH_TEXT, EncoderTrap::Strict)
        )
    })
}
#[bench]
fn bench_decode_irish(bencher: &mut test::Bencher) {
    let bytes = WINDOWS_1252.encode(IRISH_TEXT, EncoderTrap::Strict).unwrap();
    bencher.bytes = bytes.len() as u64;
    bencher.iter(|| {
        test::black_box(
            WINDOWS_1252.decode(&bytes, DecoderTrap::Strict)
        )
    })
}
#[bench]
fn bench_encode_russian(bencher: &mut test::Bencher) {
    bencher.bytes = RUSSIAN_TEXT.len() as u64;
    bencher.iter(|| {
        test::black_box(
            ISO_8859_5.encode(&RUSSIAN_TEXT, EncoderTrap::Strict)
        )
    })
}
#[bench]
fn bench_decode_russian(bencher: &mut test::Bencher) {
    let bytes = ISO_8859_5.encode(RUSSIAN_TEXT, EncoderTrap::Strict).unwrap();
    bencher.bytes = bytes.len() as u64;
    bencher.iter(|| {
        test::black_box(
            ISO_8859_5.decode(&bytes, DecoderTrap::Strict)
        )
    })
}
I picked the `windows-1252` encoding because it's similar to the old `latin-1` standard and can encode the special characters in the Irish text I grabbed, and `iso-8859-5` for similar reasons for the Russian test.
I rewrote `gen_index.py` to create `match` statements instead of building a lookup table. You get something like this:
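As a minimal sketch of that shape (this is not the actual generator output): the function names `forward` and `backward` follow the terms used in this post, but the signatures and the handful of windows-1252 entries shown are assumptions picked for illustration. A real generated file would have one arm per mapped byte.

```rust
/// Byte -> code point. The arms are contiguous small integers.
/// Only a few windows-1252 entries are shown; everything else is None here.
pub fn forward(code: u8) -> Option<u16> {
    match code {
        0x82 => Some(0x201A), // SINGLE LOW-9 QUOTATION MARK
        0x83 => Some(0x0192), // LATIN SMALL LETTER F WITH HOOK
        0x84 => Some(0x201E), // DOUBLE LOW-9 QUOTATION MARK
        0x85 => Some(0x2026), // HORIZONTAL ELLIPSIS
        _ => None,
    }
}

/// Code point -> byte. The arms are sparse, so the compiler cannot
/// simply index into a dense table.
pub fn backward(code: u32) -> Option<u8> {
    match code {
        0x0192 => Some(0x83),
        0x201A => Some(0x82),
        0x201E => Some(0x84),
        0x2026 => Some(0x85),
        _ => None,
    }
}
```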
Note that I changed the function signature to return an `Option` instead of a sentinel value. That wasn't strictly required, and didn't have a large effect on performance, but makes the code more idiomatic, I think.
I also generated a version that uses a binary search. It's pretty simple.
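Roughly, the binary search variant can be pictured like this. It is a sketch under assumptions, not the generated code: `BACKWARD_TABLE` and `backward_bsearch` are hypothetical names, and the table holds only five windows-1252 entries, sorted by code point.

```rust
// Hypothetical excerpt of a code point -> byte table.
// It must stay sorted by code point for the binary search to be valid.
const BACKWARD_TABLE: &[(u32, u8)] = &[
    (0x0192, 0x83),
    (0x201A, 0x82),
    (0x201E, 0x84),
    (0x2026, 0x85),
    (0x20AC, 0x80),
];

/// Looks up the byte for a code point, or None if it is not in the table.
pub fn backward_bsearch(code: u32) -> Option<u8> {
    BACKWARD_TABLE
        .binary_search_by_key(&code, |&(cp, _)| cp)
        .ok()
        .map(|i| BACKWARD_TABLE[i].1)
}
```

The lookup costs O(log n) comparisons per character, which is one plausible reason it behaves differently from a direct table index on text where nearly every character goes through the non-ASCII path.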
Here's a table comparing the three techniques (scroll to see entire table):
| test | master | match | vs. master | binary search | vs. master |
|---|---|---|---|---|---|
| `codec::singlebyte::tests::bench_decode_irish` | 3246 ns/iter (240 MB/s) | 3171 ns/iter (245 MB/s) | 2.08% | | |
| `codec::singlebyte::tests::bench_decode_russian` | 8508 ns/iter (98 MB/s) | 8890 ns/iter (94 MB/s) | -4.08% | | |
| `codec::singlebyte::tests::bench_encode_irish` | 2622 ns/iter (310 MB/s) | 1688 ns/iter (482 MB/s) | 55.48% | 2243 ns/iter (363 MB/s) | 17.10% |
| `codec::singlebyte::tests::bench_encode_russian` | 6692 ns/iter (228 MB/s) | 10578 ns/iter (144 MB/s) | -36.84% | 10019 ns/iter (152 MB/s) | -33.33% |
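The percentages in the table are consistent with the relative change in throughput (MB/s) against master, so positive means faster. A small sketch of that arithmetic, where `throughput_change_pct` is a hypothetical helper and not part of the benchmark harness:

```rust
/// Relative change in throughput versus the baseline, in percent.
/// Positive values mean the candidate is faster than master.
pub fn throughput_change_pct(master_mb_s: f64, candidate_mb_s: f64) -> f64 {
    (candidate_mb_s - master_mb_s) / master_mb_s * 100.0
}
```

For example, going from 310 MB/s to 482 MB/s works out to (482 - 310) / 310, or about 55.48%, and going from 228 MB/s to 144 MB/s works out to about -36.84%.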
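The Irish / Windows-1252 encode case improves with both alternative techniques (about 55% faster with match, about 17% faster with binary search), but the Russian / ISO-8859-5 encode case gets roughly a third slower with either one.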
It looks like the decode method isn't changed much, and that makes sense: the match arms are over contiguous integers, so I bet LLVM is optimizing them down to a lookup table anyway.
The current technique for building the single-byte "forward" and "backward" functions is to generate lookup tables using `gen_index.py`. Here's an example generated file: https://github.com/lifthrasiir/rust-encoding/blob/master/src/index/singlebyte/windows_1252.rs
I'll try running some more tests.