dtolnay opened 8 years ago
It would be interesting to see hermetic tests done for that vs iteration with a LUT. You need `len >> 4` SIMD iterations (not a big deal), and then you still need to do something about the remaining `len % 16` bytes.
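The chunk/remainder split described above can be sketched with a small hypothetical helper (the function name is made up for illustration, not from any of the libraries discussed):

```rust
// Split a byte count into the number of full 16-byte SIMD chunks
// (`len >> 4`) and the scalar tail (`len % 16`, i.e. `len & 15`).
fn split_chunks(len: usize) -> (usize, usize) {
    (len >> 4, len & 15)
}

fn main() {
    assert_eq!(split_chunks(35), (2, 3)); // two full chunks, 3 tail bytes
    assert_eq!(split_chunks(16), (1, 0)); // exactly one chunk, no tail
    assert_eq!(split_chunks(7), (0, 7));  // too short for SIMD at all
}
```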
I made a very rough test in json-rust, using this as test data:

```json
{"foo":"This is a super long string, omg so long, why is it so long? It is so long because it needs to contain many multiplications of 16 bytes in order for us to get any benefits from SIMD at all! That's why! It's kind of annoying like that, but hey, what can you do..."}
```
Without SIMD:

```
test json_rust_parse ... bench: 414 ns/iter (+/- 143) = 657 MB/s
```

With SIMD:

```
test json_rust_parse ... bench: 267 ns/iter (+/- 28) = 1018 MB/s
```
This is pretty much the best-case scenario for this kind of thing, but it works. The implementation is pretty simple: before I start running the loop checking bytes one by one, I do this:

```rust
for _ in 0 .. ($parser.length - $parser.index) >> 4 {
    let bytes = simd::u8x16::load($parser.source.as_bytes(), $parser.index);

    if (bytes.lt(CT_SIMD) | bytes.eq(BS_SIMD) | bytes.eq(QU_SIMD)).any() {
        break;
    }

    $parser.index += 16;
}
```
Where `CT_SIMD` is a splat of `0x20`, `BS_SIMD` is a splat of `0x5C`, and `QU_SIMD` is a splat of `0x22` (as you'd expect).
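The predicate inside that loop can be sketched in scalar form (a hypothetical helper for illustration, not json-rust's actual code): a byte stops the fast path if it is a control character (below `CT_SIMD`'s `0x20`), a backslash, or a quote.

```rust
// Scalar equivalent of the SIMD check: true if this byte ends the
// "plain string contents" fast path.
fn stops_fast_path(b: u8) -> bool {
    b < 0x20      // control character (lt CT_SIMD)
        || b == 0x5C // backslash (eq BS_SIMD)
        || b == 0x22 // quote (eq QU_SIMD)
}

fn main() {
    assert!(stops_fast_path(b'"'));
    assert!(stops_fast_path(b'\\'));
    assert!(stops_fast_path(b'\n'));
    assert!(!stops_fast_path(b'a'));
}
```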
Gonna run this against json-benchmark, will come back with numbers.
On json-benchmark the numbers are pretty much the same on all tests. The version with SIMD is on the `dev` branch if you want to try it on your hardware. I only did it for parsing, not serializing.
Same here, even a little slower. It may be that unaligned `u8x16` loads are slow; you don't have this part that establishes 16-byte alignment. That would also serve to mitigate the impact on parsing short strings.
I would be interested in benchmarking two variants of your code - one that aligns to 16 bytes before the SIMD loop like in RapidJSON, and one that goes 16 bytes past the first 16-byte alignment before starting SIMD. It is all about balancing no-SIMD for short strings (which are most strings) and SIMD for long strings (which are slow otherwise).
I think the simd lib does the alignment via the repr annotation:

```rust
/// A SIMD vector of 16 `u8`s.
#[repr(simd)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[derive(Debug, Copy)]
pub struct u8x16(u8, u8, u8, u8, u8, u8, u8, u8,
                 u8, u8, u8, u8, u8, u8, u8, u8);
```
I've only been catching up with low-level stuff this year, so things like byte alignment are still semi-exotic to me. I only just looked this up, and the author of the answer is the author of the simd implementation.
I am talking about the alignment of `$parser.source[$parser.index]`. If that is not 16-byte aligned, there is nothing the SIMD library can do about it.
Gotcha! I assume casting the pointer to `usize` and taking it modulo 16, then running over that many bytes with the regular check, should be sufficient.
Yes.
And then for the other test, that many bytes + 16 more.
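The prologue being discussed can be sketched like this (the function name is made up for illustration). One detail worth pinning down: the number of bytes to scan with the scalar check is the distance *up to* the next 16-byte boundary, `(16 - addr % 16) % 16`, rather than `addr % 16` itself.

```rust
// Given the address of the next byte to scan, return how many bytes
// must be handled one-by-one before the address is 16-byte aligned.
fn bytes_until_aligned(addr: usize) -> usize {
    (16 - (addr & 15)) & 15
}

fn main() {
    assert_eq!(bytes_until_aligned(32), 0);  // already aligned
    assert_eq!(bytes_until_aligned(33), 15); // 15 bytes to the boundary
    assert_eq!(bytes_until_aligned(47), 1);  // 1 byte to the boundary
}
```

For the "offset by 16" variant, the SIMD loop would simply start that many bytes plus 16 past the current index.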
I don't seem to get any statistically significant difference. Increased the string length to 3155 characters.
No SIMD:

```
test json_rust_parse ... bench: 2,592 ns/iter (+/- 1,071) = 1221 MB/s
```

SIMD, no alignment:

```
test json_rust_parse ... bench: 683 ns/iter (+/- 481) = 4633 MB/s
```

SIMD, with alignment:

```
test json_rust_parse ... bench: 754 ns/iter (+/- 502) = 4197 MB/s
```

SIMD, with alignment offset by 1:

```
test json_rust_parse ... bench: 755 ns/iter (+/- 358) = 4192 MB/s
```
Pushed to `dev` along with quickly hacked benchmarks if you want to play around.
Ok, got it: it seems `u8x16::load` doesn't actually use memcpy. Reworked into this mess:

```rust
let mut bytes: simd::u8x16 = mem::uninitialized();
let bytes_ptr: *mut u8 = mem::transmute(&mut bytes);

ptr::copy_nonoverlapping(
    $parser.byte_ptr.offset($parser.index as isize),
    bytes_ptr,
    16
);
```
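On current Rust the same unaligned 16-byte load can be written without `mem::uninitialized`, for example with `ptr::read_unaligned` (a sketch under that assumption, not the code on the `dev` branch):

```rust
use std::ptr;

// Read 16 bytes from a possibly-unaligned pointer into a plain array.
// Caller must guarantee at least 16 readable bytes at `src`.
unsafe fn load_16_unaligned(src: *const u8) -> [u8; 16] {
    ptr::read_unaligned(src as *const [u8; 16])
}

fn main() {
    let data = b"0123456789abcdefXYZ";
    // Offset by 1 so the load is deliberately unaligned.
    let v = unsafe { load_16_unaligned(data.as_ptr().add(1)) };
    assert_eq!(&v, b"123456789abcdefX");
}
```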
Now I get the expected results. With alignment:

```
test json_rust_parse ... bench: 625 ns/iter (+/- 321) = 5064 MB/s
```

With alignment offset by 1:

```
test json_rust_parse ... bench: 651 ns/iter (+/- 367) = 4861 MB/s
```
Difference isn't staggering, but it's there.
Wow, that's interesting!
I also have to say that all modern (at least Intel's) CPUs have AVX2, which allows 256-bit vectors. Would be nice to try. This must be `#[cfg]`'d, of course.
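A sketch of such gating (the function name is illustrative): on x86 the standard `is_x86_feature_detected!` macro can pick the vector width at runtime, while the `#[cfg]` guard keeps the code compiling on other architectures.

```rust
// Return the widest SIMD lane width (in bytes) the running CPU supports,
// falling back to 128-bit lanes off x86 or without AVX2.
fn simd_width_bytes() -> usize {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return 32; // 256-bit vectors
        }
    }
    16 // 128-bit vectors
}

fn main() {
    let w = simd_width_bytes();
    assert!(w == 16 || w == 32);
}
```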
SIMD is on track for stabilization 🎉
Has anyone done anything with this?
This recently became a thing: https://github.com/lemire/simdjson
I ported simdjson to Rust; the performance boost SIMD gives is significant. Even skipping the tape tricks and adopting a more Rust-ergonomic data representation, it is up to 300% faster than vanilla serde_json.
How has the discussion progressed so far? It's been 7 years already.
sonic-rs also shows promising results on json_benchmark. It's developed by Bytedance and its APIs are very similar to serde_json.
RapidJSON does quite a lot of this.