serde-rs / json

Strongly typed JSON library for Rust
Apache License 2.0

SIMD vectorization behind a feature flag #120

Open dtolnay opened 8 years ago

dtolnay commented 8 years ago

RapidJSON does quite a lot of this.

maciejhirsz commented 8 years ago

It would be interesting to see hermetic tests done for that vs iteration with a LUT. You need to:

And then you still need to do something about the remaining len % 16 bytes.
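One way to frame that tail problem (a sketch of mine, not code from the thread): split the input up front so the vector loop only ever sees whole 16-byte chunks and the scalar fallback gets the `len % 16` remainder.

```rust
// Sketch: partition the input so a 16-byte-at-a-time loop can run over
// `head` and a plain byte-by-byte loop handles the `len % 16` `tail`.
fn split_for_simd(src: &[u8]) -> (&[u8], &[u8]) {
    let vectorizable = src.len() - src.len() % 16;
    src.split_at(vectorizable)
}
```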

maciejhirsz commented 8 years ago

I made a very rough test in json-rust. Using this as test data:

{"foo":"This is a super long string, omg so long, why is it so long? It is so long because it needs to contain many multiplications of 16 bytes in order for us to get any benefits from SIMD at all! That's why! It's kind of annoying like that, but hey, what can you do..."}

Without SIMD:

test json_rust_parse ... bench:         414 ns/iter (+/- 143) = 657 MB/s

With SIMD:

test json_rust_parse ... bench:         267 ns/iter (+/- 28) = 1018 MB/s

This is pretty much the best-case scenario for this kind of a thing, but it works. The implementation is pretty simple: before I start the loop that checks bytes one by one, I do this:

        // Consume whole 16-byte chunks that contain nothing special.
        for _ in 0 .. ($parser.length - $parser.index) >> 4 {
            let bytes = simd::u8x16::load($parser.source.as_bytes(), $parser.index);

            // Stop at the first chunk holding a control character (< 0x20),
            // a backslash, or a quote; the scalar loop takes over from here.
            if (bytes.lt(CT_SIMD) | bytes.eq(BS_SIMD) | bytes.eq(QU_SIMD)).any() {
                break;
            }

            $parser.index += 16;
        }

Where CT_SIMD is a splat of 0x20, BS_SIMD is a splat of 0x5C and QU_SIMD is a splat of 0x22 (as you'd expect).
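For reference, here is what each lane of that comparison computes, written as a scalar model of the chunked loop (my sketch with my own function names, not the macro from json-rust):

```rust
// A byte stops the fast path if it is a control character (anything
// below 0x20, matched by the CT_SIMD splat), a backslash (0x5C,
// BS_SIMD), or a quote (0x22, QU_SIMD).
fn needs_special_handling(b: u8) -> bool {
    b < 0x20 || b == b'\\' || b == b'"'
}

// Scalar model of the SIMD loop: advance over whole 16-byte chunks
// until one of them trips the predicate, then let the byte-by-byte
// loop take over from the returned index.
fn skip_plain_chunks(src: &[u8], mut index: usize) -> usize {
    while index + 16 <= src.len() {
        if src[index..index + 16].iter().any(|&b| needs_special_handling(b)) {
            break;
        }
        index += 16;
    }
    index
}
```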

Gonna run this against json-benchmark, will come back with numbers.

maciejhirsz commented 8 years ago

On json-benchmark the numbers are pretty much the same on all tests. The version with SIMD is on the dev branch if you want to try it on your hardware. I only did it for parsing, not serializing.

dtolnay commented 8 years ago

Same here, even a little slower. It may be that unaligned u8x16 loads are slow - you don't have this part that establishes 16-byte alignment. That would also serve to mitigate the impact on parsing short strings.

I would be interested in benchmarking two variants of your code - one that aligns to 16 bytes before the SIMD loop like in RapidJSON, and one that goes 16 bytes past the first 16-byte alignment before starting SIMD. It is all about balancing no-SIMD for short strings (which are most strings) and SIMD for long strings (which are slow otherwise).

maciejhirsz commented 8 years ago

I think the simd lib does the alignment via the repr annotation:

/// A SIMD vector of 16 `u8`s.
#[repr(simd)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[derive(Debug, Copy)]
pub struct u8x16(u8, u8, u8, u8, u8, u8, u8, u8,
                 u8, u8, u8, u8, u8, u8, u8, u8);

I've only been catching up on low-level stuff this year, so things like byte alignment are still semi-exotic to me. I've only just looked this up, and the author of the answer is the author of the simd implementation.

dtolnay commented 8 years ago

I am talking about alignment of $parser.source[$parser.index]. If that is not 16-byte aligned, there is nothing the SIMD library can do about it.

maciejhirsz commented 8 years ago

Gotcha! I assume casting the pointer to usize and taking it modulo 16, then running over that many bytes with the regular check, should be sufficient.

dtolnay commented 8 years ago

Yes.

And then for the other test, that many bytes + 16 more.
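Those two prologue lengths can be sketched like this (function and parameter names are mine, purely for illustration): variant A scans just enough bytes to reach a 16-byte boundary, variant B scans one full 16-byte block past that boundary before the SIMD loop starts.

```rust
// Length of the scalar prologue before the SIMD loop may begin.
// `extra_block = false`: stop at the first 16-byte boundary.
// `extra_block = true`:  go 16 bytes past that boundary.
fn prologue_len(ptr: *const u8, extra_block: bool) -> usize {
    let misalign = ptr as usize % 16;
    let to_boundary = (16 - misalign) % 16;
    if extra_block { to_boundary + 16 } else { to_boundary }
}
```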

maciejhirsz commented 8 years ago

I don't seem to get any statistically significant difference. I increased the string length to 3155 characters.

No SIMD:

test json_rust_parse ... bench:       2,592 ns/iter (+/- 1,071) = 1221 MB/s

SIMD, no alignment:

test json_rust_parse ... bench:         683 ns/iter (+/- 481) = 4633 MB/s

SIMD, with alignment:

test json_rust_parse ... bench:         754 ns/iter (+/- 502) = 4197 MB/s

SIMD, with alignment offset by 1:

test json_rust_parse ... bench:         755 ns/iter (+/- 358) = 4192 MB/s

Pushed to dev along with quickly hacked benchmarks if you want to play around.

maciejhirsz commented 8 years ago

Ok, got it; it seems u8x16::load doesn't actually use memcpy. I reworked it into this mess:

// Unaligned 16-byte load: copy straight from the source into an
// uninitialized vector instead of going through u8x16::load.
// (This all has to live inside an `unsafe` block.)
let mut bytes: simd::u8x16 = mem::uninitialized();
let bytes_ptr: *mut u8 = mem::transmute(&mut bytes);
ptr::copy_nonoverlapping(
    $parser.byte_ptr.offset($parser.index as isize),
    bytes_ptr,
    16
);
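As an aside, on current Rust the same unaligned 16-byte load can be written safely, without `mem::uninitialized` or `transmute` (a minimal sketch of mine, not the json-rust code): copy the window into a fixed-size array first, then hand that to the vector type.

```rust
// Safe unaligned load: copy_from_slice compiles down to a 16-byte
// memcpy, and the resulting array can be fed to a SIMD vector.
// Panics if fewer than 16 bytes remain past `index`.
fn load16(src: &[u8], index: usize) -> [u8; 16] {
    let mut lanes = [0u8; 16];
    lanes.copy_from_slice(&src[index..index + 16]);
    lanes
}
```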

Now I get the expected results. With alignment:

test json_rust_parse ... bench:         625 ns/iter (+/- 321) = 5064 MB/s

With alignment offset 1:

test json_rust_parse ... bench:         651 ns/iter (+/- 367) = 4861 MB/s

Difference isn't staggering, but it's there.

alexbool commented 8 years ago

Wow, that's interesting! I also have to say that all modern (at least Intel's) CPUs have AVX2 in them, which allows 256-bit vectors. It would be nice to try. This must be #[cfg]'d, of course.
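The kind of gating meant here could look like this (a hedged sketch; the constant name is hypothetical): select the scan width at compile time from the target's features, so AVX2 builds get 256-bit chunks and everything else keeps 128-bit ones.

```rust
// Hypothetical compile-time gate: chunk width for the vectorized scan.
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
const SCAN_CHUNK: usize = 32; // 256-bit AVX2 vectors

#[cfg(not(all(target_arch = "x86_64", target_feature = "avx2")))]
const SCAN_CHUNK: usize = 16; // 128-bit vectors everywhere else
```

Building with `RUSTFLAGS="-C target-feature=+avx2"` (or runtime detection via `is_x86_feature_detected!`) would flip which arm is compiled in.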

arthurprs commented 6 years ago

SIMD is on track for stabilization 🎉

ijl commented 5 years ago

Has anyone done anything with this?

mmstick commented 5 years ago

This recently became a thing: https://github.com/lemire/simdjson

Licenser commented 5 years ago

I ported simdjson to Rust; the performance boost SIMD gives is significant. Even skipping the tape tricks and adopting a more Rust-ergonomic data representation, it is up to 300% faster than vanilla serde-json.

JimChenWYU commented 10 months ago

How has the discussion progressed so far? It's been 7 years already.

ifsheldon commented 3 months ago

sonic-rs also shows promising results on json-benchmark. It's developed by ByteDance, and its APIs are very similar to serde_json's.