
SIMD-enabled utf-8 validation #68455

Open yoshuawuyts opened 4 years ago

yoshuawuyts commented 4 years ago

Introduction

The "Parsing Gigabytes of JSON per second" post (ArXiv - langdale, lemire) proposes a novel approach for parsing JSON that is fast enough that on many systems it moves the bottleneck to the disk and network instead of the parser. This is done through the clever use of SIMD instructions.

Something that stood out to me from the post is that JSON is required to be valid utf-8, and they had come up with new algorithms to validate utf-8 using SIMD instructions that function much faster than conventional approaches.

Since rustc does a lot of utf-8 validation (each .rs source file needs to be valid utf-8), it got me curious about what rustc currently does. Validation seems to be done by the following routine:

https://github.com/rust-lang/rust/blob/2f688ac602d50129388bb2a5519942049096cbff/src/libcore/str/mod.rs#L1500-L1618

This doesn't appear to use SIMD anywhere, not even conditionally. But it's run a lot, so I figured it might be worth trying a more efficient algorithm here.
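
For context, the scalar strategy looks roughly like the sketch below (a simplified illustration, not the actual libcore source; among other things it omits the overlong/surrogate second-byte range checks the real code performs):

fn validate_utf8_sketch(bytes: &[u8]) -> Result<(), usize> {
    let mut i = 0;
    while i < bytes.len() {
        // Classify the leading byte to get the sequence width.
        let width = match bytes[i] {
            0x00..=0x7F => 1, // ASCII fast path
            0xC2..=0xDF => 2,
            0xE0..=0xEF => 3,
            0xF0..=0xF4 => 4,
            _ => return Err(i), // invalid leading byte
        };
        if i + width > bytes.len() {
            return Err(i); // truncated sequence
        }
        // Every non-leading byte must be a continuation byte (0b10xxxxxx).
        for j in 1..width {
            if bytes[i + j] & 0xC0 != 0x80 {
                return Err(i);
            }
        }
        i += width;
    }
    Ok(())
}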

Performance improvements

The post "Validating UTF-8 strings using as little as 0.7 cycles per byte" shows about an order of magnitude performance improvement on validating utf-8, going from 8 cycles per byte parsed to 0.7 cycles per byte parsed.

When passing Rust's validation code through the Godbolt compiler explorer, from_utf8_unchecked outputs 7 instructions and from_utf8 outputs 57 instructions. In the case of from_utf8, most instructions seem to occur inside a loop, which makes it likely we'd be able to observe a performance improvement by using a SIMD-enabled utf-8 validation algorithm. Especially since economies of scale apply here -- it's not uncommon for the compiler to parse several million bytes of input in a run, so any improvements would quickly add up.

All examples linked have been compiled with -O -C target-cpu=native.

Also, ecosystem libraries such as serde_json perform utf-8 validation in several locations, so they would likely also benefit from performance improvements to Rust's utf-8 validation routines.

Implementation

Two Rust implementations of Lemire's algorithm are known to exist today:

The latter even includes benchmarks against the compiler's algorithm (which makes it probable I'm not the first person to think of this). But I haven't been able to successfully compile the benches, so I don't know how they stack up against the current implementation.

I'm not overly familiar with rustc's internals, but it seems we would likely want to keep the current algorithm as a fallback and enable SIMD algorithms through feature detection. The simdjson library has different algorithms for different architectures, but we could probably start with instructions that are widely available and supported on tier-1 targets (such as AVX2).
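
For illustration, the dispatch might look like the sketch below (a hand-wavy sketch with stand-in bodies, not a real implementation; whether libcore can do runtime detection at all is a separate question):

fn validate_fast(bytes: &[u8]) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just verified at runtime.
            return unsafe { validate_avx2(bytes) };
        }
    }
    validate_scalar(bytes)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn validate_avx2(bytes: &[u8]) -> bool {
    // Stand-in body: a real implementation would use the
    // core::arch::x86_64 AVX2 intrinsics here.
    std::str::from_utf8(bytes).is_ok()
}

fn validate_scalar(bytes: &[u8]) -> bool {
    // Today's scalar algorithm would live here.
    std::str::from_utf8(bytes).is_ok()
}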

These changes wouldn't require an RFC because no APIs would change. The only outcome would be a performance improvement.

Future work

Lemire's post also covers validating ASCII in as little as 0.1 cycles per byte. Rust's current ASCII validation algorithm checks bytes one at a time, and could likely benefit from similar optimizations.

https://github.com/rust-lang/rust/blob/2f688ac602d50129388bb2a5519942049096cbff/src/libcore/str/mod.rs#L4136-L4141
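
Even without SIMD intrinsics there is headroom here; a word-at-a-time check is a common trick (a sketch, not the libcore code): eight bytes are all ASCII iff none of them has its high bit set, which a single mask test on a u64 decides.

use std::convert::TryInto;

fn is_ascii_wordwise(bytes: &[u8]) -> bool {
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;
    let mut chunks = bytes.chunks_exact(8);
    for chunk in chunks.by_ref() {
        // A whole 8-byte word is ASCII iff no high bit is set.
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        if word & HIGH_BITS != 0 {
            return false;
        }
    }
    // Check the final partial chunk byte by byte.
    chunks.remainder().iter().all(|&b| b < 0x80)
}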

Speeding this up would likely have ecosystem implications as well. For example, HTTP headers must be valid ASCII and are often performance-sensitive; if the stdlib sped up ASCII validation, the wider ecosystem would benefit too.

Conclusion

In this issue I propose using a SIMD-enabled algorithm for utf-8 validation in rustc. This seems like an interesting avenue to explore, since there's a reasonable chance it might yield a performance improvement for many Rust programs.

I'm somewhat excited to have stumbled upon this, but was also surprised no issue had been filed for it yet. I'm a bit self-conscious posting this since I'm not a rustc compiler engineer, but I hope it proves useful!

cc/ @jonas-schievink @nnethercote

CryZe commented 4 years ago

assembly for str::from_utf8 (godbolt) - 57 lines

This does not do any validation; it just calls core::str::from_utf8, which is not part of the assembly shown.

yoshuawuyts commented 4 years ago

@CryZe ah, my bad. Yeah I was confused about the exact output, so for good measure I also copied over std's algorithm into godbolt (third link) to see what would happen. Thanks for clarifying what's actually going on!

Mark-Simulacrum commented 4 years ago

UTF-8 validation is hopefully not where the compiler spends its time :)

However, I could imagine this having some impact on "smallest possible" compile times (e.g., UI tests, hello world).

My recommendation is to replace the algorithm in core::str::from_utf8 (or wherever it is in core) with direct use of AVX2 or some similar instruction set, and we can then run that by perf.rust-lang.org as a loose benchmark. That's likely not tenable in reality (we would need to gate use of SIMD instructions conditionally, likely at runtime), but I believe it would give the best possible performance wins, since it would apply to all uses of from_utf8.

CryZe commented 4 years ago

Here's pretty much a 1:1 port: https://godbolt.org/z/NSop8w

Licenser commented 4 years ago

Adding the benchmarks from isutf8 (run on my machine with RUSTFLAGS='-C target-cpu=native' cargo +nightly bench):

     Running target/release/deps/validate_utf8-6efda746d5fb1b3a
Gnuplot not found, disabling plotting
random_bytes/libcore    time:   [4.9977 ns 5.0048 ns 5.0117 ns]                                  
                        thrpt:  [ 97428 GiB/s  97562 GiB/s  97702 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
random_bytes/lemire_sse time:   [69.970 us 69.985 us 70.000 us]                                    
                        thrpt:  [6.9755 GiB/s 6.9769 GiB/s 6.9784 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
random_bytes/lemire_avx time:   [62.055 us 62.084 us 62.109 us]                                    
                        thrpt:  [7.8617 GiB/s 7.8649 GiB/s 7.8685 GiB/s]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild
random_bytes/lemire_avx_ascii_path                                                                            
                        time:   [62.549 us 62.582 us 62.615 us]
                        thrpt:  [7.7982 GiB/s 7.8023 GiB/s 7.8064 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
random_bytes/range_sse  time:   [62.763 us 62.772 us 62.782 us]                                   
                        thrpt:  [7.7774 GiB/s 7.7787 GiB/s 7.7798 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low severe
  2 (2.00%) high mild
random_bytes/range_avx  time:   [45.737 us 45.746 us 45.755 us]                                    
                        thrpt:  [10.672 GiB/s 10.674 GiB/s 10.676 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

mostly_ascii/libcore    time:   [166.94 ns 167.23 ns 167.54 ns]                                 
                        thrpt:  [17.988 GiB/s 18.022 GiB/s 18.053 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
mostly_ascii/lemire_sse time:   [430.20 ns 430.45 ns 430.68 ns]                                    
                        thrpt:  [6.9977 GiB/s 7.0014 GiB/s 7.0054 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
mostly_ascii/lemire_avx time:   [382.89 ns 383.07 ns 383.25 ns]                                    
                        thrpt:  [7.8637 GiB/s 7.8673 GiB/s 7.8711 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
mostly_ascii/lemire_avx_ascii_path                                                                            
                        time:   [65.208 ns 65.255 ns 65.311 ns]
                        thrpt:  [46.145 GiB/s 46.184 GiB/s 46.218 GiB/s]
Found 24 outliers among 100 measurements (24.00%)
  2 (2.00%) low severe
  12 (12.00%) low mild
  2 (2.00%) high mild
  8 (8.00%) high severe
mostly_ascii/range_sse  time:   [384.43 ns 384.64 ns 384.87 ns]                                   
                        thrpt:  [7.8307 GiB/s 7.8352 GiB/s 7.8395 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
mostly_ascii/range_avx  time:   [286.05 ns 286.26 ns 286.47 ns]                                   
                        thrpt:  [10.521 GiB/s 10.528 GiB/s 10.536 GiB/s]

ascii/libcore           time:   [72.975 ns 73.076 ns 73.189 ns]                          
                        thrpt:  [39.307 GiB/s 39.368 GiB/s 39.422 GiB/s]
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe
ascii/lemire_sse        time:   [423.11 ns 423.35 ns 423.62 ns]                             
                        thrpt:  [6.7912 GiB/s 6.7954 GiB/s 6.7993 GiB/s]
ascii/lemire_avx        time:   [373.82 ns 374.45 ns 375.43 ns]                             
                        thrpt:  [7.6628 GiB/s 7.6830 GiB/s 7.6958 GiB/s]
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) low mild
  1 (1.00%) high severe
ascii/lemire_avx_ascii_path                                                                            
                        time:   [50.353 ns 50.588 ns 50.925 ns]
                        thrpt:  [56.492 GiB/s 56.869 GiB/s 57.133 GiB/s]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
ascii/range_sse         time:   [375.11 ns 375.87 ns 376.96 ns]                            
                        thrpt:  [7.6318 GiB/s 7.6538 GiB/s 7.6695 GiB/s]
Found 35 outliers among 100 measurements (35.00%)
  23 (23.00%) low severe
  8 (8.00%) high mild
  4 (4.00%) high severe
ascii/range_avx         time:   [272.39 ns 272.59 ns 272.82 ns]                            
                        thrpt:  [10.545 GiB/s 10.554 GiB/s 10.562 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

utf8/libcore            time:   [9.0154 us 9.0263 us 9.0389 us]                          
                        thrpt:  [1.7096 GiB/s 1.7119 GiB/s 1.7140 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
utf8/lemire_sse         time:   [2.1554 us 2.1568 us 2.1581 us]                             
                        thrpt:  [7.1601 GiB/s 7.1645 GiB/s 7.1693 GiB/s]
Found 15 outliers among 100 measurements (15.00%)
  11 (11.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
utf8/lemire_avx         time:   [1.9184 us 1.9188 us 1.9192 us]                             
                        thrpt:  [8.0515 GiB/s 8.0530 GiB/s 8.0547 GiB/s]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
utf8/lemire_avx_ascii_path                                                                             
                        time:   [1.4670 us 1.4679 us 1.4691 us]
                        thrpt:  [10.518 GiB/s 10.527 GiB/s 10.534 GiB/s]
utf8/range_sse          time:   [1.9426 us 1.9452 us 1.9491 us]                            
                        thrpt:  [7.9280 GiB/s 7.9439 GiB/s 7.9544 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
utf8/range_avx          time:   [1.4656 us 1.4733 us 1.4833 us]                            
                        thrpt:  [10.418 GiB/s 10.489 GiB/s 10.544 GiB/s]
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  9 (9.00%) high severe

Benchmarking all_utf8/libcore: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.7s or reduce sample count to 50
all_utf8/libcore        time:   [1.8757 ms 1.8790 ms 1.8832 ms]                              
                        thrpt:  [2.1672 GiB/s 2.1721 GiB/s 2.1759 GiB/s]
all_utf8/lemire_sse     time:   [586.47 us 586.93 us 587.47 us]                                
                        thrpt:  [6.9474 GiB/s 6.9538 GiB/s 6.9593 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
all_utf8/lemire_avx     time:   [517.80 us 518.97 us 521.39 us]                                
                        thrpt:  [7.8279 GiB/s 7.8644 GiB/s 7.8822 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
all_utf8/lemire_avx_ascii_path                                                                            
                        time:   [523.97 us 524.27 us 524.63 us]
                        thrpt:  [7.7796 GiB/s 7.7849 GiB/s 7.7894 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
all_utf8/range_sse      time:   [525.94 us 527.21 us 528.57 us]                               
                        thrpt:  [7.7216 GiB/s 7.7415 GiB/s 7.7601 GiB/s]
Found 21 outliers among 100 measurements (21.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  8 (8.00%) high mild
  11 (11.00%) high severe
all_utf8/range_avx      time:   [392.25 us 392.91 us 393.80 us]                               
                        thrpt:  [10.364 GiB/s 10.388 GiB/s 10.405 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high severe

all_utf8_with_garbage/libcore                                                                             
                        time:   [3.6752 ns 3.7353 ns 3.8034 ns]
                        thrpt:  [1137275 GiB/s 1158024 GiB/s 1176952 GiB/s]
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  8 (8.00%) high severe
all_utf8_with_garbage/lemire_sse                                                                            
                        time:   [616.73 us 616.89 us 617.03 us]
                        thrpt:  [7.0103 GiB/s 7.0119 GiB/s 7.0136 GiB/s]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high severe
all_utf8_with_garbage/lemire_avx                                                                            
                        time:   [551.11 us 552.09 us 554.06 us]
                        thrpt:  [7.8070 GiB/s 7.8349 GiB/s 7.8488 GiB/s]
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) low severe
  9 (9.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
all_utf8_with_garbage/lemire_avx_ascii_path                                                                            
                        time:   [557.08 us 557.28 us 557.51 us]
                        thrpt:  [7.7587 GiB/s 7.7618 GiB/s 7.7646 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
all_utf8_with_garbage/range_sse                                                                            
                        time:   [554.79 us 555.10 us 555.42 us]
                        thrpt:  [7.7879 GiB/s 7.7924 GiB/s 7.7967 GiB/s]
all_utf8_with_garbage/range_avx                                                                            
                        time:   [417.05 us 417.38 us 417.74 us]
                        thrpt:  [10.355 GiB/s 10.364 GiB/s 10.372 GiB/s]
Found 13 outliers among 100 measurements (13.00%)
  13 (13.00%) low mild
Licenser commented 4 years ago

I re-read parts of the Lemire algorithm, and there are some key differences which might make it less suitable for general string validation.

The key points are:

  1. It's built just to validate, not to locate errors, so there is no tracking of where in the string an error occurs.
  2. Worse, it defers testing for errors until it has checked the entire input. This makes a lot of sense in its original context, since a JSON parser can safely assume that most input is valid JSON and errors are an edge case. It also explains why the stdlib implementation is a lot faster on the benchmarks with invalid data.
  3. It requires the input length to be divisible by 256 bytes; if it is not, it copies the last part into a temporary buffer.
  4. It allocates a struct for each check.

I think™ this might make it slower for small payloads.

ArniDagur commented 4 years ago

I'm at work now, so I will read this thread later in more detail, but here are some things I want to say:

The latter even includes benchmarks against the compiler's algorithm (which makes it probable I'm not the first person to think of this).

Indeed. I created the crate originally to be included in the Rust internals. I still feel that some of the algorithms need a little bit of refactoring to make them more idiomatic, before being included. I also want to add @Zwegner's algorithm (https://github.com/zwegner/faster-utf8-validator), which seems to be the fastest algorithm that has been invented to date.

But I haven't been able to successfully compile the benches, so I don't know how they stack up against the current implementation.

If the build fails, can you open an issue? :) I would appreciate that.

argnidagur/rust-isutf8

Also: You misspelled my name ;)

CryZe commented 4 years ago

It requires the input length to be divisible by 256 bytes; if it is not, it copies the last part into a temporary buffer.

Is that the AVX version or something? The SSE version just allocates 16 bytes on the stack for the remaining <16 bytes.

It allocates a struct for each check.

I don't even know what this means. Are we talking about this one?

struct processed_utf_bytes {
  __m128i rawbytes;               // the 16 input bytes of the current chunk
  __m128i high_nibbles;           // each input byte shifted right by 4
  __m128i carried_continuations;  // continuation-byte state carried across chunks
};

This one gets completely optimized away.

Licenser commented 4 years ago

Is that the AVX version or something? The SSE version just allocates 16 bytes.

Sorry, a bit tired, bit not byte :)

This one gets completely optimized away.

Good point

CryZe commented 4 years ago

Since this would have to go into core, can core even use runtime checks for target features yet (we need at least SSE 4.1 it seems)?

ArniDagur commented 4 years ago

In reply to @Licenser:

It's built just to validate, not to locate errors, so there is no tracking of where in the string an error occurs.

I have thought about this and we have a few options:

  1. Have, as a compile-time feature, functions which check for errors on every iteration of the loop, so the error position can be narrowed down. This should cost around 5% performance. If we wrapped the check in an unlikely! macro (maybe behind a compile-time feature called optimize_for_correct), this number would probably decrease. The 5% number comes from Zwegner's findings IIRC.
  2. First go through all the text to see if it's valid; we can plausibly assume that most text is valid UTF-8. Then, in case of errors, search for the error using scalar code or some variation of option 1 (see the sketch below).

We would need to think more about this.
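
A minimal sketch of option 2, assuming a hypothetical simd_validate that only answers valid/invalid for the whole buffer:

fn from_utf8_fast(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    if simd_validate(bytes) {
        // SAFETY: the whole buffer was just validated as UTF-8.
        Ok(unsafe { std::str::from_utf8_unchecked(bytes) })
    } else {
        // Cold path: rerun the scalar validator, which reports the
        // exact byte offset of the error.
        std::str::from_utf8(bytes)
    }
}

// Stand-in for the hypothetical whole-buffer SIMD check.
fn simd_validate(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}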

ArniDagur commented 4 years ago

Since this would have to go into core, can core even use runtime checks for target features yet (we need at least SSE 4.1 it seems)?

It can be compile-time only for now. Those who self-compile (such as myself) would see an improvement.

zwegner commented 4 years ago

Hey, thanks for the shout out. I want to note that I have made several improvements to my code that aren't present in the master branch (but are in the wip branch), and it is significantly faster than when it was published (about 40%). Most importantly, I was able to completely elide the separate check for the first continuation byte.

When not memory-bandwidth bound, my algorithm is now over 2x faster (in my tests, on AVX2) than the one in Daniel Lemire's original repo (the one described in the blog post), and my SSE4 path is faster than the AVX2 path of that repo. The algorithm used in simdjson has made some improvements, but last I checked I think my algorithm is still faster.

I still need to finish writing up documentation for the new algorithm. The code's definitely more hairy, from dealing with more architectures (AVX-512 and NEON), handling pointer alignment, generating lookup tables with a Python script, etc... But I'm happy to help out if anyone wants to use it/port it.

pickfire commented 4 years ago

@Licenser Are you still working on this?

Licenser commented 4 years ago

I ported the validator: https://github.com/simd-lite/faster-utf8-validator-rs. But from what I understand, this all falls apart because the stdlib can't use native CPU features under the hood?

hanna-kruppe commented 4 years ago

It would need to be gated behind runtime CPUID checks, or left disabled by default behind a cfg(target_feature=...) that is only true if users rebuild libstd with those target features enabled (currently an unstable Cargo feature, but it should get more accessible in the future). The latter is easy but only helps software that can afford to run exclusively on newer processors; the former faces the challenge of not regressing performance, but would help more users if it worked.

ifuncs or similar mechanisms might also be an option on certain platforms, but they're not very portable and have obscure and hard-to-satisfy constraints. Manually emulating them with function pointers initialized on first call might have more overhead, not sure.
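
For reference, the compile-time route looks roughly like this (a sketch with stand-in bodies; the first variant is only selected when the crate graph, including a rebuilt libstd, is compiled with something like RUSTFLAGS="-C target-feature=+avx2"):

#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
fn validate(bytes: &[u8]) -> bool {
    // AVX2 is statically guaranteed here, so intrinsics would need no
    // runtime check at all. Stand-in body.
    std::str::from_utf8(bytes).is_ok()
}

#[cfg(not(all(target_arch = "x86_64", target_feature = "avx2")))]
fn validate(bytes: &[u8]) -> bool {
    // Portable fallback compiled in every other configuration.
    std::str::from_utf8(bytes).is_ok()
}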

bprosnitz commented 4 years ago

My understanding is that benchmarks can be misleading for AVX (and possibly SSE) in general purpose code, because:

  * AVX has a startup time that won't be measured when repeatedly looping in a microbenchmark
  * AVX draws a lot of power and can end up slowing down other cores due to power draw and increased heat

Curious what others think about the suitability? Does it make sense only for sufficiently large strings?

milkey-mouse commented 4 years ago

This article goes into detail on when it makes sense to use AVX (AVX-512 especially). The most relevant parts:

Downclocking, when it happens, is per core and for a short time after you have used particular instructions (e.g., ~2ms).

The bar for light AVX-512 is lower. Even if the work is spread on all cores, you may only get a 15% frequency reduction on some chips like a Xeon Gold. So you only have to check that AVX-512 gives you a greater than 15% gain for your overall application on a per-cycle basis.

Some older CPUs downclock all cores when any of them is using AVX2 instructions, but on newer ones the downclocking mostly affects only the core running them. Also, the SIMD instructions used for string validation would fall into the "light" category, as they don't involve the floating-point unit.

As @bprosnitz mentioned, microbenchmarks should be taken with a grain of salt, but they certainly imply there is something to be gained from an accelerated validator.

lemire commented 4 years ago

Note that we have an updated validator called lookup 4...

https://github.com/simdjson/simdjson/pull/993

It is going to be really hard to beat.

@bprosnitz

 My understanding is that benchmarks can be misleading for AVX (and possibly SSE) in general purpose code, because: AVX has a startup time that won't be measured when repeatedly looping in a microbenchmark; AVX draws a lot of power and can end up slowing down other cores due to power draw and increased heat. Curious what others think about the suitability? Does it make sense only for sufficiently large strings?

Firstly, let us set aside AVX-512 and "heavy" (numerical) AVX2 instructions. They have their uses (e.g., in machine learning, simulation). But that's probably not what you have in mind.

This being said...

Regarding power usage, it is generally true that the faster code is the code that uses less energy. So if you can multiply your speed using NEON, SSE, AVX, go ahead. You'll come up on top. It is a bit like being concerned with climate change and observing that buses and trains use more energy than cars. They use more energy in total, but less energy per work done. So you have to hold the work constant if you are going to make comparisons. Does it take more energy to do 4 additions, or to use one instruction that does 4 additions at once?

So SIMD instructions are the public transportation of computing. They are green and should be used as much as possible. (Again, I am setting aside AVX-512 and numerical AVX2 instructions that are something more controversial.)

Regarding the fear that SIMD instructions are somehow exotic and rare, and that if you ever use it, you will trigger a chain reaction of slowness... You are using AVX all the time... Read this commit where the committer identified that the hot function in his benchmark was __memmove_avx_unaligned_erms. You can bet that this function is AVX-based. The Golang runtime uses AVX, Glibc uses AVX, LLVM uses AVX, Java, and so forth. Even PHP uses SIMD instructions for some string algorithms. And yes, Rust programs use AVX or other SIMD instructions.

lemire commented 4 years ago

@hanna-kruppe

Manually emulating them with function pointers initialized on first call might have more overhead

I don't think so. We have a well tested approach. It is as simple as that...

internal::atomic_ptr<const implementation> active_implementation{&detect_best_supported_implementation_on_first_use_singleton};

const implementation *detect_best_supported_implementation_on_first_use::set_best() const noexcept {
  return active_implementation = available_implementations.detect_best_supported();
}

So you just need an atomic function pointer.

Obviously, you do not get inlining, but that's about the only cost. Loading an atomic pointer is no more expensive, really, than loading an ordinary pointer. So this is pretty much free... except for the first time. You pay a price on first use, but that's not a fundamental limitation: you could set the best function any time, including at startup.
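
A rough Rust analogue of the same pattern (a sketch, not simdjson's or std's actual code; std's OnceLock stands in for the hand-rolled atomic pointer, and the validator bodies are stand-ins):

use std::sync::OnceLock;

type ValidateFn = fn(&[u8]) -> bool;

fn validate_scalar(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn validate_avx2(bytes: &[u8]) -> bool {
    // Stand-in: a real version would use AVX2 intrinsics behind
    // #[target_feature(enable = "avx2")].
    std::str::from_utf8(bytes).is_ok()
}

fn validate(bytes: &[u8]) -> bool {
    // Resolved once on first call; afterwards each call is a load of the
    // cached pointer plus an indirect call (no inlining, as noted above).
    static ACTIVE: OnceLock<ValidateFn> = OnceLock::new();
    let f = ACTIVE.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                return validate_avx2 as ValidateFn;
            }
        }
        validate_scalar as ValidateFn
    });
    f(bytes)
}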

zwegner commented 4 years ago

Note that we have an updated validator called lookup 4...

simdjson/simdjson#993

It is going to be really hard to beat.

I mentioned this off-hand on Twitter, but to clarify, in my benchmarks of pure UTF-8 validation, my code with scalar continuation byte checks and unaligned loads is still faster than lookup4 (or at least my translation of lookup4 to C). The difference depends on compiler (on my HSW, testing 1M of UTF-8, GCC gives .227 (me) vs .319 (L4) cycles/byte, while LLVM has .240 (me) vs .266 (L4)).

The picture gets more complicated in the context of simdjson, when plenty of other code is competing for the same execution ports (and the best algorithm is less clear), but I think in the case of Rust, the pure UTF-8 microbenchmarks are probably more representative.

milkey-mouse commented 4 years ago

@lemire to clarify, my benchmarks are already using the lookup4 implementation in simdjson (from the pull request's branch).

lemire commented 4 years ago

@milkey-mouse Fantastic.

cc @jkeiser

lemire commented 4 years ago

Blog post: The cost of runtime dispatch.

jyn514 commented 4 years ago

Rust's current ASCII validation algorithm checks bytes one at a time, and could likely benefit from similar optimizations.

This was recently implemented in #74066.

lemire commented 4 years ago

Of some relevance, we are publishing a research paper on the topic:

John Keiser, Daniel Lemire, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice & Experience (to appear)

hkratz commented 3 years ago

I have just released simdutf8, a std::str::from_utf8()-compatible SIMD UTF-8 validation crate, based on @lemire's lookup4 algorithm, but slightly modified. x86/x86-64 support only so far, but armv7/armv8 neon support is planned.

mlindner commented 3 years ago

Apologies if this was already discussed and I missed it, but why is runtime detection needed at all? That seems fundamentally wrong. SIMD use should be opted into at compile time based on the target system. Runtime detection should be left to the developer in the form of fat binaries, not the library. That's how everything I know of uses SIMD unless it's doing some kind of JIT, but there's no need for a JIT here.

If the issue is specifically with x86_64 and older CPUs, you still shouldn't need runtime detection: multiple assembly implementations can be provided based on the instruction sets supported at build time. If the person using the library wants to distribute a binary, they can specify a CPU version to the compiler and build for that CPU as opposed to a faster one.

Edit: Reading elsewhere, apparently the core library is only available in binary form? Is that correct? Cross-compilation seems impossible if that were really the case. It sounds like the correct answer here is to create a new issue to allow Rust to rebuild the core library as part of the optimized build. Adding runtime detection to the core library seems like barking up the completely wrong tree. Rebuilding would also allow optimized implementations of other algorithms in the core library, as well as vector optimizations at build time. (To be honest, if the core library is actually distributed in binary form, I don't know how it got to be this way; it should get built once, the first time anything is built on your system, and the cached build result used from then on.)

bjorn3 commented 3 years ago

Reading elsewhere apparently the core library is only available in binary form? Is that correct? Cross-compilation seems impossible if that were really the case. It sounds like the correct answer here is to create a new issue to allow Rust to rebuild the core library as part of the optimized build.

It is possible to recompile the standard library yourself; the unstable -Zbuild-std flag of cargo makes this a lot easier. However, this increases compilation time, and unless you also use -Ctarget-cpu to set a minimum CPU requirement for your program, SIMD will still not be used. Most of the time people don't use -Ctarget-cpu in combination with -Zbuild-std and are thus stuck with the slow validation method. Since it is merely slow and not broken, people are unlikely to realize they could use -Ctarget-cpu and -Zbuild-std (and on stable they can't use -Zbuild-std at all yet). As for cross-compilation, -Zbuild-std is made for this, but you can also use rustup target add to get the standard library compiled for any supported target.

(To be honest if the core library is actually distributed in binary form, I don't know how it got to be this way, it should get built once upon the first time building anything on your system and then the cached build result used from thence forward.)

If -Ctarget-cpu=native were used by default, that would make executables compiled with rustc non-portable as they will only work on your system or newer systems. In addition -Ctarget-cpu=native simply doesn't work for cross-compilation as you aren't targeting your native cpu.
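
For reference, a typical invocation of the unstable flags mentioned above looks like this (the target triple is just an example; -Zbuild-std requires an explicit --target):

RUSTFLAGS="-C target-cpu=native" \
    cargo +nightly build --release -Z build-std --target x86_64-unknown-linux-gnu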

mlindner commented 3 years ago

Thanks for explaining, but this all seems backwards to me. -Ctarget-cpu=native should be the default, and if people want to distribute a binary, additional options should be specified for the target platform. The fact that rebuilding the standard library is even an unstable option also seems backwards. The standard library should at the very least get built every time you build a project (with cross-project caching of the native version, and local caching if a non-native version is used).

pickfire commented 3 years ago

Thanks for explaining, but this all seems backwards to me. -Ctarget-cpu=native should be the default, and if people want to distribute a binary, additional options should be specified for the target platform. The fact that rebuilding the standard library is even an unstable option also seems backwards. The standard library should at the very least get built every time you build a project (with cross-project caching of the native version, and local caching if a non-native version is used).

I think you should consider that, from a packaging standpoint, we would not want to package so many binaries that each can only be used on the particular processor certain people have; instead of creating one binary we would have to create hundreds. This is also why Linux distributions build with only a very general SIMD baseline that almost everyone has. With runtime detection, a single binary can use the faster version at the cost of a check when the application starts, without many drawbacks; I believe the performance cost is insignificant compared to the cost of building and keeping binaries for every specific CPU. This is also how ripgrep does SIMD: via runtime detection. Try discussing it with a package distributor and you will understand better why this is the case.

BurntSushi commented 3 years ago

That's how everything I know of uses SIMD unless it's doing some kind of JIT, but there's no need for a JIT here.

To follow up with what @pickfire said:

This is also how ripgrep does SIMD: via runtime detection.

And ripgrep is in good company. glibc does the same thing.

ripgrep does detection via its dependencies.

mlindner commented 3 years ago

Perhaps my view is just too narrow, as I'm used to dealing only with fixed server platforms (targeting one specific CPU) where we build everything destined for the system, and my local macOS machine, which has a constrained CPU instruction set (and fat binaries for ARM vs x86_64). Either way, I think compile-time selection should never be excluded as an option for people who want it. Thanks for taking the time to explain.

BurntSushi commented 3 years ago

@mlindner Yeah indeed. If I only had to deploy to a fixed set of server platforms where I knew exactly what ISA extensions were needed, then I'd totally skip the complexity of runtime CPU feature detection and just rely on compile time.

pickfire commented 3 years ago

And ripgrep is in good company. glibc does the same thing.

So I guess feature detection will be done multiple times? Wouldn't it be better to only have the cost of feature detection once across all crates?

hkratz commented 3 years ago

So I guess feature detection will be done multiple times? Wouldn't it be better to only have the cost of feature detection once across all crates?

Feature detection is cached in std. Also, most crates use an atomic function pointer, so the decision about which implementation to use is made only once.

bjorn3 commented 3 years ago

If you use -Ctarget-cpu or -Ctarget-feature to enable target features, "runtime" detection through is_x86_feature_detected!() compiles to true for those target features known to be enabled at compile time, without any runtime detection taking place at all.

hkratz commented 3 years ago

If you use -Ctarget-cpu or -Ctarget-feature to enable target features, "runtime" detection through is_x86_feature_detected!() compiles to true for those target features known to be enabled at compile time, without any runtime detection taking place at all.

That has never worked for me (Godbolt).

bjorn3 commented 3 years ago

Looks like the cfg!() that is supposed to do this is inside libstd instead of inside the is_x86_feature_detected!() macro: https://github.com/rust-lang/stdarch/blob/d385078ee7cd8656c4a43bcc321578d0f65ba691/crates/std_detect/src/detect/macros.rs#L97

I opened https://github.com/rust-lang/stdarch/issues/1135 for this.

hkratz commented 3 years ago

The compatible API in simdutf8 v0.1.1 is now as fast on small valid inputs, and up to 22 times faster on large valid non-ASCII strings (three times as fast on ASCII), on x86-64 with AVX 2 support.


If there is a way forward to get this into std or core I would be happy to work on it.

pickfire commented 3 years ago

@hkratz What about ARM and CPUs without SIMD? Will there be a regression?

hkratz commented 3 years ago

@hkratz What about ARM and CPUs without SIMD? Will there be a regression?

On ARM we would just use the current standard library from_utf8() implementation until the required SIMD intrinsics are implemented. On older x86 CPUs without SSE 4.2 or AVX 2, we fall back to the scalar implementation at runtime after the initial detection. The overhead is just one additional function-pointer call on subsequent from_utf8() calls.

Voultapher commented 3 years ago

@m-ou-se Having read the entire conversation in this thread, it's not entirely clear to me what is blocking this from progressing. Would this set a precedent for runtime feature detection, or is it something else? I'd appreciate it if you, or someone else from the libs team, could help me and others like @hkratz understand what would be needed to get this into libcore.

bjorn3 commented 3 years ago

@Voultapher The main problem is how to detect which target features are supported from inside libcore. Except for x86, all platforms require asking the OS which target features the CPU supports. libcore by definition doesn't know anything about the OS and as such can't know which target features are supported. It is not safe to use target features before checking that they are supported: the results of doing so range from crashes to unexpected instructions executing, and in any case it will not produce correct results.

lemire commented 3 years ago

The Rust runtime does rely on runtime feature detection, but, as far as I can tell, only indirectly (e.g., by calling memcpy).

Languages like Go, Java, C/C++... quite often use runtime feature detection as part of their core libraries.

A sensible way forward would be to add runtime CPU feature detection directly in the Rust core.

Voultapher commented 3 years ago

I agree; there are enough examples demonstrating that it can be done within the larger constraints of the Rust language. A potential nice side effect would be efficient and correct feature detection in the standard library; there are enough libraries and tools (ripgrep, memchr, etc.) that have to hand-roll it today. And tbh, intrinsics support feels incomplete without feature detection.

bjorn3 commented 3 years ago

CPU feature detection is supported in libstd, since that is the part of the standard library that is allowed to interface with the OS, but it is not supported in libcore. How should CPU feature detection work when compiling a kernel against libcore? That would require some way to pass the supported features to libcore. And what about a kernel that works today on stable rustc and doesn't pass the supported features to libcore, since no such mechanism currently exists? How can target feature detection be added to libcore without breaking such a kernel?

Voultapher commented 3 years ago

To clarify, I poorly worded my earlier comment. I'm thinking about both feature detection and the corresponding dispatch, for example implemented via atomic function pointer overwrite.

I'm not deep enough into the Rust standard library to fully understand the implications of what you said; I'm looking at this from a user perspective. Many Rust applications use the standard library, not just libcore. Parts of the standard library interface directly or transitively perform UTF-8 validation. Somewhere in this stack of abstractions, feature detection and 'dispatch' could happen so that more efficient versions are used.

Naively, if it can't easily be added to libcore, why not add it to the stdlib only, as a first step?

hkratz commented 3 years ago

CPU feature detection is supported in libstd, since that is the part of the standard library that is allowed to interface with the OS, but it is not supported in libcore. How should CPU feature detection work when compiling a kernel against libcore? That would require some way to pass the supported features to libcore. And what about a kernel that works today on stable rustc and doesn't pass the supported features to libcore, since no such mechanism currently exists? How can target feature detection be added to libcore without breaking such a kernel?

One way would be to tell the core library that SIMD implementations are supported (core::somewhere::set_cpu_features_enabled(...)), or to set a callback which returns whether they are supported (core::somewhere::set_cpu_features_check_fn(...)), from the std startup code.

For aarch64, detection would not even be needed, because ASIMD/NEON is enabled by default on nightly Rust, so LLVM already autovectorizes to ASIMD/NEON code.
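
A sketch of what such a hook could look like (all names are hypothetical; no such API exists in core today):

mod core_sketch {
    use core::sync::atomic::{AtomicU32, Ordering};

    pub const FEATURE_SSE42: u32 = 1 << 0;
    pub const FEATURE_AVX2: u32 = 1 << 1;

    // libcore's conservative default: assume no optional CPU features.
    static CPU_FEATURES: AtomicU32 = AtomicU32::new(0);

    /// Hypothetical: called once by std's startup code (or by a kernel
    /// that knows its hardware) after OS-assisted detection has run.
    pub fn set_cpu_features_enabled(features: u32) {
        CPU_FEATURES.store(features, Ordering::Relaxed);
    }

    /// Queried by e.g. from_utf8 to pick a SIMD path; callers that never
    /// install any features simply keep the scalar path.
    pub fn cpu_features() -> u32 {
        CPU_FEATURES.load(Ordering::Relaxed)
    }
}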

hkratz commented 3 years ago

Naively, if it can't easily be added to libcore, why not add it to the stdlib only, as a first step?

std::str::from_utf8() is just an alias (via pub use) for core::str::from_utf8(). Changing that would be a breaking change.