Open yoshuawuyts opened 4 years ago
assembly for str::from_utf8 (godbolt) - 57 lines
This does not do any validation; it just calls core::str::from_utf8, which is not part of the assembly.
@CryZe ah, my bad. Yeah I was confused about the exact output, so for good measure I also copied over std's algorithm into godbolt (third link) to see what would happen. Thanks for clarifying what's actually going on!
UTF-8 validation is hopefully not where the compiler spends its time :)
However, I could imagine this having some impact on "smallest possible" compile times (e.g., UI tests, hello world).
My recommendation is to replace the algorithm in core::str::from_utf8 (or wherever it is in core) with direct use of AVX2 or some similar instruction set, and we can then run that by perf.rust-lang.org as a loose benchmark. That's likely not tenable in reality (we would need to conditionally, likely at runtime, gate use of SIMD instructions) but I believe it would give the best possible performance wins (since it would apply to all uses of from_utf8).
Here's pretty much a 1:1 port: https://godbolt.org/z/NSop8w
Going to add the benchmarks from isutf8 (run on my machine with RUSTFLAGS='-C target-cpu=native' cargo +nightly bench):
Running target/release/deps/validate_utf8-6efda746d5fb1b3a
Gnuplot not found, disabling plotting
random_bytes/libcore time: [4.9977 ns 5.0048 ns 5.0117 ns]
thrpt: [ 97428 GiB/s 97562 GiB/s 97702 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low severe
2 (2.00%) low mild
1 (1.00%) high mild
random_bytes/lemire_sse time: [69.970 us 69.985 us 70.000 us]
thrpt: [6.9755 GiB/s 6.9769 GiB/s 6.9784 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) low mild
random_bytes/lemire_avx time: [62.055 us 62.084 us 62.109 us]
thrpt: [7.8617 GiB/s 7.8649 GiB/s 7.8685 GiB/s]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) low mild
random_bytes/lemire_avx_ascii_path
time: [62.549 us 62.582 us 62.615 us]
thrpt: [7.7982 GiB/s 7.8023 GiB/s 7.8064 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) low mild
random_bytes/range_sse time: [62.763 us 62.772 us 62.782 us]
thrpt: [7.7774 GiB/s 7.7787 GiB/s 7.7798 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low severe
2 (2.00%) high mild
random_bytes/range_avx time: [45.737 us 45.746 us 45.755 us]
thrpt: [10.672 GiB/s 10.674 GiB/s 10.676 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
mostly_ascii/libcore time: [166.94 ns 167.23 ns 167.54 ns]
thrpt: [17.988 GiB/s 18.022 GiB/s 18.053 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
mostly_ascii/lemire_sse time: [430.20 ns 430.45 ns 430.68 ns]
thrpt: [6.9977 GiB/s 7.0014 GiB/s 7.0054 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) low mild
mostly_ascii/lemire_avx time: [382.89 ns 383.07 ns 383.25 ns]
thrpt: [7.8637 GiB/s 7.8673 GiB/s 7.8711 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low mild
5 (5.00%) high mild
2 (2.00%) high severe
mostly_ascii/lemire_avx_ascii_path
time: [65.208 ns 65.255 ns 65.311 ns]
thrpt: [46.145 GiB/s 46.184 GiB/s 46.218 GiB/s]
Found 24 outliers among 100 measurements (24.00%)
2 (2.00%) low severe
12 (12.00%) low mild
2 (2.00%) high mild
8 (8.00%) high severe
mostly_ascii/range_sse time: [384.43 ns 384.64 ns 384.87 ns]
thrpt: [7.8307 GiB/s 7.8352 GiB/s 7.8395 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
2 (2.00%) high mild
mostly_ascii/range_avx time: [286.05 ns 286.26 ns 286.47 ns]
thrpt: [10.521 GiB/s 10.528 GiB/s 10.536 GiB/s]
ascii/libcore time: [72.975 ns 73.076 ns 73.189 ns]
thrpt: [39.307 GiB/s 39.368 GiB/s 39.422 GiB/s]
Found 14 outliers among 100 measurements (14.00%)
6 (6.00%) high mild
8 (8.00%) high severe
ascii/lemire_sse time: [423.11 ns 423.35 ns 423.62 ns]
thrpt: [6.7912 GiB/s 6.7954 GiB/s 6.7993 GiB/s]
ascii/lemire_avx time: [373.82 ns 374.45 ns 375.43 ns]
thrpt: [7.6628 GiB/s 7.6830 GiB/s 7.6958 GiB/s]
Found 10 outliers among 100 measurements (10.00%)
9 (9.00%) low mild
1 (1.00%) high severe
ascii/lemire_avx_ascii_path
time: [50.353 ns 50.588 ns 50.925 ns]
thrpt: [56.492 GiB/s 56.869 GiB/s 57.133 GiB/s]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
ascii/range_sse time: [375.11 ns 375.87 ns 376.96 ns]
thrpt: [7.6318 GiB/s 7.6538 GiB/s 7.6695 GiB/s]
Found 35 outliers among 100 measurements (35.00%)
23 (23.00%) low severe
8 (8.00%) high mild
4 (4.00%) high severe
ascii/range_avx time: [272.39 ns 272.59 ns 272.82 ns]
thrpt: [10.545 GiB/s 10.554 GiB/s 10.562 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
7 (7.00%) high mild
1 (1.00%) high severe
utf8/libcore time: [9.0154 us 9.0263 us 9.0389 us]
thrpt: [1.7096 GiB/s 1.7119 GiB/s 1.7140 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
utf8/lemire_sse time: [2.1554 us 2.1568 us 2.1581 us]
thrpt: [7.1601 GiB/s 7.1645 GiB/s 7.1693 GiB/s]
Found 15 outliers among 100 measurements (15.00%)
11 (11.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
utf8/lemire_avx time: [1.9184 us 1.9188 us 1.9192 us]
thrpt: [8.0515 GiB/s 8.0530 GiB/s 8.0547 GiB/s]
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low severe
3 (3.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
utf8/lemire_avx_ascii_path
time: [1.4670 us 1.4679 us 1.4691 us]
thrpt: [10.518 GiB/s 10.527 GiB/s 10.534 GiB/s]
utf8/range_sse time: [1.9426 us 1.9452 us 1.9491 us]
thrpt: [7.9280 GiB/s 7.9439 GiB/s 7.9544 GiB/s]
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
1 (1.00%) high mild
2 (2.00%) high severe
utf8/range_avx time: [1.4656 us 1.4733 us 1.4833 us]
thrpt: [10.418 GiB/s 10.489 GiB/s 10.544 GiB/s]
Found 13 outliers among 100 measurements (13.00%)
2 (2.00%) low mild
2 (2.00%) high mild
9 (9.00%) high severe
Benchmarking all_utf8/libcore: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.7s or reduce sample count to 50
all_utf8/libcore time: [1.8757 ms 1.8790 ms 1.8832 ms]
thrpt: [2.1672 GiB/s 2.1721 GiB/s 2.1759 GiB/s]
all_utf8/lemire_sse time: [586.47 us 586.93 us 587.47 us]
thrpt: [6.9474 GiB/s 6.9538 GiB/s 6.9593 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
all_utf8/lemire_avx time: [517.80 us 518.97 us 521.39 us]
thrpt: [7.8279 GiB/s 7.8644 GiB/s 7.8822 GiB/s]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
all_utf8/lemire_avx_ascii_path
time: [523.97 us 524.27 us 524.63 us]
thrpt: [7.7796 GiB/s 7.7849 GiB/s 7.7894 GiB/s]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
all_utf8/range_sse time: [525.94 us 527.21 us 528.57 us]
thrpt: [7.7216 GiB/s 7.7415 GiB/s 7.7601 GiB/s]
Found 21 outliers among 100 measurements (21.00%)
1 (1.00%) low severe
1 (1.00%) low mild
8 (8.00%) high mild
11 (11.00%) high severe
all_utf8/range_avx time: [392.25 us 392.91 us 393.80 us]
thrpt: [10.364 GiB/s 10.388 GiB/s 10.405 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
8 (8.00%) high severe
all_utf8_with_garbage/libcore
time: [3.6752 ns 3.7353 ns 3.8034 ns]
thrpt: [1137275 GiB/s 1158024 GiB/s 1176952 GiB/s]
Found 12 outliers among 100 measurements (12.00%)
2 (2.00%) low mild
2 (2.00%) high mild
8 (8.00%) high severe
all_utf8_with_garbage/lemire_sse
time: [616.73 us 616.89 us 617.03 us]
thrpt: [7.0103 GiB/s 7.0119 GiB/s 7.0136 GiB/s]
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) low severe
2 (2.00%) low mild
1 (1.00%) high severe
all_utf8_with_garbage/lemire_avx
time: [551.11 us 552.09 us 554.06 us]
thrpt: [7.8070 GiB/s 7.8349 GiB/s 7.8488 GiB/s]
Found 17 outliers among 100 measurements (17.00%)
5 (5.00%) low severe
9 (9.00%) low mild
1 (1.00%) high mild
2 (2.00%) high severe
all_utf8_with_garbage/lemire_avx_ascii_path
time: [557.08 us 557.28 us 557.51 us]
thrpt: [7.7587 GiB/s 7.7618 GiB/s 7.7646 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low mild
7 (7.00%) high mild
all_utf8_with_garbage/range_sse
time: [554.79 us 555.10 us 555.42 us]
thrpt: [7.7879 GiB/s 7.7924 GiB/s 7.7967 GiB/s]
all_utf8_with_garbage/range_avx
time: [417.05 us 417.38 us 417.74 us]
thrpt: [10.355 GiB/s 10.364 GiB/s 10.372 GiB/s]
Found 13 outliers among 100 measurements (13.00%)
13 (13.00%) low mild
I re-read some of the Lemire algorithm and there are some key differences which might make it less suitable for general string validation.
The two key points are:
- It requires the string to be divisible by 256 bytes; if it is not, it will copy the last part of it to a buffer.
- It allocates a struct for each check.
I think :tm: that might make them slower for small payloads.
I'm at work now, so I will read this thread later in more detail, but here are some things I want to say:
The latter even includes benchmarks against the compiler's algorithm (which makes it probable I'm not the first person to think of this).
Indeed. I created the crate originally to be included in the Rust internals. I still feel that some of the algorithms need a little bit of refactoring to make them more idiomatic, before being included. I also want to add @Zwegner's algorithm (https://github.com/zwegner/faster-utf8-validator), which seems to be the fastest algorithm that has been invented to date.
But I haven't been able to successfully compile the benches, so I don't know how they stack up against the current implementation.
If the build fails, can you open an issue? :) I would appreciate that.
argnidagur/rust-isutf8
Also: You misspelled my name ;)
It requires the string to be divisible by 256 bytes; if it is not, it will copy the last part of it to a buffer.
Is that the AVX version or something? The SSE version just allocates 16 bytes on the stack for the remaining <16 bytes.
It allocates a struct for each check.
I don't even know what this means. Are we talking about this one?
struct processed_utf_bytes {
    __m128i rawbytes;
    __m128i high_nibbles;
    __m128i carried_continuations;
};
This one gets completely optimized away.
Is that the AVX version or something? The SSE version just allocates 16 bytes.
Sorry, a bit tired, bit not byte :)
This one gets completely optimized away.
Good point
Since this would have to go into core, can core even use runtime checks for target features yet (we need at least SSE 4.1 it seems)?
In reply to @Licenser:
it's built to just validate, not to find the error, so there is no tracking of where an error is in the string
I have thought about this and we have a few options: with the unlikely! macro (maybe a compile-time feature called optimize_for_correct), this number would probably decrease. The 5% number comes from Zwegner's findings IIRC. We would need to think more about this.
Since this would have to go into core, can core even use runtime checks for target features yet (we need at least SSE 4.1 it seems)?
It can be compile-time only for now. Those who self-compile (such as myself) would see an improvement.
Hey, thanks for the shout out. I want to note that I have made several improvements to my code that aren't present in the master branch (but are in the wip branch), and it is significantly faster than when it was published (about 40%). Most importantly, I was able to completely elide the separate check for the first continuation byte.
When not memory-bandwidth bound, my algorithm is now over 2x faster (in my tests, on AVX2) than the one in Daniel Lemire's original repo (the one described in the blog post), and my SSE4 path is faster than the AVX2 path of that repo. The algorithm used in simdjson has made some improvements, but last I checked I think my algorithm is still faster.
I still need to finish writing up documentation for the new algorithm. The code's definitely more hairy, from dealing with more architectures (AVX-512 and NEON), handling pointer alignment, generating lookup tables with a Python script, etc... But I'm happy to help out if anyone wants to use it/port it.
@Licenser Are you still working on this?
I ported the validator (https://github.com/simd-lite/faster-utf8-validator-rs), but from what I understand this all falls apart because the stdlib isn't able to use native CPU features under the hood?
It would need to be gated under runtime CPUID checks, or left disabled by default and gated under a cfg(target_feature=...) that is only true if users rebuild libstd with those target features enabled (currently an unstable Cargo feature, but it should get more accessible in the future). The latter is easy but only helps software that can afford to run only on newer processors; the former faces the challenge of not regressing performance, but would help more users if it worked.
ifuncs or similar mechanisms might also be an option on certain platforms, but they're not very portable and have obscure and hard-to-satisfy constraints. Manually emulating them with function pointers initialized on first call might have more overhead, not sure.
My understanding is that benchmarks can be misleading for AVX (and possibly SSE) in general purpose code, because:
- AVX has a startup time that won't be measured when repeatedly looping in a microbenchmark.
- AVX draws a lot of power and can end up slowing down other cores due to power draw and increased heat.
Curious what others think about the suitability? Does it make sense only for sufficiently large strings?
This article goes into detail on when it makes sense to use AVX (AVX-512 especially). The most relevant parts:
Downclocking, when it happens, is per core and for a short time after you have used particular instructions (e.g., ~2ms).
The bar for light AVX-512 is lower. Even if the work is spread across all cores, you may only get a 15% frequency penalty on some chips like a Xeon Gold. So you only have to check that AVX-512 gives you a greater than 15% gain for your overall application on a per-cycle basis.
Some older CPUs downclock all cores when any of them are using AVX2 instructions, but for newer ones, they mostly only affect the core running them. Also, the SIMD instructions used for string validation would fall in the "light" category of instructions as they don't involve the floating-point unit.
As @bprosnitz mentioned, take them with a grain of salt, but microbenchmarks certainly imply there is something to be gained by using an accelerated validator.
Note that we have an updated validator called lookup 4...
https://github.com/simdjson/simdjson/pull/993
It is going to be really hard to beat.
@bprosnitz
My understanding is that benchmarks can be misleading for AVX (and possibly SSE) in general purpose code, because: AVX has a startup time that won't be measured when repeatedly looping in a microbenchmark; AVX draws a lot of power and can end up slowing down other cores due to power draw and increased heat. Curious what others think about the suitability? Does it make sense only for sufficiently large strings?
Firstly, let us set aside AVX-512 and "heavy" (numerical) AVX2 instructions. They have their uses (e.g., in machine learning, simulation). But that's probably not what you have in mind.
This being said...
Regarding power usage, it is generally true that the faster code is the code that uses less energy. So if you can multiply your speed using NEON, SSE, AVX, go ahead. You'll come up on top. It is a bit like being concerned with climate change and observing that buses and trains use more energy than cars. They use more energy in total, but less energy per work done. So you have to hold the work constant if you are going to make comparisons. Does it take more energy to do 4 additions, or to use one instruction that does 4 additions at once?
So SIMD instructions are the public transportation of computing. They are green and should be used as much as possible. (Again, I am setting aside AVX-512 and numerical AVX2 instructions that are something more controversial.)
Regarding the fear that SIMD instructions are somehow exotic and rare, and that if you ever use it, you will trigger a chain reaction of slowness... You are using AVX all the time... Read this commit where the committer identified that the hot function in his benchmark was __memmove_avx_unaligned_erms. You can bet that this function is AVX-based. The Golang runtime uses AVX, Glibc uses AVX, LLVM uses AVX, Java, and so forth. Even PHP uses SIMD instructions for some string algorithms. And yes, Rust programs use AVX or other SIMD instructions.
@hanna-kruppe
Manually emulating them with function pointers initialized on first call might have more overhead
I don't think so. We have a well tested approach. It is as simple as that...
internal::atomic_ptr<const implementation> active_implementation{&detect_best_supported_implementation_on_first_use_singleton};
const implementation *detect_best_supported_implementation_on_first_use::set_best() const noexcept {
return active_implementation = available_implementations.detect_best_supported();
}
So you just need an atomic function pointer.
Obviously, you do not get inlining, but that's about the only cost. Loading an atomic pointer is no more expensive, really, than loading an ordinary pointer. So this is pretty much free... except for the first time. You pay a price of first use, but that's not a fundamental limitation: you could set the best function any time, including at startup.
Note that we have an updated validator called lookup 4...
It is going to be really hard to beat.
I mentioned this off-hand on Twitter, but to clarify, in my benchmarks of pure UTF-8 validation, my code with scalar continuation byte checks and unaligned loads is still faster than lookup4 (or at least my translation of lookup4 to C). The difference depends on compiler (on my HSW, testing 1M of UTF-8, GCC gives .227 (me) vs .319 (L4) cycles/byte, while LLVM has .240 (me) vs .266 (L4)).
The picture gets more complicated in the context of simdjson, when plenty of other code is competing for the same execution ports (and the best algorithm is less clear), but I think in the case of Rust, the pure UTF-8 microbenchmarks are probably more representative.
@lemire to clarify, my benchmarks are already using the lookup4 implementation in simdjson (from the pull request's branch).
@milkey-mouse Fantastic.
cc @jkeiser
Blog post: The cost of runtime dispatch.
Rust's current ASCII validation algorithm validates bytes one at a time, and could likely benefit from similar optimizations.
This was recently implemented in #74066.
Of some relevance, we are publishing a research paper on the topic:
John Keiser, Daniel Lemire, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice & Experience (to appear)
I have just released simdutf8, a std::str::from_utf8()-compatible SIMD UTF-8 validation crate, based on @lemire's lookup4 algorithm, but slightly modified. x86/x86-64 support only so far, but armv7/armv8 neon support is planned.
Apologies if this was already discussed and I missed it, but why is runtime detection needed at all? That seems fundamentally wrong. SIMD should get optionally used at compile time based on the system. Runtime detection should be left to the developer in the form of fat binaries, not the library. That's how everything I know of uses SIMD unless it's doing some kind of JIT, but there's no need for a JIT here.
If the issue is specifically with x86_64 and older CPUs, then you still shouldn't need runtime detection and multiple assembly implementations can be provided based on the supported instruction sets at build time. If the person using the library wants to distribute a binary, they can specify a CPU version to the compiler and build it for that CPU as opposed to a faster one.
Edit: Reading elsewhere, apparently the core library is only available in binary form? Is that correct? Cross-compilation seems impossible if that were really the case. It sounds like the correct answer here is to create a new issue to allow Rust to rebuild the core library as part of the optimized build. Adding runtime detection into the core library seems like barking up the completely wrong tree. Rebuilding would allow optimized implementations of other algorithms in the core library as well, and also vector optimizations at build time. (To be honest, if the core library is actually distributed in binary form, I don't know how it got to be this way; it should get built once upon the first time building anything on your system and the cached build result used from then on.)
Reading elsewhere apparently the core library is only available in binary form? Is that correct? Cross-compilation seems impossible if that were really the case. It sounds like the correct answer here is to create a new issue to allow Rust to rebuild the core library as part of the optimized build.
It is possible to re-compile the standard library yourself. The unstable -Zbuild-std flag of cargo makes this a lot easier. This however increases compilation time, and unless you use -Ctarget-cpu to select a minimum requirement for which CPU your program runs on, SIMD will still not be used. Most of the time people don't use -Ctarget-cpu in combination with -Zbuild-std and thus are stuck with a slow validation method. As it is merely slow and not broken, people are unlikely to realize that they can use -Ctarget-cpu and -Zbuild-std (and if they are on stable they can't even use -Zbuild-std yet). As for cross-compilation, -Zbuild-std is made for this, but you can also use rustup target add to get the standard library compiled for arbitrary supported targets.
(To be honest if the core library is actually distributed in binary form, I don't know how it got to be this way, it should get built once upon the first time building anything on your system and then the cached build result used from thence forward.)
If -Ctarget-cpu=native were used by default, that would make executables compiled with rustc non-portable, as they would only work on your system or newer systems. In addition, -Ctarget-cpu=native simply doesn't work for cross-compilation, as you aren't targeting your native CPU.
Thanks for explaining, but this all seems backwards to me. -Ctarget-cpu=native should be the default, and if people want to distribute a binary then additional options should be specified for the target platform. The fact that rebuilding the standard library is even an unstable option also seems backwards. The standard library at the very least should get built every time you build a project (with cross-project caching of the native version, and local caching if a non-native version is used).
Thanks for explaining, but this all seems backwards to me. -Ctarget-cpu=native should be the default and if people want to distribute a binary then additional options should be specified to specify the target platform. The fact that it's even an unstable option to rebuild the standard library also seems backwards. The standard library at the very least should get built every time you build a project (with cross-project caching of the native version, and local caching if a non-native version is used).
From a packaging standpoint, we would not want to package so many binaries that each is only usable by certain processors that certain people have; instead of creating one binary, we would have to create hundreds. This is also why Linux distributions build with only a very general SIMD baseline that almost everyone has. With runtime detection, a single binary can be kept at the cost of a runtime check when the application starts, while still using a faster version, without many drawbacks. I believe the performance cost is insignificant compared to the cost of building for very specific CPUs and keeping all the binaries for each of them. This is also how ripgrep does SIMD: it uses runtime detection. Try discussing it with a package distributor and you will understand more about why this is the case.
That's how everything I know of uses SIMD unless it's doing some kind of JIT, but there's no need for a JIT here.
To follow up with what @pickfire said:
This is also how ripgrep does SIMD, which is using runtime detection.
And ripgrep is in good company. glibc does the same thing.
ripgrep does detection via its dependencies.
Perhaps my view is just too narrow as I'm used to dealing with only fixed server platforms (only targeting one specific CPU) where we build everything destined for the system and my local Mac OS which has a constrained CPU instruction set (and fat binaries for ARM vs x86_64). Either way, I think it should never be excluded as an option if people wish it to be at compile time. Thanks for taking your time and explaining.
@mlindner Yeah indeed. If I only had to deploy to a fixed set of server platforms where I knew exactly what ISA extensions were needed, then I'd totally skip the complexity of runtime CPU feature detection and just rely on compile time.
And ripgrep is in good company. glibc does the same thing.
So I guess feature detection will be done multiple times? Wouldn't it be better to only have the cost of feature detection once across all crates?
So I guess feature detection will be done multiple times? Wouldn't it be better to only have the cost of feature detection once across all crates?
Feature detection is cached in std. Also, most crates use an atomic function pointer, so the decision of which implementation to use is made only once.
If you use -Ctarget-cpu or -Ctarget-feature to enable target features, "runtime" detection through is_x86_feature_detected!() for those target features known to be enabled at compile time will compile to true without any runtime detection taking place at all.
If you use -Ctarget-cpu or -Ctarget-feature to enable target features, "runtime" detection through is_x86_feature_detected!() for those target features known to be enabled at compile time will compile to true without any runtime detection taking place at all.
That did never work for me (Godbolt).
Looks like the cfg!() that is supposed to do this is inside libstd instead of inside the is_x86_feature_detected!() macro: https://github.com/rust-lang/stdarch/blob/d385078ee7cd8656c4a43bcc321578d0f65ba691/crates/std_detect/src/detect/macros.rs#L97
I opened https://github.com/rust-lang/stdarch/issues/1135 for this.
The compatible API in simdutf8 v0.1.1 is now just as fast on small valid inputs and up to 22 times faster on large valid non-ASCII strings (three times as fast on ASCII) on x86-64 with AVX 2 support.
If there is a way forward to get this into std or core I would be happy to work on it.
@hkratz What about arm and those CPUs without SIMD? Will there be a regression?
@hkratz What about arm and those CPUs without SIMD? Will there be a regression?
On arm we would just use the current standard library from_utf8() implementation until the required SIMD intrinsics are implemented. On older x86 CPUs without SSE 4.2 or AVX 2, we fall back to the scalar implementation at runtime after the initial detection. The overhead is just one additional fn-pointer call on subsequent from_utf8() calls.
@m-ou-se having read the entire conversation in this thread, it's not entirely clear to me what is blocking this from progressing. Would this set a precedent for runtime feature detection, or is it something else? I'd appreciate it if you, or someone else from the libs team, could help me and others like @hkratz understand what would be needed to get this into libcore.
@Voultapher The main problem is how to detect which target features are supported from inside libcore. Except for x86, all platforms require asking the OS which target features are supported by the CPU. libcore by definition doesn't know anything about the OS and as such can't know which target features are supported. It is not safe to use target features before checking if they are supported; the results of doing so can range from crashes to unexpected instructions executing. In any case it will not produce correct results.
The Rust runtime does rely on runtime feature detection, but, as far as I can tell, only indirectly (e.g., by calling memcpy).
Languages like Go, Java, C/C++... do use (quite often) runtime feature detection as part of the core libraries.
A sensible way forward would be to add runtime CPU feature detection directly in the Rust core.
I agree, there are enough examples that demonstrate it can be done within the larger constraints of the Rust language. A potential nice side effect would be efficient and correct feature detection in the standard library; there are enough libraries and tools (ripgrep, memchr, etc.) that have to hand-roll it today. And tbh, intrinsics support feels incomplete without feature detection.
CPU feature detection is supported as part of libstd, as that is the part of the standard library that is allowed to interface with the OS, but libcore is not. How should CPU feature detection work when compiling a kernel against libcore? That would require some way to pass the supported features to libcore. What about a kernel that works right now with stable rustc and doesn't provide the supported features to libcore, as there is currently no such mechanism? How can target feature detection support be added to libcore without breaking such a kernel?
To clarify, I poorly worded my earlier comment. I'm thinking about both feature detection and the corresponding dispatch, for example implemented with an atomic function-pointer overwrite.
I'm not deep enough into the Rust standard library to fully understand the implications of what you said. I'm looking at this from a user perspective. Many Rust applications use the standard library, not only libcore. Parts of the standard library interface directly and transitively make calls that perform UTF-8 validation. Somewhere in this stack of abstractions, feature detection and dispatch could happen so that more efficient versions are used.
Naively, if it can't easily be added to libcore, why not add it to the stdlib only, as a first step?
CPU feature detection is supported as part of libstd, as that is the part of the standard library that is allowed to interface with the OS, but libcore is not. How should CPU feature detection work when compiling a kernel against libcore? That would require some way to pass the supported features to libcore. What about a kernel that works right now with stable rustc and doesn't provide the supported features to libcore, as there is currently no such mechanism? How can target feature detection support be added to libcore without breaking such a kernel?
One way would be to tell the core library that SIMD implementations are supported (core::somewhere::set_cpu_features_enabled(...)) or to set a callback which returns whether they are supported (core::somewhere::set_cpu_features_check_fn(...)) from the std startup code.
For aarch64, detection would not even be needed, because ASIMD/neon is enabled by default on nightly Rust, so LLVM autovectorizes to ASIMD/neon code already.
Naively, if it can't easily be added to libcore, why not add it to the stdlib only as first step?
std::str::from_utf8() is just an alias (via pub use) for core::str::from_utf8(). Changing that would be a breaking change.
Introduction
The "Parsing Gigabytes of JSON per second" post (ArXiv - langdale, lemire) proposes a novel approach for parsing JSON that is fast enough that on many systems it moves the bottleneck to the disk and network instead of the parser. This is done through the clever use of SIMD instructions.
Something that stood out to me from the post is that JSON is required to be valid utf-8, and they had come up with new algorithms to validate utf-8 using SIMD instructions that function much faster than conventional approaches.
Since rustc does a lot of utf-8 validation (each .rs source file needs to be valid utf-8), it got me curious about what rustc currently does. Validation seems to be done by the following routine: https://github.com/rust-lang/rust/blob/2f688ac602d50129388bb2a5519942049096cbff/src/libcore/str/mod.rs#L1500-L1618
This doesn't appear to use SIMD anywhere, not even conditionally. But it's run a lot, so I figured it might be interesting to use a more efficient algorithm for it.
Performance improvements
The post "Validating UTF-8 strings using as little as 0.7 cycles per byte" shows about an order of magnitude performance improvement on validating utf-8, going from 8 cycles per byte parsed to 0.7 cycles per byte parsed.
When passing Rust's validation code through the godbolt decompiler, from_utf8_unchecked outputs 7 instructions, and from_utf8 outputs 57 instructions. In the case of from_utf8, most instructions seem to occur inside a loop, which makes it likely we'll be able to observe a performance improvement by using a SIMD-enabled utf-8 parsing algorithm. Especially since economies of scale would apply here -- it's not uncommon for the compiler to parse several million bytes of input in a run. Any improvements here would quickly add up.
All examples linked have been compiled with -O -C target-cpu=native.
Also, ecosystem libraries such as serde_json perform utf-8 validation in several locations, so they would likely also benefit from performance improvements to Rust's utf-8 validation routines.
Implementation
There are two known Rust implementations of lemire's algorithm available in Rust today:
The latter even includes benchmarks against the compiler's algorithm (which makes it probable I'm not the first person to think of this). But I haven't been able to successfully compile the benches, so I don't know how they stack up against the current implementation.
I'm not overly familiar with rustc's internals. But it seems we would likely want to keep the current algorithm, and through feature detection enable SIMD algorithms. The simdjson library has different algorithms for different architectures, but we could probably start with instructions that are widely available and supported on tier-1 targets (such as AVX2).
These changes wouldn't require an RFC because no APIs would change. The only outcome would be a performance improvement.
Future work
Lemire's post also covers parsing ASCII in as little as 0.1 cycles per byte parsed. Rust's current ASCII validation algorithm validates bytes one at a time, and could likely benefit from similar optimizations.
https://github.com/rust-lang/rust/blob/2f688ac602d50129388bb2a5519942049096cbff/src/libcore/str/mod.rs#L4136-L4141
Speeding this up would likely have ecosystem implications as well. For example HTTP headers must be valid ASCII, and are often performance sensitive. If the stdlib sped up ASCII validation, it would likely benefit the wider ecosystem as well.
Conclusion
In this issue I propose to use a SIMD-enabled algorithm for utf-8 validation in rustc. This seems like an interesting avenue to explore since there's a reasonable chance it might yield a performance improvement for many rust programs.
I'm somewhat excited to have stumbled upon this, but was also surprised no issue had been filed for it yet. I'm a bit self-conscious posting this since I'm not a rustc compiler engineer, but I hope this proves useful!
cc/ @jonas-schievink @nnethercote
References