oxigraph / rio

RDF parsers library
Apache License 2.0

RDF/XML parser could be easily optimized #25

Open Tpt opened 4 years ago

Tpt commented 4 years ago

The current RDF/XML parser is quite naive: it copies the latest context each time an opening tag is read. I believe the parser could easily be sped up by avoiding these copies.
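One possible way to avoid the copies, shown here only as a minimal sketch: keep a stack of per-element contexts whose inherited fields are reference-counted, so that opening a tag clones a pointer rather than the underlying string. The `Context` and `ContextStack` types and their fields are hypothetical and are not taken from rio's code; the sketch assumes the context mainly carries the values of `xml:base` and `xml:lang`.

```rust
use std::rc::Rc;

/// Hypothetical per-element context. Inherited values sit behind `Rc<str>`,
/// so cloning one only bumps a reference count instead of copying the string.
#[derive(Clone)]
struct Context {
    base_iri: Option<Rc<str>>,
    language: Option<Rc<str>>,
}

/// Hypothetical stack holding one context per currently open element.
struct ContextStack {
    frames: Vec<Context>,
}

impl ContextStack {
    fn new() -> Self {
        // The root frame has nothing to inherit and is never popped.
        Self { frames: vec![Context { base_iri: None, language: None }] }
    }

    /// Called for an opening tag. A new string is allocated only when the
    /// element overrides `xml:base` or `xml:lang`; otherwise the parent's
    /// value is shared.
    fn open_tag(&mut self, xml_base: Option<&str>, xml_lang: Option<&str>) {
        let parent = self.frames.last().expect("root frame is never popped");
        let child = Context {
            base_iri: xml_base
                .map(Rc::<str>::from)
                .or_else(|| parent.base_iri.clone()),
            language: xml_lang
                .map(Rc::<str>::from)
                .or_else(|| parent.language.clone()),
        };
        self.frames.push(child);
    }

    /// Called for a closing tag: drops the context of the element being closed.
    fn close_tag(&mut self) {
        if self.frames.len() > 1 {
            self.frames.pop();
        }
    }

    /// Context of the innermost open element.
    fn current(&self) -> &Context {
        self.frames.last().expect("root frame is never popped")
    }
}
```

Elements that do not override anything then cost two reference-count increments on open and two decrements on close, whatever the length of the base IRI.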

thadguidry commented 3 years ago

Another area to consider is checking that processor intrinsics are actually used wherever they are available.

A few things I've thought of as I've perused your code:

Intrinsic String Compare within XML

#[target_feature(enable = "sse4.2")] (SSE4.2 is available on Intel processors from Nehalem onward) could be used for a lot of the string comparison being done in the XML parser.rs. I don't know the Rust ecosystem well, but I noticed the intrinsic functions defined here https://doc.rust-lang.org/std/intrinsics/index.html and didn't see any mm_cmp_xxxxx (string compare) functions, so I'm not sure how Rust handles this. Perhaps it defers to LLVM at times, but then the functions would need to be conditionally aligned for that and compiler hints added. (I'm more familiar with how Java deals with this, through @IntrinsicCandidate annotations and so on. To see whether intrinsic methods are being used, and where in the compiled code, you add: -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining)

Here are the String Compare functions introduced with SSE4.2: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=898,2862,2861,2860,2863,2864,2865&techs=SSE4_2&cats=String%20Compare

and discussed in this developer article:

https://software.intel.com/content/www/us/en/develop/articles/schema-validation-with-intel-streaming-simd-extensions-4-intel-sse4.html

Intrinsic escaping with SIMD

Escaping in api\src\model.rs could possibly be made faster with SIMD instructions, using something like https://docs.rs/v_escape/0.15.0/v_escape/ which is written about in more detail here: https://brandur.org/nanoglyphs/008-actix#simd-escape (there might also be something in Rust core or the ecosystem that can do this now). I also noticed this in the notes for the Rust 1.27 release from 2 years ago: https://github.com/rust-lang/rust/blob/master/RELEASES.md#libraries-22

SIMD (Single Instruction Multiple Data) on x86/x86_64 is now stable. This includes arch::x86 & arch::x86_64 modules which contain SIMD intrinsics, a new macro called is_x86_feature_detected!, the #[target_feature(enable="")] attribute, and adding target_feature = "" to the cfg attribute.
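As a rough illustration of how the pieces named in that release note fit together, here is a minimal sketch that finds the first XML-special byte with the PCMPESTRI-based `_mm_cmpestri` intrinsic from `std::arch::x86_64`, selected at runtime through `is_x86_feature_detected!` and backed by a scalar fallback. The function names and the choice of byte set are made up for this example and do not come from rio. Whether it actually beats the scalar loop on real RDF/XML input would have to be benchmarked.

```rust
/// Portable fallback: index of the first byte that needs XML escaping.
fn find_special_scalar(haystack: &[u8]) -> Option<usize> {
    haystack
        .iter()
        .position(|&b| matches!(b, b'<' | b'>' | b'&' | b'"' | b'\''))
}

/// SSE4.2 version: compares up to 16 haystack bytes per instruction.
///
/// Safety: the caller must make sure the CPU supports SSE4.2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn find_special_sse42(haystack: &[u8]) -> Option<usize> {
    use std::arch::x86_64::*;

    // Byte operands, "equal any" aggregation, report the lowest matching index.
    const MODE: i32 = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT;

    // The set of bytes to look for, padded to one 16-byte register.
    let mut needles = [0u8; 16];
    needles[..5].copy_from_slice(b"<>&\"'");
    let set = _mm_loadu_si128(needles.as_ptr() as *const __m128i);

    let mut offset = 0;
    let mut chunks = haystack.chunks_exact(16);
    for chunk in &mut chunks {
        // Full chunks are loaded straight from the input (unaligned load).
        let block = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let idx = _mm_cmpestri::<MODE>(set, 5, block, 16) as usize;
        if idx < 16 {
            return Some(offset + idx);
        }
        offset += 16;
    }

    // The tail is copied into a padded buffer so the load never reads past
    // the end of the slice; the explicit length makes the padding invisible.
    let tail = chunks.remainder();
    let mut buf = [0u8; 16];
    buf[..tail.len()].copy_from_slice(tail);
    let block = _mm_loadu_si128(buf.as_ptr() as *const __m128i);
    let idx = _mm_cmpestri::<MODE>(set, 5, block, tail.len() as i32) as usize;
    if idx < tail.len() {
        Some(offset + idx)
    } else {
        None
    }
}

/// Picks the SSE4.2 path when the running CPU supports it.
fn find_special(haystack: &[u8]) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.2") {
            // Sound because the feature was checked on the line above.
            return unsafe { find_special_sse42(haystack) };
        }
    }
    find_special_scalar(haystack)
}
```

Runtime detection keeps a single binary usable on CPUs without SSE4.2, at the cost of one cached feature check per call.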

thadguidry commented 1 year ago

@Tpt Looks like there is already a library, jetscii, that handles 8-bit or 16-bit characters, since it uses the PCMPESTRI and PCMPESTRM instructions on CPUs that support SSE4.2 (and there are other String Compare functions, as noted in my previous comment). Some benchmarks are noted on quick-xml, which we already use in the parser. So maybe an easy performance win? The other areas might be serialization and hashing, where I leave it to others to find appropriate SIMD libraries in the Rust ecosystem.