tafia / quick-xml

Rust high performance xml reader and writer
MIT License

SIMD accelerated escape routines #405

Open dralley opened 2 years ago

dralley commented 2 years ago

Unlike the unescape routines, the routines for escaping text don't currently use any SIMD acceleration.

This should be possible to do via the jetscii crate. memchr is currently used by the unescape routines, but while it is supposed to be slightly faster than jetscii, it is also more limited: it can only search for up to 3 different bytes at a time, whereas jetscii can handle up to 16. Since escaping text requires searching for up to 5 characters (<, >, &, " and '), memchr is not an option but jetscii is.
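To make the shape of the problem concrete, here is a minimal sketch of an escape routine built around a "find the next special byte" search. It is not quick-xml's actual code: `find_special` and `escape` are hypothetical names, and the search is a plain scalar scan using only the standard library. The assumption is that a SIMD searcher for the same five bytes (for example one built with jetscii) could be dropped in where `find_special` is called.

```rust
use std::borrow::Cow;

/// Scalar stand-in for a SIMD searcher (hypothetical helper): returns the
/// index of the first byte that needs escaping, if any.
fn find_special(haystack: &[u8]) -> Option<usize> {
    haystack
        .iter()
        .position(|&b| matches!(b, b'<' | b'>' | b'&' | b'"' | b'\''))
}

/// Escapes the five XML special characters. Runs of ordinary text between
/// special bytes are copied in bulk, and input that needs no escaping is
/// returned borrowed, without allocating.
fn escape(raw: &str) -> Cow<'_, str> {
    let bytes = raw.as_bytes();
    let mut escaped: Option<String> = None;
    let mut pos = 0;

    while let Some(offset) = find_special(&bytes[pos..]) {
        let idx = pos + offset;
        // Allocate lazily, only once the first special byte is found.
        let out = escaped.get_or_insert_with(|| String::with_capacity(raw.len() + 8));
        // `pos` and `idx` always sit next to ASCII bytes, so slicing the
        // `str` here cannot split a multi-byte UTF-8 character.
        out.push_str(&raw[pos..idx]);
        out.push_str(match bytes[idx] {
            b'<' => "&lt;",
            b'>' => "&gt;",
            b'&' => "&amp;",
            b'"' => "&quot;",
            b'\'' => "&apos;",
            _ => unreachable!("find_special only returns the five bytes above"),
        });
        pos = idx + 1;
    }

    match escaped {
        Some(mut out) => {
            // Copy whatever follows the last special byte.
            out.push_str(&raw[pos..]);
            Cow::Owned(out)
        }
        None => Cow::Borrowed(raw),
    }
}
```

With this structure, only the search step would need to change to benefit from SIMD. The replacement table and the bulk copies stay the same.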

jetscii also seems capable of searching for byte sequences as well as single bytes, so it could potentially be used with UTF-16 and other multibyte encodings in the future (but I don't think you can search for multiple byte-sequence patterns at the same time, so there are limitations to this).
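As an illustration of why multibyte encodings need sequence-aware searching, here is a small standard-library sketch for UTF-16LE, where `<` is encoded as the byte pair 0x3C 0x00. It does not use jetscii, and `to_utf16le` and `find_lt_utf16le` are hypothetical helpers. It shows that a single-byte search for 0x3C can hit part of another character, and that even a byte-pair match is only valid when it starts on a code-unit boundary.

```rust
/// Encodes a string as UTF-16LE bytes (hypothetical helper for the example).
fn to_utf16le(s: &str) -> Vec<u8> {
    s.encode_utf16().flat_map(|unit| unit.to_le_bytes()).collect()
}

/// Finds the byte offset of the first `<` in UTF-16LE data by comparing
/// whole, aligned 2-byte code units against the pair 0x3C 0x00.
fn find_lt_utf16le(bytes: &[u8]) -> Option<usize> {
    bytes
        .chunks_exact(2)
        .position(|unit| unit[0] == 0x3C && unit[1] == 0x00)
        .map(|unit_index| unit_index * 2)
}

/// Naive single-byte search, shown only to demonstrate the false positive.
fn find_byte_0x3c(bytes: &[u8]) -> Option<usize> {
    bytes.iter().position(|&b| b == 0x3C)
}
```

For example, U+013C (`ļ`) encodes as 0x3C 0x01, so for the text `ļ<` the naive byte search stops at offset 0 while the aligned search correctly reports offset 2. Likewise, the two characters U+3C41 and U+0100 encode as 0x41 0x3C 0x00 0x01, which contains the pair 0x3C 0x00 at an odd offset even though the text has no `<` in it.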

Benchmark coverage needs to be added first: https://github.com/tafia/quick-xml/issues/404

dralley commented 2 years ago

Preview: [image]

Mingun commented 2 years ago

Is this the effect of switching to jetscii?

dralley commented 2 years ago

Yes, and it only requires about 3 lines of change. I'm going to see if it can be improved any further and whether the occasional regressions can be eliminated.

dralley commented 5 months ago

Some reading material, not so much about escape routines specifically as about parsing XML (actually HTML) in general:

https://lemire.me/blog/2024/06/08/scan-html-faster-with-simd-instructions-chrome-edition/