powturbo / Turbo-Base64

Turbo Base64 - Fastest Base64 SIMD:SSE/AVX2/AVX512/Neon/Altivec - Faster than memcpy!
GNU General Public License v3.0
280 stars 41 forks source link

Low performance on short strings. #2

Closed alexey-milovidov closed 4 years ago

alexey-milovidov commented 4 years ago

https://github.com/ClickHouse/ClickHouse/issues/8397#issuecomment-568909002

The library behaves worse than https://github.com/aklomp/base64 on strings of average length 77 bytes:

:) SELECT avg(length(URL)) FROM test.hits

SELECT avg(length(URL))
FROM test.hits

┌──avg(length(URL))─┐
│ 77.54074297450794 │
└───────────────────┘

You can download the test data here: https://clickhouse.yandex/docs/en/getting_started/example_datasets/metrica/

powturbo commented 4 years ago

I've optimzed all the Turbo Base64 functions also for short inputs. It is also possible to disable checking for more faster decoding by compiling Turbo Base64 with

make NCHECK=1

powturbo commented 4 years ago

Now it is also possible to do a direct call to the archtitecture dependent functions instead of tb64enc+tb64dec. Use _tb64e+_tb64d (after calling initialization tb64ini at the program start) instead of tb64enc+tb64dec. This saves a check + a non-inlined function call.

powturbo commented 4 years ago

There are now optimized functions for short strings. Actually only for avx2. Just call the new tb64ini with the parameter isshort = 1 at the start of the program. See turbob64.h

alexey-milovidov commented 4 years ago

This saves a check + a non-inlined function call.

Perfect! I will try right now...

alexey-milovidov commented 4 years ago

We run all our tests with UBSan and it said that better to replace unaligned stores with memcpy:

https://clickhouse-test-reports.s3.yandex.net/8444/c92e7e742c35a653a1244317ad28fddc4b6ba617/functional_stateless_tests_(ubsan)/stderr.log

https://clickhouse-test-reports.s3.yandex.net/8444/c92e7e742c35a653a1244317ad28fddc4b6ba617/functional_stateless_tests_(ubsan).html

powturbo commented 4 years ago

Unaligned access is used only for 32-bits integers using the ctou32 macro. You can see in "conf.h" all the recent cpu's intel/amd, arm 32/64 bits, powerpc,... are supporting unaligned 32-bits access. If it's necessary, I can replace this macro with memcpy using a preprocessor switch in "turbob64c.c" and "turbob64d.c"

alexey-milovidov commented 4 years ago

Yes, it's 100% safe from the CPU standpoint but it isn't according to the C or C++ standard. Replacing with memcpy would not impact performance in any kind.

powturbo commented 4 years ago

I've changed the unaligned access to memcpy and made the decoding 5% more faster. I'll optimize late SSE, AVX and ARM for short strings.

alexey-milovidov commented 4 years ago

Thank you! Let's see what our CI will show...

powturbo commented 4 years ago

You must always include turbob64sse in your builds (cmake files), not only for amd64. see turbobase64 makefile Please, upload the latest version. More faster for very short strings.

alexey-milovidov commented 4 years ago

Ok. BTW, performance test is finished to run and we see significant performance improvement!

https://clickhouse-test-reports.s3.yandex.net/8444/aa979eb438aa06c082d6825d11f40b5e402761a3/performance_test/comparison_min_time_gcc_9.html

(the queires with base64 are near the top)

alexey-milovidov commented 4 years ago

I'll close this issue as it is completely resolved. And I would like to especially thank you for your help!

I will try to finish the integration of Turbo-Base64 to our product in the nearest days...

alexey-milovidov commented 4 years ago

There is one minor issue remains to integrate this library: https://github.com/ClickHouse/ClickHouse/pull/8444#issuecomment-573323386 If you are interested you can finish and open another PR.

powturbo commented 4 years ago

Hi, I've added the checking for the short strings decoding functions. Due to other overheads in ClickHouse, you can't see the full speed advantage of turbobase64, but in my short strings tests it is 3 times faster in encoding and 3.5-4 times faster in decoding (with checking) than your current base64. see short strings benchmark I don't know the origin of the base64 strings in ClickHouse, but if they are coming from an external source, a database should normally do a one time check at insertion. If you're self generating the base64 strings, an additional checking is meaningless. Please reopen the PR, I'm very interested in this subject.

powturbo commented 4 years ago

Turbobase64 now extended to do full checking per default in the short string functions. Functional tests with full checking are successful.

alexey-milovidov commented 4 years ago

I have updated the library and also added -DB64CHECK but it didn't help.

powturbo commented 4 years ago

You must update the library with latest changes 3 hours ago as in my last comment. The short strings functions are now doing the check per default. I've tested this and it's working.

alexey-milovidov commented 4 years ago

Now it looks all right, thank you!

alexey-milovidov commented 4 years ago

Congratulations, it's merged!

PS. You can add a reference, tweet, whatever you want!

powturbo commented 4 years ago

Great, thank you. Your team might have a look also at TurboPFor. There are not only the best/fastest integer compression functions, but also interesting lossless and lossy floating-point compression in fp.c. See also the error-controlled lossy convertion FP functions in bitutil.c You can build or download icapp to test all functions with your data.

alexey-milovidov commented 4 years ago

Yes, we have high demand for efficient codecs. We will try to integrate your libraries and I appreciate if you will help!

(It will be done not by me, this is the task for intern students to try. Will see how it will go...)

alexey-milovidov commented 4 years ago

BTW, it's possible for libraries with compatible license (not GPL).

powturbo commented 4 years ago

Nice to see you're evaluating the integration of other components. I can probably make a separate non-GPL package with the functions you're interested in.