jbapple / libfilter

High-speed Bloom filters and taffy filters for C, C++, and Java
Apache License 2.0
32 stars 6 forks source link

_mm256_testc_si256 performance is bad #5

Closed weijietong closed 2 years ago

weijietong commented 2 years ago

@jbapple I test the blocked bloom filter (which is https://github.com/FastFilter/fastfilter_cpp/blob/master/src/bloom/simd-block.h version) in my codes. The vtune result shows that Find avx2 implementation's _mm256_testc_si256 opcode's performance is bad.

image

So, do we have other tricks to improve it ? Maybe a prefetch will help ?

weijietong commented 2 years ago

To be clear, at my test, the bloom filter hash value test totally elapsed 5.5s , the vptest instruction code takes 4.94s, the vptest occupies the largest performance.

jbapple commented 2 years ago

How large was your filter? What processor were you using?

weijietong commented 2 years ago

I runed the tpch-300 Q7, the vtune result is captured by running the Q7 circularly.The result shows that the chocke point is the vpptest. The largest filter size is 1048576(calculated by 1ull << (_log_num_buckets + LOG_BUCKET_BYTE_SIZE)). My processor is intel skylake. The cpu cache size is 33792 KB.

jbapple commented 2 years ago

Yes, I think pre-fetching might help, if you're doing a lot of lookups in bulk. I can take that as a feature request.

What DBMS are you using?

weijietong commented 2 years ago

alicloud's hologres

jbapple commented 2 years ago

I've tested out prefetching and I can't seem to get a performance benefit on my machine from filters as small as 1MiB, though it's close for larger filters.

weijietong commented 2 years ago

@jbapple https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_testc_si256&ig_expand=7216 the doc shows that vptest has a high latency at the skylake architecture cpu.

weijietong commented 2 years ago

what about replace it with two operations: a bitwise-and operation and then a popcount operation

weijietong commented 2 years ago

I implemented the prefetch logic,got a minor performance improvement.

jbapple commented 2 years ago

I implemented the prefetch logic,got a minor performance improvement.

Great! Can you show me how you did it, since our results differed? I'm wondering if I did something slow.

weijietong commented 2 years ago

I am pleasure to share. But the prefetch is implemented outside this lib and is related to my own codes. Basic logic is to hold 32 number of hash values which are prefetched first(a prefetch is to prefetch a bucket), then to do the actual test of the held hash values. If you share what you wrote ,I can point it out.