quickwit-oss / bitpacking

SIMD algorithms for integer compression via bitpacking. This crate is a port of a C library called simdcomp.
MIT License
270 stars 30 forks

please use runtime dispatch via target_feature #3

Closed. BurntSushi closed this issue 6 years ago.

BurntSushi commented 6 years ago

It would be great if this crate didn't require any compilation flags at all, and instead configured itself automatically at runtime based on what the running CPU supports. This process is documented in the std::arch module docs.

Some other nits:

fulmicoton commented 6 years ago

Thanks for your comment. Feedback like this is especially valuable while the crate is young and has no clients other than tantivy.

Let me check that I understand the philosophy behind the best practice for SIMD in Rust: the compilation of those instructions is independent of the target-cpu compilation flag. Whatever the target-cpu, the instructions I requested will be compiled as is. The idea is to produce binaries that embed optimized code using SIMD instructions, along with some scalar polyfill implementation. At runtime, the best implementation can then be selected after sniffing the features available on the CPU.
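In code, the pattern described here might look like the following minimal sketch (the function names are hypothetical, not this crate's API, and the SSSE3 body is just a stand-in for real intrinsics):

```rust
fn pack_scalar(input: &[u32]) -> Vec<u8> {
    // Portable fallback that works on any CPU.
    input.iter().flat_map(|v| v.to_le_bytes()).collect()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn pack_ssse3(input: &[u32]) -> Vec<u8> {
    // A real implementation would use SSSE3 intrinsics here;
    // the scalar body is only a stand-in for this sketch.
    pack_scalar(input)
}

pub fn pack(input: &[u32]) -> Vec<u8> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            // Safe: we just verified the running CPU supports SSSE3.
            return unsafe { pack_ssse3(input) };
        }
    }
    pack_scalar(input)
}
```

The SIMD path is compiled unconditionally (regardless of target-cpu), and the cheap runtime check picks the best available implementation.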

I have a couple of issues with this approach, some of which are specific to my use case, and I would like to know your opinion on the subject.

In my case, the different implementations of the bitpackers create formats that are incompatible with one another. That's because the bitpacking happens at the register level. For instance, when bitpacking over 4 bits using SSE3, the first byte will contain integer 0 and integer 5 (instead of integer 0 and integer 1).

It is possible to create a bitpacker compatible with the SSE3 format using only scalar instructions, but it is likely to be slower than the vanilla scalar bitpacker. If someone is targeting a platform and knows that clients won't have SSE3 (e.g., cellphones), they will most likely prefer the scalar bitpacker format over the slower SSE3-format polyfill.

There is even a use case for using the ScalarBitPacker and the SSE3BitPacker in the same application: while this is not implemented yet, the scalar bitpacker makes it possible to get random access to a range of values.
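To make the layout incompatibility concrete, here is a toy sketch (not this crate's actual formats) of packing 4-bit values in scalar order versus a lane-oriented "vertical" order of the kind SIMD bitpackers use. With four lanes, the first byte pairs value 0 with the value four positions later, so the two layouts produce different bytes for the same input:

```rust
// Scalar layout: consecutive values go into consecutive nibbles.
// (Assumes an even number of values, each fitting in 4 bits.)
fn pack4_scalar(vals: &[u8]) -> Vec<u8> {
    // Byte i holds value 2i (low nibble) and value 2i+1 (high nibble).
    vals.chunks(2).map(|p| (p[0] & 0xf) | (p[1] << 4)).collect()
}

// Vertical layout for one register's worth of data (2 * lanes values):
// lane l holds values l and l + lanes, so with 4 lanes the first byte
// pairs value 0 with value 4 rather than with value 1.
fn pack4_vertical(vals: &[u8], lanes: usize) -> Vec<u8> {
    (0..lanes)
        .map(|l| (vals[l] & 0xf) | (vals[l + lanes] << 4))
        .collect()
}
```

For the input [1, 2, 3, 4, 5, 6, 7, 8], the scalar layout yields [0x21, 0x43, 0x65, 0x87] while the four-lane vertical layout yields [0x51, 0x62, 0x73, 0x84]: same values, incompatible bytes.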

Another thing that concerns me is that many Rustaceans are not aware of the target-cpu flag. In practice it is typically not a game changer... except that when one uses SIMD instructions, compiling for a target-cpu that does not handle those SIMD instructions spectacularly destroys performance. I can only hope that people read the README and compile programs using tantivy with the right compilation flags. In the end, I wonder if it wouldn't be less tricky for users to just have an sse3 feature flag and prevent them from compiling with a target-cpu that does not handle SSE3... But I believe there is no way to return a clean-looking error message today.

After reading those concerns, what do you think would be the best solution here?

BurntSushi commented 6 years ago

Let me check that I understand the philosophy behind the best practice for SIMD in Rust: the compilation of those instructions is independent of the target-cpu compilation flag. Whatever the target-cpu, the instructions I requested will be compiled as is. The idea is to produce binaries that embed optimized code using SIMD instructions, along with some scalar polyfill implementation. At runtime, the best implementation can then be selected after sniffing the features available on the CPU.

You got it!

In my case, the different implementations of the bitpackers create formats that are incompatible with one another. That's because the bitpacking happens at the register level. For instance, when bitpacking over 4 bits using SSE3, the first byte will contain integer 0 and integer 5 (instead of integer 0 and integer 1).

Oh... Yikes. That's a noodle scratcher. This does seem orthogonal to runtime dispatch to me though, no?

I do appreciate that this means you can't just expose one set of methods and have the crate "pick" the right one. You'll need to give callers a choice. But with runtime dispatch, you can give callers every choice available. e.g.,

// always succeeds
fn create_scalar_compressor() -> ScalarCompressor { /* ... */ }

// only succeeds when the current CPU supports SSSE3
fn create_ssse3_compressor() -> Option<SSSE3Compressor> {
    if is_x86_feature_detected!("ssse3") { Some(/* ... */) } else { None }
}

// only succeeds when the current CPU supports AVX2
fn create_avx2_compressor() -> Option<AVX2Compressor> {
    if is_x86_feature_detected!("avx2") { Some(/* ... */) } else { None }
}

The corresponding types you get back would then have the methods suitable for doing compression (or decompression too, whatever).

It is then up to the caller to figure out which one to use. The caller should probably return an error if its CPU doesn't have the appropriate capabilities to read the compressed data; e.g., creating an index on an AVX2 CPU and then trying to read it on a CPU that doesn't support AVX2 should produce an error. It seems likely that you want/need this even if you stick with compile-time options. Otherwise, if I run a tantivy binary that happened to be compiled with different options and end up sacrificing the integrity of my data, that's really bad.
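One way to sketch that check (all names hypothetical, not this crate's API): record which bitpacker format the data was written with, and fail early when the running CPU cannot decode it, instead of silently producing garbage.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BitpackerFormat {
    Scalar,
    Ssse3,
}

// Returns an error if the running CPU cannot decode data written
// in the given format.
fn check_readable(format: BitpackerFormat) -> Result<(), String> {
    match format {
        // The scalar format is readable everywhere.
        BitpackerFormat::Scalar => Ok(()),
        BitpackerFormat::Ssse3 => {
            #[cfg(target_arch = "x86_64")]
            {
                if is_x86_feature_detected!("ssse3") {
                    return Ok(());
                }
            }
            Err("data was written with the SSSE3 bitpacker, \
                 which this CPU cannot decode"
                .to_string())
        }
    }
}
```

An index would store the format tag in its header and call this check when it is opened.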

Note that the ssse3 and avx2 types are also platform specific and should only be available on x86. You have some flexibility in how you deal with this, but the simplest approach is probably to promote the platform specificity into your own public API. You could technically abstract over it (and I'd then rename the types so that they don't include ISA extension names).
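Promoting the platform specificity into the API could look like this hypothetical sketch (not the crate's actual types): the SIMD type simply does not exist on other architectures, and its constructor performs the runtime check.

```rust
// Available on every platform.
pub struct ScalarBitPacker;

// Only exists on x86-64; absent from the API elsewhere.
#[cfg(target_arch = "x86_64")]
pub struct Ssse3BitPacker;

#[cfg(target_arch = "x86_64")]
impl Ssse3BitPacker {
    // None when the running CPU lacks SSSE3, so callers must
    // explicitly fall back to ScalarBitPacker.
    pub fn new() -> Option<Self> {
        if is_x86_feature_detected!("ssse3") {
            Some(Ssse3BitPacker)
        } else {
            None
        }
    }
}
```

Callers on non-x86 platforms then cannot even name the SSSE3 type, which makes the portability story explicit at compile time.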

(I probably still might not expose a trait for this since I try to keep APIs small, but I can definitely see the motivation given multiple distinct alternatives. And I could see callers would want to use the trait. You could also declare that the caller should define their own trait.)

Another thing that concerns me is that many Rustaceans are not aware of the target-cpu flag. In practice it is typically not a game changer... except that when one uses SIMD instructions, compiling for a target-cpu that does not handle those SIMD instructions spectacularly destroys performance. I can only hope that people read the README and compile programs using tantivy with the right compilation flags.

In the end, I wonder if it wouldn't be less tricky for users to just have an sse3 feature flag and prevent them from compiling with a target-cpu that does not handle SSE3... But I believe there is no way to return a clean-looking error message today.

Yeah this is exactly why I think people should be using runtime feature detection by default unless there is a good reason not to do it. With runtime feature detection, you shouldn't ever need to worry about compile time flags. This impacts every dependent of tantivy as well. For example, if I build an application that uses tantivy internally and start distributing binaries on GitHub, it would be awesome if those binaries took advantage of SIMD if it were available automatically. Doing this for compile time feature configuration is a huge burden; you need to ship a different binary for each permutation of target feature you want to support. Doing this for runtime feature detection is free: as the application writer, I don't even need to be aware of the fact that tantivy uses CPU specific features. It will Just Work. That is an enormous benefit, and IMO, it is worth doing almost everything possible to get it.

BurntSushi commented 6 years ago

Also, I think the fact that the different methods here produce distinct incompatible bitpackings should be documented in huge bold letters. I didn't realize that until you told me. :-)

BurntSushi commented 6 years ago

@fulmicoton I will also say this: it may be the case that I wind up needing to use a crate like this, and if you'd like to defer the work to address this issue, then I'd probably be happy to take it on assuming you're comfortable with major refactorings. I don't know when it will happen (it could be a year or more, or it could be a month, I have no clue).

fulmicoton commented 6 years ago

@BurntSushi The Option<> seems like the right thing to do.

Just to make sure of what should be done, let me be very specific here:

With runtime feature detection, you shouldn't ever need to worry about compile time flags... it would be awesome if those binaries took advantage of SIMD if it were available automatically.

Maybe I am misunderstanding your point, but unfortunately, in my experience it does not work that way. Right now, if you use SIMD instructions "dynamically" in a binary that was not compiled with the right compilation flag, your performance will suck badly. For instance, the performance of cargo bench on this crate is terrible if you omit the target-cpu flag. I read somewhere that this is because the presence of SIMD instructions prevents inlining.

I will also say this: it may be the case that I wind up needing to use a crate like this, and if you'd like to defer the work to address this issue, then I'd probably be happy to take it on assuming you're comfortable with major refactorings. I don't know when it will happen (it could be a year or more, or it could be a month, I have no clue).

I am open to this refactoring and I would love it if you could take it over. I may have some bandwidth at the end of the month, but I just changed jobs and it might require some paperwork before I can resume working on open-source projects. Let's do it this way: ping me when you feel you will have some time to work on this, so that I can assign the ticket to you. If I start working on it earlier, I will assign it to myself.

BurntSushi commented 6 years ago

Maybe I am misunderstanding your point, but unfortunately, in my experience it does not work that way. Right now, if you use SIMD instructions "dynamically" in a binary that was not compiled with the right compilation flag, your performance will suck badly. For instance, the performance of cargo bench on this crate is terrible if you omit the target-cpu flag. I read somewhere that this is because the presence of SIMD instructions prevents inlining.

Oh I see, yeah, this is definitely not the intended behavior! I ported the regex crate over to runtime feature detection and everything works as expected. The most likely reason why function inlining isn't working for you is because of target feature mismatches. If you read the dynamic CPU feature detection section in the docs, you'll see an example of how to do it the right way, which requires use of the #[target_feature] attribute.
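A minimal sketch of that pattern (assuming x86-64; not this crate's code): the hot function itself carries #[target_feature], so the intrinsics it uses can be inlined into it even when the binary is built without any target-cpu flag.

```rust
#[cfg(target_arch = "x86_64")]
mod simd {
    use std::arch::x86_64::*;

    // Because the whole function is annotated with #[target_feature],
    // LLVM compiles its body with SSE2 enabled and can inline the
    // intrinsics into it. Calling intrinsics from an unannotated
    // function instead forces out-of-line calls and ruins performance.
    #[target_feature(enable = "sse2")]
    pub unsafe fn add4(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        let mut out = [0i32; 4];
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_add_epi32(va, vb));
        out
    }
}

#[cfg(target_arch = "x86_64")]
fn add4_dispatch(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    // SSE2 is part of the x86-64 baseline, so this always succeeds,
    // but the runtime check shows the dispatch shape.
    assert!(is_x86_feature_detected!("sse2"));
    unsafe { simd::add4(a, b) }
}

// Portable scalar fallback for other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn add4_dispatch(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}
```

The key point for the inlining problem above is the attribute on `add4`: the feature is enabled per-function at compile time, while the decision to call it is made at runtime.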

Other links:

BurntSushi commented 6 years ago

I am open to this refactoring and I would love it if you could take it over. I may have some bandwidth at the end of the month, but I just changed jobs and it might require some paperwork before I can resume working on open-source projects. Let's do it this way: ping me when you feel you will have some time to work on this, so that I can assign the ticket to you. If I start working on it earlier, I will assign it to myself.

Sounds good. I have no timeline, but just wanted to put this out there!

fulmicoton commented 6 years ago

If you read the dynamic CPU feature detection section in the docs, you'll see an example of how to do it the right way, which requires use of the #[target_feature] attribute.

stdsimd#401 is exactly the problem I have. I will try and see if #[target_feature] helps! Thank you so much.

fulmicoton commented 6 years ago

(@BurntSushi I quickly tested it and can confirm that this solved my perf problem!)

fulmicoton commented 6 years ago

I think I should be able to take care of this issue this week. If that's OK, I'll send you a CR, @BurntSushi.

BurntSushi commented 6 years ago

Sounds good! No rush at all. :-)

fulmicoton commented 6 years ago

The PR is available here: https://github.com/tantivy-search/bitpacking/pull/5

Can you have a look at it, @BurntSushi?

It seems I cannot mark you as a reviewer because you don't belong to the tantivy-search GitHub organization. (I'd gladly add you to it if you want.)