Closed: nivedita76 closed this issue 4 years ago
While it's awesome that you provide input on how we could do this, ZSTD-on-ZFS has left the design stage by now, and with the current design there isn't any benefit to using in-kernel zstd...
No worries, was just letting you know the option is now available :)
On Fri, Aug 28, 2020 at 11:22:21AM -0700, Kjeld Schouten-Lebbing wrote:

@terrelln ZSTD has already been implemented; there are no plans for using in-kernel zstd, and talks about this concluded months ago... Anyhow: in essence this isn't really about the ZSTD version, it's about ZFS. ZSTD does checksumming, and even compatible versions of ZSTD might produce slightly different checksums, which we need to account for. ZSTD also needs to support a lot of different platforms, both Linux- and FreeBSD-based. Having to rely on varying versions of ZSTD in the kernel overcomplicates things to the n'th degree. The current implementation of zstd on zfs is quite solid and vetted just fine, and this bug was fixed already without much issue. While it's awesome that you provide input on how we could do this, ZSTD-on-ZFS has left the design stage by now, and with the current design there isn't any benefit to using in-kernel zstd...

There is at least one benefit: getting rid of duplication. zstd adds about 0.5 MB to the ZFS kernel build. Thanks.
You know what's a big disadvantage? The in-kernel zstd variant is slower, and since it's missing the custom allocator feature it is unusable and will not work. We wrote a special custom allocator which increased the overall performance multiple times. Without it, it will not pass the ZFS tests, since they will all time out for performance reasons. So there was never an option to use the in-kernel variant.
@BrainSlayer thanks for pointing out the main argument I was too stupid to fully grasp/explain! 👍
@Ornias1993 don't blame yourself. I'm more wondering why people request such things without doing a little deeper research into what they are talking about. If the kernel at some point in the future provides all the features, it might be an option for future versions, but even then only a benchmark will show whether it's worth following up on this.
Good point, but even in that case the whole host of other issues still needs to be designed around... Not saying it will never happen, but not anytime soon(tm) ;)
Sorry, I wasn't intending to come here and say that OpenZFS for Linux should have used the version of zstd in the kernel. Definitely, given the state of zstd in the kernel, bundling upstream zstd is the better choice. Not to mention the complexity of using two incompatible versions of zstd, one for Linux and one for BSDs.
I wrote the version in the kernel before upstream zstd was ready to be used as-is, and was too new to the project to know all the changes necessary to get it there. Now, as you've proven with OpenZFS, zstd is ready to be used as-is with either ZSTD_customMem or ZSTD_initStatic*().
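For reference, here is a minimal userspace sketch of those two hooks (the callback names, the level, and the sizes are made up for illustration; both APIs live behind ZSTD_STATIC_LINKING_ONLY):

```c
/* Sketch only: plugging a custom allocator into zstd via ZSTD_customMem,
 * or handing zstd a caller-provided buffer via ZSTD_initStaticCCtx(). */
#define ZSTD_STATIC_LINKING_ONLY
#include <stdlib.h>
#include <zstd.h>

/* Hypothetical allocator callbacks; a real consumer would route these
 * into its own memory pool. */
static void *my_alloc(void *opaque, size_t size) { (void)opaque; return malloc(size); }
static void  my_free(void *opaque, void *addr)   { (void)opaque; free(addr); }

int main(void)
{
    /* Option 1: context allocated through the custom allocator. */
    ZSTD_customMem mem = { my_alloc, my_free, NULL };
    ZSTD_CCtx *cctx1 = ZSTD_createCCtx_advanced(mem);

    /* Option 2: context placed entirely inside caller-provided memory. */
    size_t wksp_size = ZSTD_estimateCCtxSize(3);   /* level 3 */
    void *wksp = malloc(wksp_size);
    ZSTD_CCtx *cctx2 = ZSTD_initStaticCCtx(wksp, wksp_size);

    /* ... compress with cctx1 / cctx2 here ... */

    ZSTD_freeCCtx(cctx1);
    free(wksp);          /* cctx2 lives inside wksp; no separate free */
    (void)cctx2;
    return 0;
}
```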
You know what's a big disadvantage? The in-kernel zstd variant is slower, and since it's missing the custom allocator feature it is unusable and will not work. We wrote a special custom allocator which increased the overall performance multiple times. Without it, it will not pass the ZFS tests, since they will all time out for performance reasons. So there was never an option to use the in-kernel variant.
Just curious, what are you comparing against? Are you comparing against allocating a context for every (de)compression call? I would certainly believe that is several times slower. The custom allocator approach seems to be similar to the approach that btrfs has taken of caching workspaces, for example, and I would expect similar performance.
Looking at the code in zfs_zstd.c, you may be able to gain a small amount of compression performance by caching the ZSTD_CCtx objects. When reusing a ZSTD_CCtx it doesn't need to zero its tables, saving a ~100KB memset at level 3. This does introduce the complexity of caching the ZSTD_CCtx objects, so maybe you've already considered this and ruled it out due to the architectural complexity.
When I update zstd in the kernel I will be updating btrfs to reuse contexts instead of just workspace memory. At that point I will carefully measure the speed difference, but I would estimate a gain of no more than 10% at low levels, probably more like 3-5%.
If the kernel at some point in the future provides all the features, it might be an option for future versions, but even then only a benchmark will show whether it's worth following up on this.
I'm working on porting zstd-1.4.6 (next version) as-is in the kernel, and making it trivial to keep up to date. It will be fully featured and be using upstream directly, like OpenZFS. So it should have identical performance. There would be no good reason to switch, given bundling zstd already works, and probably some extra complexity, but the option will be there if needed.
If there is anything that the OpenZFS project needs/wants from zstd please let me know / open an issue.
No, we don't allocate a new context on decompression or compression. The allocator creates a thread-safe memory allocation pool which releases objects by itself when they aren't being used, and caches allocations in its own way. Decompression isn't a big issue since the decompression context is small, but we cache it too for performance reasons. The compression context is the issue, since it can be very big: with a high compression level (19) it's about 80 MB per thread, and ZFS works multithreaded. On a standard 8-core system, 32 threads are running at the same time. This is a problem for allocation (allocating big blocks is just slow, even with standard compression settings). The slab cache doesn't help either: first, the maximum allocatable memory is limited in the ZFS implementation, and even if we unlock that limit it's still slower than our own memory pool approach. This memory pool is a very specialized implementation made just for zstd and shows the best performance of all the solutions. The approach is basically reusing contexts whenever possible and avoiding allocations at normal run time. So yeah: a standard allocation pool approach, just faster than any off-the-shelf solution we tested.

But anyway, as soon as you finish your upstream work in the kernel I will review it and also do a performance comparison. Consider that our zstd implementation is statically compiled as a single combined source file with explicit -O3, like the original library does; this allows the compiler to optimize the code much better than any modular source variant. Compiling a full-featured zstd into the kernel is also not possible, since some functions of zstd have a stack usage of > 20 KB. This is not compatible with kernel code. We don't use these functions in our code, but this needs to be considered by you.
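As a rough illustration of the idea being described (this is not the actual mempool in zfs_zstd.c; it uses userspace pthreads instead of kernel primitives, and the pool size is arbitrary), a thread-safe pool that reuses contexts and avoids allocations in steady state could look like this:

```c
/* Sketch only: cache ZSTD_CCtx objects so steady-state compression
 * rarely touches the allocator. */
#include <pthread.h>
#include <stdlib.h>
#include <zstd.h>

#define POOL_SIZE 8   /* arbitrary for this sketch */

static ZSTD_CCtx *pool[POOL_SIZE];
static int pool_count;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static ZSTD_CCtx *cctx_get(void)
{
    ZSTD_CCtx *c = NULL;
    pthread_mutex_lock(&pool_lock);
    if (pool_count > 0)
        c = pool[--pool_count];        /* reuse a cached context */
    pthread_mutex_unlock(&pool_lock);
    return c ? c : ZSTD_createCCtx();  /* fall back to a fresh allocation */
}

static void cctx_put(ZSTD_CCtx *c)
{
    pthread_mutex_lock(&pool_lock);
    if (pool_count < POOL_SIZE) {
        pool[pool_count++] = c;        /* keep it for the next caller */
        c = NULL;
    }
    pthread_mutex_unlock(&pool_lock);
    if (c)
        ZSTD_freeCCtx(c);              /* pool full: release the extra context */
}
```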
@BrainSlayer good note on the stack size, it was one of the few aspects of zstd we didn't really like (even though we don't use said code paths)
@terrelln do you want an issue for that on the zstd project?
@Ornias1993 the stack issue can be fixed, but usually that decreases performance. But maybe there is also a better solution for this Huffman table macro hell on the local stack.
But maybe there is also a better solution for this Huffman table macro hell on the local stack.
That's why I'm asking if he wants an issue for it ;)
No, we don't allocate a new context on decompression or compression.
I know that you reuse the context memory in your allocator. I mean the creation of the zstd context itself on this line https://github.com/openzfs/zfs/blob/abe4fbfd01c2440813c9f31ca6727473e22d0039/module/zstd/zfs_zstd.c#L365
If you cache the ZSTD_CCtx object itself, and reuse it, you avoid a ~100KB memset() at level 3 inside of zstd on subsequent uses of the ZSTD_CCtx with the same parameters. This is a much smaller deal than the allocation, because as you said allocation is costly. But I suspect it would show up in compression performance, and cost probably around 3-5% of compression CPU. I realize I'm assuming 128KB blocks with that performance assessment, but I'm not sure that is accurate for ZFS. The larger the block size, the less this will matter.
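As a minimal userspace sketch of that suggestion (not the OpenZFS code; the blocks and the level below are made up), reusing one long-lived context across calls looks like this:

```c
/* Sketch only: one ZSTD_CCtx created once and reused for every block,
 * so zstd can skip re-zeroing its match tables between calls. */
#include <stdio.h>
#include <string.h>
#include <zstd.h>

int main(void)
{
    ZSTD_CCtx *cctx = ZSTD_createCCtx();   /* created once, reused below */
    const char *blocks[] = { "first block", "second block", "third block" };
    char dst[1024];

    for (size_t i = 0; i < 3; i++) {
        size_t csize = ZSTD_compressCCtx(cctx, dst, sizeof(dst),
                                         blocks[i], strlen(blocks[i]), 3);
        if (ZSTD_isError(csize))
            fprintf(stderr, "block %zu failed: %s\n", i, ZSTD_getErrorName(csize));
        else
            printf("block %zu: %zu -> %zu bytes\n", i, strlen(blocks[i]), csize);
    }
    ZSTD_freeCCtx(cctx);
    return 0;
}
```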
Consider that our zstd implementation is statically compiled as a single combined source file with explicit -O3, like the original library does; this allows the compiler to optimize the code much better than any modular source variant.
The Linux kernel also compiles zstd with -O3. I guess I don't quite know what you are saying here. Zstd doesn't have any cross-translation-unit function calls in any of its hot loops. In fact zstd doesn't have any hot function calls in any of its hot loops; everything should be inlined (or it is cold). So there shouldn't be any benefit from combining multiple .c files into one, or from LTO.
Compiling a full-featured zstd into the kernel is also not possible, since some functions of zstd have a stack usage of > 20 KB. This is not compatible with kernel code.
These functions aren't needed by zstd. There are a bunch of functions in huf_compress.c, huf_decompress.c, fse_compress.c, and fse_decompress.c that aren't used by zstd at all. They are present because those files come from the FiniteStateEntropy project. But this is a problem for the kernel, because it will raise build warnings even though the functions are unused. So in https://github.com/facebook/zstd/pull/2289 I am adding ZSTD_NO_UNUSED_FUNCTIONS, which hides these functions. You'll be able to define that in the next zstd version to hide them. After that macro is defined, zstd has no functions that use more than 2KB of stack space.
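Assuming the macro ships as described in that PR, using it should amount to defining it when the bundled zstd sources are compiled (shown here as a preprocessor define; in practice it would typically be passed on the compiler command line):

```c
/* Define before compiling the bundled zstd .c files (equivalent to adding
 * -DZSTD_NO_UNUSED_FUNCTIONS to CFLAGS) so the unused FSE/HUF entry points,
 * and their large stack frames, are compiled out. Requires a zstd release
 * containing facebook/zstd PR #2289. */
#define ZSTD_NO_UNUSED_FUNCTIONS
```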
In that PR I have also added a test that measures the actual stack high water mark for zstd (in the existing kernel use cases). After I reduced the compression stack usage by 1KB in that PR, I measured that compression needs ~2KB of stack space and decompression needs ~1KB of stack space. This could be reduced further with some further work.
@Ornias1993 if you want to open an issue that would be great. It would be a good place to make sure these answers don't get lost, and a reminder to use the new build macros when we release the next zstd version. I would also like to reduce zstd's actual stack usage further.
@Ornias1993 the stack issue can be fixed, but usually that decreases performance. But maybe there is also a better solution for this Huffman table macro hell on the local stack.
To be clear, these are the functions that are unused by zstd entirely. We use "workspace" variants of these functions to avoid the large stack usage. We allocate some space in the ZSTD_CCtx and ZSTD_DCtx once and use it as scratch space, instead of the stack.
I will check whether reusing the whole context object will work. Normally the context keeps track of the current compression flow/state, so it's curious to me that it works without clearing it on reuse.
@BrainSlayer If it works out, give me a call and I'll do the performance tests again too :) 3-5% might be pretty interesting :)
@terrelln Awesome change, ZSTD_NO_UNUSED_FUNCTIONS is certainly something we wanna use for ZFS too 👍
@Ornias1993 if I do it, I will just reimplement the init functions to keep the mempool more generic. But I don't think that it works without clearing the context.
I will check whether reusing the whole context object will work. Normally the context keeps track of the current compression flow/state, so it's curious to me that it works without clearing it on reuse.
If I recall correctly, there is a ZSTD API for 'resetting' a CCtx so it can be reused. It only has to reset a few fields to make it safe, versus having to set up the entire CCtx.
It was one of the optimizations I was going to look at once I get the rest of the boot loader bits settled etc.
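The API Allan is presumably referring to is ZSTD_CCtx_reset(), which has been part of the stable zstd API since v1.4.0; a minimal sketch:

```c
/* Sketch only: clear an existing context for reuse without freeing it. */
#include <zstd.h>

void recycle_cctx(ZSTD_CCtx *cctx)
{
    /* Drop only the per-compression session state; parameters are kept.
     * ZSTD_reset_parameters and ZSTD_reset_session_and_parameters also exist. */
    ZSTD_CCtx_reset(cctx, ZSTD_reset_session_only);
}
```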
I will check whether reusing the whole context object will work. Normally the context keeps track of the current compression flow/state, so it's curious to me that it works without clearing it on reuse.
For compression, we use tables of indices into the compressed data to find matches. These tables start with the value 0. After we compress data of, say, 5000 bytes, the maximum value is 4999. So the next compression call can "invalidate" entries below 5000 instead of re-zeroing the tables. Context reuse produces byte-identical outputs to single-use contexts; it is just an optimization.
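A toy illustration of that trick (this is not zstd's actual table code, just the idea of raising a validity threshold instead of zeroing the table):

```c
/* Sketch only: entries below window_low are treated as stale, so a new
 * compression call never has to memset the whole table. */
#define TABLE_SIZE 4096

static unsigned match_table[TABLE_SIZE]; /* indices into the input window */
static unsigned window_low;              /* entries below this are invalid */

static void start_new_compression(unsigned prev_input_size)
{
    /* Instead of memset(match_table, 0, sizeof(match_table)),
     * just raise the validity threshold. */
    window_low += prev_input_size;
}

static int is_valid_match(unsigned table_entry)
{
    return table_entry >= window_low;
}
```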
once I get the rest of the boot loader bits settled etc.
It's a bit off-topic on the off-topic, but how is the bootloader work actually going? :)
@Ornias1993 Bootloader? I know that FreeBSD uses its own bootloader. The grub patch I did works.
@BrainSlayer Why are you asking me? Go ask Allan, it's his work, not mine. Guess he is more than qualified to explain it way better than I understand it ;)
@Ornias1993 because you referenced a project which is new to me
@BrainSlayer: @Ornias1993 replied to @allanjude bringing it up:
It was one of the optimizations I was going to look at once I get the rest of the boot loader bits settled etc.
So I understood Allan is working on it, and even if I had no idea who he was, a quick look at his profile picture would have me guess this work to be BSD-related. :wink:
@mschilli87 alan was talking about a bootloader. I don't see how this is related to any optimization.
@BrainSlayer:
alan [sic!] was talking about a bootloader. I don't see how this is related to any optimization.
Allan was giving input on the 'reuse-context optimization', pointing you to the API part about resetting a CCtx, because he had this on his list of potential future enhancements. He just mentioned the bootloader work as a reason for not getting to it earlier, and @Ornias1993 went on a tangent there because he is interested in it (independent of the ZSTD-in-ZFS optimizations). He also clearly labelled his comment as off-topic. With a quick search you could find that this came up before on a PR you are probably very familiar with. :wink:
So no: the bootloader is not related to the optimization, other than @allanjude thinking about both and mentioning one when talking about the other, but @Ornias1993 was still interested and followed up. Then you seemed interested in / confused about that tangent and asked @Ornias1993 for the very details he had just inquired about from @allanjude, and I simply tried to point you to the context I thought you must have overlooked when reading through this thread.
I am sorry if I further contributed to your confusion, but as far as I understand you can safely ignore the bootloader topic unless you care about it independently of your interest in optimizing ZSTD on ZFS (like @allanjude and @Ornias1993 apparently do). :slightly_smiling_face:
@mschilli87 As always: you made the comment of the day, and my day has just started :)
@BrainSlayer Slight necro: did you look into reusing the context yet?
@Ornias1993 That's what the mempool is doing in the end. I have not found any code which causes a performance impact in the current way we are doing it. We are already reusing allocated objects; we do not clear the memory on reuse; there is no memset done.
@BrainSlayer Thanks for the clarity, I was making a shortlist in my head of ZSTD to-dos; good to know this is definitely considered solved.
Describe the problem you're observing
ZSTD is already in the Linux kernel, and it produces multiple definition errors if ZSTD is built into the kernel and ZFS is also compiled as built-in.