google / brotli

Brotli compression format
MIT License
13.49k stars 1.24k forks source link

Ideas for API improvements #384

Closed nemequ closed 7 years ago

nemequ commented 8 years ago

I recently rewrote Squash's brotli extension for the new API. IMHO the API is pretty nice, and I love that it's C not C++. That said, I do have a few ideas for improvements:

These are definitely not major issues, I just wanted to bring them up while changing the API is still an option.

nemequ commented 8 years ago

BTW, if you want I'd be happy to put together a PR for these, just tell me which one(s) you'd be okay with. Or, of course, just close the bug if you're not interested :)

eustas commented 8 years ago

Hi. Thanks for an offer, but I'm little bit hesitate, and also it will a little bit easier for me to fix it in internal repository and then publish update to github.

Ideas look great! I've planned to implement them next week (though it would require lot of courage and will to switch to snake case, I am old Java fan =).

nemequ commented 8 years ago

Hi. Thanks for an offer, but I'm little bit hesitate, and also it will a little bit easier for me to fix it in internal repository and then publish update to github.

Sounds good to me :). Let me know if you change your mind, though; I don't mind helping with this, especially considering how supportive you guys have been of Squash.

Ideas look great! I've planned to implement them next week (though it would require lot of courage and will to switch to snake case, I am old Java fan =).

Hah :). FWIW, the CamelCase thing is probably the least important of the ideas I mentioned; as long as you're consistent I don't think it's a big deal.

nemequ commented 8 years ago

Maybe I should create a new issue for this, but a related idea which may be helpful for your is to also have a callback for an allocation function of zeroed memory (i.e., calloc instead of malloc). The default implementation could just call malloc and memcpy, but calling calloc can be a bit faster since the OS often keeps around pages of pre-zeroed memory.

I'm not sure how much of a performance boost (if any) brotli would see, but I saw a pretty huge improvement in crush when I switched from malloc+memset to calloc… I think it's a bigger benefit for larger allocations because you don't have the cache hit from moving stuff into the CPU just to zero it.

eustas commented 8 years ago

Added a task to initial message.

nemequ commented 8 years ago

Added a few (4) more ideas to the list. Sorry to make this a bit of a moving target.

eustas commented 8 years ago

In the internal repo we already have conformant arrays. But it seems that this does not apply to streaming functions. Am I right?

nemequ commented 8 years ago

Depends on what you consider a streaming function, but I think you're correct.

If the size is a pointer, you can do something like

BROTLI_BOOL BrotliEncoderFinishStream(
    BrotliEncoderState* state, size_t* encoded_size, uint8_t encoded_buffer[*encoded_size]);

Which should work as expected. The length can even be an expression like *encoded_size * sizeof(uint64_t) if you want (you don't in this case, just wanted to point out that it's pretty flexible).

However, if you actually want to modify the array pointer itself (like you do for BrotliEncoderCompressStream, then you can't use a real array; it would have to be a uint8_t**. Something like

BROTLI_BOOL BrotliEncoderCompressStream(
    BrotliEncoderState* s, BrotliEncoderOperation op,
    size_t* restroct available_in, const uint8_t (* restrict next_in)[*available_in],
    size_t* restrict available_out, uint8_t (* restrict next_out)[*available_out],
    size_t* restrict total_out);

Is tempting, but it would break because an attempt to change the value of *next_in or *next_out would fail. So, assuming that's what you mean by the streaming functions, yeah, unfortunately that won't work :(.

Basically, anywhere in brotli you have a uint8_t** parameter, I don't think you'll be able to use an array. The good news is you can use restrict there, though…

BROTLI_BOOL BrotliEncoderCompressStream(
    BrotliEncoderState* s, BrotliEncoderOperation op,
    size_t* restrict available_in, const uint8_t * restrict * restrict next_in,
    size_t* restrict available_out, uint8_t * restrict * restrict next_out,
    size_t* restrict total_out);

Add in the fact that you'll really want to use BROTLI_ARRAY_PARAM/BROTLI_RESTRICT (or whatever you want to call it) macros, and I think it's pretty clear why so few people bother to properly use conformant array parameters and/or restrict.

jchampio commented 7 years ago

For funcions which return 1 on success and 0 on failure, as well as other booleans (like the is_last argument to BrotliEncoderWriteMetaBlock), please use a bool (or _Bool if you prefer) intsead of an int. Using bool helps make the code more readable and reduce documentation lookups.

A counter-argument: I don't think the bool concept really provides a stable, portable ABI in C.

I was experimenting with the new mod_brotli for the Apache web server, which makes use of the libbrotli shared library. Unfortunately, the module and the library disagreed on the definition of BROTLI_BOOL, because the module was compiled using C89 and the library was not. This led to a situation where the library returned "false" but the module interpreted that value as "true", which broke everything.

In my opinion, an ABI that changes depending on the C standard in use by the compiler is not going to be portable enough for widespread use. I agree with the idea that a boolean concept is more documentation-friendly than int, but the current implementation causes problems. (And I humbly submit that int is still the most "correct", idiomatic type for booleans in a portable C ABI.)

nemequ commented 7 years ago

These days, even MSVC supports stdbool.h (as of VS 2013). I'd prefer for Brotli to simply use the system bool everywhere and just require C99 (or at least the subset of it supported by VS 2013). Being stuck with C89 in 2016 is ridiculous.

C99 is pretty clear about boolean values; see § 7.16. true is defined as 1, false is defined as 0. Brotli defines BROTLI_FALSE as an enum value of 0 and BROTLI_TRUE as an enum value of !BROTLI_FALSE (which evaluates to 1), so the actual values shouldn't be an issue. That said, I'd probably change BROTLI_TRUE to just be 1 instead of !BROTLI_FALSE.

The _Bool type, OTOH, could be a problem. C99 specifies it as a type large enough to hold 0 or 1 (§ 6.2.5,2), so 1-byte is sufficient and that's what every implementation I'm aware of uses. However, most compilers use sizeof(int) for enums, and Brotli defines BROTLI_BOOL as an enum when in C89 mode. I'm guessing that's why you had a problem.

I agree with the idea that a boolean concept is more documentation-friendly than int, but the current implementation causes problems.

Agreed, implementing it as an enum is a bad idea.

(And I humbly submit that int is still the most "correct", idiomatic type for booleans in a portable C ABI.)

If you define "portable" as "C89+", then you're right. However, targeting C99 is portable, and bool is the idiomatic type for C99, which pretty much every non-toy compiler other than MSVC has supported for a very long time. Realistically, the only place _Bool/stdbool.h isn't going to be supported is people using old versions of MSVC. Everyone else supports them even in C89 mode.

sizeof(_Bool) on Windows is 1, so I would go with something like:

#if define(BROTLI_NO_STDBOOL) || (defined(_MSC_VER) && (_MSC_VER < 1800))
#  if !defined(bool)
#    define bool unsigned char
#  endif
#  if !defined(true)
#    define true 1
#  endif
#  if !defined(false)
#    define false 0
#  endif
#else
#include <stdbool.h>
#endif

Then you can get rid of all the BROTLI_BOOL cruft. Virtually every compiler/platform should work as expected. I'm sure there is some weird compiler somewhere which targets an sbscure DSP or something which won't work, but worst-case scenario people can explicitly define a few preprocesor symbols before including brotli.

Of course, if you're really attached to the BROTLI_BOOL idea, it would be easy enough to just #define BROTLI_BOOL unsigned char instead of using an enum. sizeof(_Bool) is 1 on Windows and Linux x86_64, and probably most other places, though I haven't checked.

bagder commented 7 years ago

As someone who'd like to use brotli as a library (which is why I made the libbrotli project that now looks like it will meet its maker soon), for a project that is C89/C90 compatible (curl) I would of course prefer a good old C89 compatible API so that my entire user base can have a chance of using all the brotli fun. But that of course also relies on everything else (at least in the decompressor lib) to be C89 as well, not just the API.

jimjag commented 7 years ago

Agreed. As a library, things should be as simple and as compatible and as PORTABLE as possible. It is not that much of a deal to stick w/ C89/C90 for this particular case, especially when the alternative breaks things horribly.

eustas commented 7 years ago

Hello @jchampio Please, could you describe toolchains you've used that caused "incompatible" bools. I'm going to modify BROTLI_BOOL definition in a way @nemequ proposed, but I'll need then to recheck that problem is gone.

Best regards, Eugene.

nemequ commented 7 years ago

As I mentioned, virtually everyone other than VS < 2013 supports _Bool (and stdbool.h), even in C89 mode. With what I suggested Brotli would still be completely usable in your projects.

Sticking to pure C89 hare isn't really about portability, it's about being technically C89 compliant. In practice the only place you're likely to run into a real portability problem is VS < 2013, and I suggested a solution for that which would work automatically.

There is a tiny group of people who really need pure C89. If you're using gcc, clang, msvc >= 12.0, icc, suncc, xlc, or really any other vaguely modern non-toy compiler, you are not in this group. This group consists of people using 20-year-old compilers, or targeting something exotic where only a specific compiler will work and that compiler only supports C89 (may be true for some DSPs or something). You're more likely to end up on a platform where endianness cannot be determined at compile-time than to have what I proposed be a real portability issue. However, even for that tiny group of people, what I proposed would only require defining a macro or two.

However, since it sounds like @eustas wants to stick with the BROTLI_BOOL thing, I'd suggest something like

#if !defined(BROTLI_BOOL)
#  if (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L)) || \
       (defined(__GNUC__) && (__GNUC__ >= 3)) || \
       (defined(_MSC_VER) && (_MSC_VER >= 1800))
#    include <stdbool.h>
#    define BROTLI_BOOL bool
#  else
#    define BROTLI_BOOL unsigned char
#  endif
#endif
#define BROTLI_TRUE 1
#define BROTLI_FALSE 0

Almost every compiler out there pretends to be either GCC or MSVC, and sufficiently recent versions of both are whitelisted as supporting stdbool.h, as are compilers in C99+ mode. Other compilers (mostly old MSVC) will fall back on using unsigned char, but for the rare cases where that wouldn't be the correct type people can just define BROTLI_BOOL to something which works before including Brotli.

Note that this would be quite easy to adapt to use bool, true, and false instead of creating a separate type for Brotli; you would just have to guard against redefining true and false. That would give you a much nicer API.

jchampio commented 7 years ago

@nemequ

If you define "portable" as "C89+", then you're right.

Sure. I'm biased, since most of the C projects I've participated in are explicitly C89-compatible.

I also define "portable" as "capable of having C++ bindings written for it", and bool has (or used to have?) issues there too. Hopefully we've gotten to the point where C++ and C toolchains targeting the same architecture agree on the size of bool, but IIRC, it used to be a problem... maybe someone with more recent experience can comment on this?

Being stuck with C89 in 2016 is ridiculous.

Unless you're serious about being portable to a C89 toolchain subset, yes. ;) I mean, I hate old toolchains just as much as the next guy, but this is the game you play when authoring a C ABI for the masses. (And having a C89-subset ABI doesn't necessarily mean that you have to write your implementation in C89.)

virtually everyone other than VS < 2013

I'm veering off-topic, but I disagree with your assertion that the users of old MSVC versions are "a tiny group of people". MSVC9 and MSVC11 were pretty popular the last time I checked (which, to be fair, was a year ago... YMMV).

Anyway: if you're serious about using the BROTLI_BOOL thing, and assuming that you've made a decision only to support architectures where the most popular compilers agree that

then having the ability to override the definition of BROTLI_BOOL should help bail out the remainder.

jchampio commented 7 years ago

@eustas

Please, could you describe toolchains you've used that caused "incompatible" bools.

Sure -- I'm currently using GCC 5.4.0 on Ubuntu 16.04 (64-bit). Compile the libbrotli shared library as-is (with a modern C standard). For the client, compile and link against the library with the -std=c89 flag.

nemequ commented 7 years ago

Sure. I'm biased, since most of the C projects I've participated in are explicitly C89-compatible.

Technically brotli already isn't C89-only. It uses the types from stdint.h (i.e., uint8_t), except for MSVC < 16 where it defines those types itself, which is almost exactly (only the version of MSVC is different) what I'm suggesting it do for stdbool.h.

To make matters worse, the exact-size types ((u)intN_t) technically aren't even required by C99 (see § 7.18.1.1). Only the minimum-width and fast minimum-width types ((u)int_leastN_t and (u)int_fastN_t) are required. That said, everyone supports them; AFAIK they're really just option to support CHAR_BIT != 8, and I doubt Brotli would work if CHAR_BIT != 8 anyways.

I also define "portable" as "capable of having C++ bindings written for it", and bool has (or used to have?) issues there too. Hopefully we've gotten to the point where C++ and C toolchains targeting the same architecture agree on the size of bool, but IIRC, it used to be a problem... maybe someone with more recent experience can comment on this?

I'm not really a C++ person, but IIRC they have a bool type which is (can be or must be, not sure which) implemented with an enum. I'm not sure whether it is still a problem, but to get around it you can use _Bool (which is built-in, doesn't require a header) instead of bool. Maybe it would be safer to define BROTLI_BOOL as _Bool instead of bool, in which case you could just skip including stdbool.h altogether.

Being stuck with C89 in 2016 is ridiculous.

Unless you're serious about being portable to a C89 toolchain subset, yes. ;) I mean, I hate old toolchains just as much as the next guy, but this is the game you play when authoring a C ABI for the masses. (And having a C89-subset ABI doesn't necessarily mean that you have to write your implementation in C89.)

I don't care much about being portable to a generic, strictly C89 compiler. I care about being portable to the toolchains people actually use, even relatively old ones. C89 is mainly useful as a lowest-common-denominator, and if we want to use features outside C89 we should do so in a portable way. The tiny areas of Brotli (like booleans) which aren't strictly C89-compliant can still be portable while providing a good API for people who have better compilers.

virtually everyone other than VS < 2013

I'm veering off-topic, but I disagree with your assertion that the users of old MSVC versions are "a tiny group of people". MSVC9 and MSVC11 were pretty popular the last time I checked (which, to be fair, was a year ago... YMMV).

You're right, I didn't mean to include old MSVC in that "tiny group". It's part of a separate group which can easily be special-cased (and, in every implementation I've suggested, is). There are four groups

The first group can be taken care of by simply using stdbool.h or _Bool. Virtually every compiler pretends to be either GCC or MSVC (by defining things like __GNUC__ and/or _MSC_VER). The major exception is older versions of MSVC…

The second group is easy to take care of by defining either BROTLI_BOOL, bool, or _Bool to a type compatible with what later versions of MSVC use for _Bool. I don't have a Windows box with VS to test this with, but mingw says sizeof(_Bool) == 1, so I think unsigned char should work.

What's left, the third and fourth groups, is what is extremely tiny. You're really just left with toy compilers and probably a few special-purpose compilers targeting weird architectures. TBH I'd be surprised if the rest of Brotli worked here without any effort, but with what I've been suggesting these platforms would still work as the boolean type would just be #defined as unsigned char.

Finally, you have the group of compilers not pretending to be GCC or MSVC on platforms where booleans use multiple bytes of storage. If you compile Brotli with the same compiler as your application then the boolean type will be unsigned char and you're done. If, however, Brotli is compiled with a different compiler all you would have to do is define the boolean type to something compatible before including Brotli.

That covers every platform I can think of, which pretty much checks the "portability" box as far as I'm concerned.


So, just in case the C++ issue @jchampio mentioned, how about this:

#if !defined(BROTLI_BOOL)
#  if (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L)) || \
      (defined(__GNUC__) && (__GNUC__ >= 3)) || \
      (defined(_MSC_VER) && (_MSC_VER >= 1800))
#    define BROTLI_BOOL _Bool
#  else
#    define BROTLI_BOOL unsigned char
#  endif
#endif
#define BROTLI_TRUE 1
#define BROTLI_FALSE 0

Another option would be to use the build system to determine the value of sizeof(_Bool) and write an appropriate type, but that would require running an executable compiled for the target architecture, which would be a problem for cross-compilers.

jimjag commented 7 years ago

Yeah, I'm just speaking myself as someone who is a consumer of the library (or, should I say, would be a consumer) for a popular web server which, for lowest-common-denominator reasons, has stayed w/ C89 semantics. So far, all testing has shown that the bool item is the only issue that hits us. Libraries, in general, should be agnostic as related to compiler differences between that used to build and the library and that used to build the application, or should have an ABI which resolves those issues in a portable fashion.

jchampio commented 7 years ago

Preface to my commentary: All I truly "need" from an upstream library is a stable API for me and a stable ABI for end users to link against. Choosing your own cost/benefit tradeoffs on how to implement that is, in the end, up to you; I'm just some guy on the Internet with an opinion, and I'll likely disappear after this issue. :smile: That said, I'll respond to @nemequ's last bit and then let you all duke out how you want to proceed...

I'm not really a C++ person, but IIRC they have a bool type which is (can be or must be, not sure which) implemented with an enum.

bool is a built-in type in C++; there's no enum "implementation" that I'm aware of.

to get around it you can use _Bool (which is built-in, doesn't require a header)

There is no _Bool type in C++. And including stdbool.h in C++ is not even guaranteed to provide a _Bool macro (it does in GCC, but as I understand it, that's an extension).

I want to reiterate that

  1. It's not just size you necessarily have to worry about. You want to make sure that the assembled code that deals with both types are equivalent, or you can bet that there will eventually be some bizarre corner case because of the mismatch.
  2. I don't know if bool-incompatibility between the two languages is a problem anymore. Modern compiler ABIs may have defined the two equivalently for the platforms you're targeting; I don't know.

Another option would be to use the build system to determine the value of sizeof(_Bool) and write an appropriate type, but that would require running an executable

If you're seriously considering this, I just have to know: what is so hate-worthy in the int-as-boolean idiom? :smile:

Anyway. I think I've said my piece and then some. As long as downstream doesn't break in weird corner cases, I'm happy.

eustas commented 7 years ago

Well, considering BROTLI_BOOL in API to be int in ABI we get both cleanliness and compatibility:

eustas commented 7 years ago

The only problem could be that result may have 4B+ values different from BROTLI_TRUE and BROTLI_FALSE, but brotli library swears not to return obscure values.

bagder commented 7 years ago

Any user of the API would still read the documentation for the function in which you of course explain what it can return. If it only ever returns 1 and 0, say so. It doesn't matter that the type is 32bit. That's not a problem to users.

nemequ commented 7 years ago

Sorry, forgot to reply to this for a while…

bool is a built-in type in C++; there's no enum "implementation" that I'm aware of.

I poked around the C++ spec a bit, and you're right. Based on what I read in the C++11 spec, as well as some posts I found around the internet, it seems like the way to be portable between C and C++ here is to use stdbool.h. Unlike C, C++ also doesn't specify (at least as far as I can tell) what the actual values for true and false are, so technically we really need to use true and false, though I would be very surprised if there is an implementation where they aren't 1 and 0 respectively.

So, I think the safest solution would be

#if !defined(BROTLI_BOOL)
#  if (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L)) || \
      (defined(__GNUC__) && (__GNUC__ >= 3)) || \
      (defined(_MSC_VER) && (_MSC_VER >= 1800)) || \
      (defined(__cplusplus))
#    if !defined(__cplusplus)
#      include <stdbool.h>
#    endif
#    define BROTLI_BOOL bool
#    define BROTLI_TRUE true
#    define BROTLI_FALSE false
#  else
#    define BROTLI_BOOL unsigned char
#    define BROTLI_TRUE 1
#    define BROTLI_FALSE 0
#  endif
#endif

If you're seriously considering this, I just have to know: what is so hate-worthy in the int-as-boolean idiom? :smile:

There isn't anything inherently wrong with it (other than the wasted space, but who cares about that?). The real problem is that it's not clear what it is; every time I see an int return value I have to look up the documentation to see what the possible values are. It's not just humans, either; static analyzers and optimizing compilers have no way of knowing what the possible return values are. Enums would solve this, but they conflict with the standard booleans from C99+.

However, all that is moot now. Unless I'm missing something, 1.0 has been released, which means the API and ABI are stable. Changing BROTLI_BOOL to an integer now would be a break. Brotli doesn't specify a C standard for compilation, so C99 or C11 are used (depending on the version of the compiler used), which means existing compiles of brotli should all use stdbool. Basically, what @jchampio was seeing when compiling in C89 mode would happen to everyone since the ABI would suddenly be returning sizeof(int) bytes instead of sizeof(bool).

jimjag commented 7 years ago

stdbool.h is C99, not C89.

bagder commented 7 years ago

Why bother with the ifdef maze? Why not make it #define BROTLI_BOOL unsigned char for all? (and frankly, it is probably even faster for most situations to make that at a plain int instead). And why the weirdo true and false defines? In C code we've used TRUE (1) and FALSE (0) for decades so they are familiar and known. Introducing other names for them are only making your code harder to read and weird.

nemequ commented 7 years ago

stdbool.h is C99, not C89.

Which has come up several times in this issue. What's your point?

jimjag commented 7 years ago

The point is that C89 compatibility is the point, as mentioned quite a few times in the above comments.

nemequ commented 7 years ago

Why bother with the ifdef maze? Why not make it #define BROTLI_BOOL unsigned char for all?

  • That would be incompatible with C99+ and C++ (not sure when they added bool, but I'm pretty sure it's in C++98…).
  • Static analyzers and optimizing compilers can't make certain assumptions, so they'll be slightly less useful.

That said, it's a better solution than defining it as int or something else with a size other than sizeof(bool).

And why the weirdo true and false defines? In C code we've used TRUE (1) and FALSE (0) for decades so they are familiar and known. Introducing other names for them are only making your code harder to read and weird.

This one I mostly agree with, but unfortunately the API is set already. Removing them would break it.

The only real reason I can think of is to avoid conflicts with other libraries who try to define true and false (TRUE and FALSE are not the standard names, true and false are) themselves. As you said, it's pretty common for people to define those macros, so if they include brotli first and don't guard their own defines with ifdefs things go bad quickly. Admittedly it's a pretty minor concern; like I said, I mostly agree with you on this one.

The point is that C89 compatibility is the point, as mentioned quite a few times in the above comments.

No, the point is portability and, to a lesser extent, compatibility with other code. C89 isn't really required for the former, and it's counterproductive to the latter.

Even if portability depended on C89, the code I've posted works with pure C89 so you shouldn't have a problem. It just also whitelists a few compilers known to support using stdbool.h (even in C89 mode). If a compiler isn't known to support that it falls back on using unsigned char (which is available in C89), which is actually ABI compatible with what C99+ uses (as opposed to int, which isn't) with every compiler I've tried. And, if that's not enough (because maybe sizeof(bool) != sizeof(unsigned char)) you can even define BROTLI_BOOL to whatever you want before including the file for those weird corner case platforms.

As for compatibility, these days most people are using C99+ or C++ (remember, both clang and gcc default to C11 now, and before that it was C99 for a long time). For those people it's useful for brotli's boolean type to be compatible with their standard boolean type. In other words, compatibility actually decreases if you just define BROTLI_BOOL to int.

Finally, as I've mentioned, changing the return type in C99+ mode is a backwards-incompatible ABI break. Maybe the brotli devs are comfortable with that, but I certainly wouldn't be if it were my project.

bagder commented 7 years ago

most people are using C99

I like it how you just dismiss our claims that we value and treasure C89 compatibility so I won't waste more time here.

nemequ commented 7 years ago

I'm not dismissing your claims at all. As I've repeatedly said, what I proposed is C89 compatible. A pure C89 compiler with zero support for anything else would simply use unsigned char.

What I am dismissing is the idea that being C89 compatible means you can't use an extension on a few compilers known to support it which allows the API to work better for most people without harming C89 compatibility in any way. If you take away all the ifdefs, on a pure C89 compiler you end up with

#define BROTLI_BOOL unsigned char
#define BROTLI_TRUE 1
#define BROTLI_FALSE 0

How is that not C89 compatible?