twmb / franz-go

franz-go contains a feature complete, pure Go library for interacting with Kafka from 0.8.0 through 3.7+. Producing, consuming, transacting, administrating, etc.
BSD 3-Clause "New" or "Revised" License
1.78k stars 182 forks source link

fix setting lz4 compression levels #781

Closed asg0451 closed 2 months ago

asg0451 commented 2 months ago

resolves #778

twmb commented 2 months ago

Thanks!

asg0451 commented 2 months ago

@twmb am i good to merge here?

twmb commented 2 months ago

(deleted prior comments where I typo'd twice and you pointed that out)

I want to get this in the next patch release but can't figure out why ~#785~ ~#786~ #787 is failing (which is important to get in, probably the most important bug at the moment)

asg0451 commented 2 months ago

gotcha. not sure if i really have the bandwidth to debug that at the moment ;)

btw if i may ask, do these new race-y bugs affect the integration i just merged into cockroachdb? will we need to upgrade to fix this (once you fix it of course) before it makes it into a new release? we produce concurrently with max buffered records set, not bytes.

(posted in the wrong pr previously whoops)

twmb commented 2 months ago

The data race is largely non impactful (it's actually existed for a really long time, it's really hard to trigger, and at the architecture level, you're not going to see a partial write).

The #787 is worrying, but it only affects people that use MaxBufferedBytes. I don't think you do, it's a fairly new option and I don't expect most people to use it.

asg0451 commented 2 months ago

(if you want to move this discussion to another venue that's fine, lmk)

at the architecture level, you're not going to see a partial write

can you expand on this? what would the result be both downstream and in the running process

twmb commented 2 months ago

The data race is a read/write problem on a boolean. The standard problem with read/write races is that the reader experiences a partial read -- e.g. I'm writing "foobar" and the underlying data says "hacked", the reader reads "fooked" (half of foobar, half of hacked). No CPU architecture actually does a partial write of a boolean sized field.

The other problem is that compilers and CPUs can reorder where the read happens, but Go uses seqcst for any atomic operation, and there are atomic operations before and after the reading of this bool. So, this bool isn't being reordered to a place that it really matters.

The last worry here is that what this field is for isn't working properly. This field is meant to guard failing records if we don't know it's safe. There's a related open issue about this (not the bug, but "why was this hanging") where you can see what safe is, here: https://github.com/twmb/franz-go/issues/769#issuecomment-2235707417

So, if a race happens and what the bool is guarding isn't effectively guarded, what you'll see in your application is OutOfOrderSequenceNumber errors and you'll need to restart your client. This chunk of code has existed for 3yr and so far nobody has reported unexpected OOOSN errors.

asg0451 commented 2 months ago

gotcha. thanks for the context and the responsiveness!