Fix data races in stbir_resize_extended_split by moving the call to s…

popizdeh commented 8 months ago

Thread sanitizer reports numerous data races, mostly when assigning the info struct inside stbir__update_info_from_resize. I'm also seeing visual corruption when running resizing on multiple threads. My guess would be torn reads and writes.

It looks like stbir__update_info_from_resize is not touching any of the per-slice data members which suggests it can be called only once instead of being called once per split. I've moved the call into stbir_build_samplers_with_splits as we need to call that function when setting up the multi-threaded slice resizing. The code can potentially be simpler if stbir__update_info_from_resize can be called before the call to stbir__perform_build. I don't understand the code enough to make this call.

jeffatrad commented 8 months ago

Yeah, this is slightly more complicated. Are you calling stbir_build_samplers_with_splits() back out, prior to the resizing? What should happen here is that if we need to do a reinit of the coefficients, then we need to fail, the call to stbir_resize_extended_split. So, if you update a parameter that requires resetting up the internal allocs, then it needs to fail. Nikola, can I see the the calls you make to stb_image_resize, so I can see if there is an order mismatch? You want to call stbir_build_samplers_with_splits as close to stbir_resize_extended_split as you can...

jeffatrad commented 8 months ago

Hmmm, it actually already does fail if you need to rebuild coeffs. I would expect a warning that we are setting the values to the same value when calling from multiple threads, but not sure where there would be any corruption. There might be two things going on here...

popizdeh commented 8 months ago

My code is roughly equivalent to this, real code is not starting threads for each resize.

stbir_resize_init(rawResizePtr, sourceBytes, width, height, 0,  destBytes, width, height, 0, layout, dataType);

stbir_set_edgemodes(rawResizePtr, STBIR_EDGE_CLAMP, STBIR_EDGE_CLAMP);
stbir_set_filters(rawResizePtr, STBIR_FILTER_CATMULLROM, STBIR_FILTER_CATMULLROM);
stbir_build_samplers_with_splits(rawResizePtr, numSpans);

for (int span = 0; span < numSpans; ++span)
    std::thread([span]{ stbir_resize_extended_split(rawResizePtr, span, 1); });

Here's more info on the races. Note that these are all 4 and 8 byte reads and writes that could be torn, so I'm not surprised there's corruption. What's more I'm compiling this with C++ and mere presence of data races is UB in C++, so compiler is free to reorder code in any way possible. I guess what I'm trying to say is that we probably need to focus on the architectural side of things instead of trying to figure out which race exactly causes corruption, they all need to be fixed. My early attempt of just slapping std::atomic on each variable didn't fix the issue.

stb_image_resize2.h:7152 Data race in stbir__update_info_from_resize(stbir__info, STBIR_RESIZE) at 0x1185b7000 info->input_data = resize->input_pixels; Two threads executed this at the same time!

stb_image_resize2.h:7153 Data race in stbir__update_info_from_resize(stbir__info, STBIR_RESIZE) at 0x1185b7000 info->input_stride_bytes = resize->input_stride_in_bytes; Two threads executed this at the same time!

stb_image_resize2.h:7154 Data race in stbir__update_info_from_resize(stbir__info, STBIR_RESIZE) at 0x1185b7000 info->output_stride_bytes = resize->output_stride_in_bytes;

The list goes on, it's also worth noting that most of these variables are read by the functions executed during resize (ie. stbir__decode_scanline).

nothings commented 8 months ago

I'm not super up-to-date on threads and data races and current terminology, but I think the point here is that there are two separate issues:

data races: since we probably don't want to introduce locking primitives into the library for cross-platform support reasons, this will probably just get an API change pushed out to the user to be responsible to do it once
read/write tearing: the impression I get from Jeff's comments is that every thread should reads a raced variable should be writing it first, and every thread that writes it should be writing the same value. ignoring the fact that data races are bad (because that is getting addressed in the first bullet point), then our expectation of the code is that every thread should write the same value, and write before it reads, and if in practice on current machines with current compilers that shouldn't cause corruption in practice, but we are seeing corruption, then it implies that our assumptions about what values are read/written--independent of data races--are wrong, i.e. there's a bug (or bugs) independent of the data races.

Jeff, please correct me if I'm wrong about this explanation, since I'm just guessing here.

popizdeh commented 8 months ago

Each thread is writing and reading the variable. It's writing when stbir__update_info_from_resize is called and it's reading a bit later from the guts of the resize. But with many threads there's no guarantees that writes happen before reads. Thread1 writes, then starts the resize and just as it's about to read the variable Thread2 starts its write...

jeffatrad commented 8 months ago

It's writing when stbir__update_info_from_resize is called and it's reading a bit later from the guts of the resize

Correct, but every thread does this - all writes will happen before any read, because every thread is doing them. It's still a data race warning and we'll fix that up - but will not change any behavior. C++ and UB are not relevant here.

Jeff, please correct me if I'm wrong about this explanation, since I'm just guessing here

Yes, that is correct. The data races can be fixed, but they will have no effect of the behavior of the code. So, if there is corruption, then that would be some other problem...

nothings commented 8 months ago

Each thread is writing and reading the variable. It's writing when stbir__update_info_from_resize is called and it's reading a bit later from the guts of the resize. But with many threads there's no guarantees that writes happen before reads. Thread1 writes, then starts the resize and just as it's about to read the variable Thread2 starts its write...

Under what condition should Thread1 reading the variable while Thread2 has "partly" written it cause corruption, if Thread2 is writing the exact same value that's already in the variable (it's the same value because that's the value Thread1 wrote)? Because that's what the scenario we've been describing is.

popizdeh commented 8 months ago

The fact that both threads are writing the same value is not relevant, what's relevant is that the write itself is not atomic. This can result in a torn write that another thread observes when it reads the value. I'm only speculating on where the corruption comes from, but consider info->input_data assignment, that's a 8 byte value, so we can think of writing to this variable as two 32 bit assignments. If the second thread reads the value in the middle of those two writes it's going to get either the low 32 or high 32 bits of the address to input data. You can probably see how this could cause corruption as it's going to be reading data from who knows where for that brief moment in time.

nothings / stb

Fix data races in stbir_resize_extended_split by moving the call to s… #1569