Open popizdeh opened 8 months ago
Yeah, this is slightly more complicated. Are you calling stbir_build_samplers_with_splits() back out, prior to the resizing? What should happen here is that if we need to do a reinit of the coefficients, then we need to fail, the call to stbir_resize_extended_split. So, if you update a parameter that requires resetting up the internal allocs, then it needs to fail. Nikola, can I see the the calls you make to stb_image_resize, so I can see if there is an order mismatch? You want to call stbir_build_samplers_with_splits as close to stbir_resize_extended_split as you can...
Hmmm, it actually already does fail if you need to rebuild coeffs. I would expect a warning that we are setting the values to the same value when calling from multiple threads, but not sure where there would be any corruption. There might be two things going on here...
My code is roughly equivalent to this, real code is not starting threads for each resize.
stbir_resize_init(rawResizePtr, sourceBytes, width, height, 0, destBytes, width, height, 0, layout, dataType);
stbir_set_edgemodes(rawResizePtr, STBIR_EDGE_CLAMP, STBIR_EDGE_CLAMP);
stbir_set_filters(rawResizePtr, STBIR_FILTER_CATMULLROM, STBIR_FILTER_CATMULLROM);
stbir_build_samplers_with_splits(rawResizePtr, numSpans);
for (int span = 0; span < numSpans; ++span)
std::thread([span]{ stbir_resize_extended_split(rawResizePtr, span, 1); });
Here's more info on the races. Note that these are all 4 and 8 byte reads and writes that could be torn, so I'm not surprised there's corruption. What's more I'm compiling this with C++ and mere presence of data races is UB in C++, so compiler is free to reorder code in any way possible. I guess what I'm trying to say is that we probably need to focus on the architectural side of things instead of trying to figure out which race exactly causes corruption, they all need to be fixed. My early attempt of just slapping std::atomic on each variable didn't fix the issue.
stb_image_resize2.h:7152 Data race in stbir__update_info_from_resize(stbir__info, STBIR_RESIZE) at 0x1185b7000
info->input_data = resize->input_pixels;
Two threads executed this at the same time!
stb_image_resize2.h:7153 Data race in stbir__update_info_from_resize(stbir__info, STBIR_RESIZE) at 0x1185b7000
info->input_stride_bytes = resize->input_stride_in_bytes;
Two threads executed this at the same time!
stb_image_resize2.h:7154 Data race in stbir__update_info_from_resize(stbir__info, STBIR_RESIZE) at 0x1185b7000
info->output_stride_bytes = resize->output_stride_in_bytes;
The list goes on, it's also worth noting that most of these variables are read by the functions executed during resize (ie. stbir__decode_scanline
).
I'm not super up-to-date on threads and data races and current terminology, but I think the point here is that there are two separate issues:
Jeff, please correct me if I'm wrong about this explanation, since I'm just guessing here.
Each thread is writing and reading the variable. It's writing when stbir__update_info_from_resize
is called and it's reading a bit later from the guts of the resize. But with many threads there's no guarantees that writes happen before reads. Thread1 writes, then starts the resize and just as it's about to read the variable Thread2 starts its write...
It's writing when stbir__update_info_from_resize is called and it's reading a bit later from the guts of the resize
Correct, but every thread does this - all writes will happen before any read, because every thread is doing them. It's still a data race warning and we'll fix that up - but will not change any behavior. C++ and UB are not relevant here.
Jeff, please correct me if I'm wrong about this explanation, since I'm just guessing here
Yes, that is correct. The data races can be fixed, but they will have no effect of the behavior of the code. So, if there is corruption, then that would be some other problem...
Each thread is writing and reading the variable. It's writing when
stbir__update_info_from_resize
is called and it's reading a bit later from the guts of the resize. But with many threads there's no guarantees that writes happen before reads. Thread1 writes, then starts the resize and just as it's about to read the variable Thread2 starts its write...
Under what condition should Thread1 reading the variable while Thread2 has "partly" written it cause corruption, if Thread2 is writing the exact same value that's already in the variable (it's the same value because that's the value Thread1 wrote)? Because that's what the scenario we've been describing is.
The fact that both threads are writing the same value is not relevant, what's relevant is that the write itself is not atomic. This can result in a torn write that another thread observes when it reads the value. I'm only speculating on where the corruption comes from, but consider info->input_data
assignment, that's a 8 byte value, so we can think of writing to this variable as two 32 bit assignments. If the second thread reads the value in the middle of those two writes it's going to get either the low 32 or high 32 bits of the address to input data. You can probably see how this could cause corruption as it's going to be reading data from who knows where for that brief moment in time.
Thread sanitizer reports numerous data races, mostly when assigning the
info
struct insidestbir__update_info_from_resize
. I'm also seeing visual corruption when running resizing on multiple threads. My guess would be torn reads and writes.It looks like
stbir__update_info_from_resize
is not touching any of the per-slice data members which suggests it can be called only once instead of being called once per split. I've moved the call intostbir_build_samplers_with_splits
as we need to call that function when setting up the multi-threaded slice resizing. The code can potentially be simpler ifstbir__update_info_from_resize
can be called before the call tostbir__perform_build
. I don't understand the code enough to make this call.