Fast (last) vector dimension being used by different threads might cause false cache sharing

hdante commented 1 week ago

Hello, this is a report of a conceptual performance issue in the code, but without any practical effects - when declaring the locChi2 and locInd multi-dimensional vectors, the dimension used for splitting the work into multiple threads was the last dimension. This means there will be vectors with contiguous data where each contiguous element is accessed by different hardware threads. Since typical cache lines are 64 bytes wide, the same cache lines will be requested and "locked" by multiple hardware threads at the same time, in practice serializing the access to the chi2 values, slowing down and prohibiting the parallelism in the loops using this vector.

Instead, the dimension used for distributing the work between the threads shouldn't be the last one, so that values dispatched to different threads are guaranteed to be in different cache lines.

Before submitting Please check the following:

[ ] I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
[ ] I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a descriprion of what I expected instead.
[ ] If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.

johannct commented 1 week ago

Do you think you can help us investigate this and the other performance issue you flagged? It would be great to have you in the developer team, given the input you provided so far.

hdante commented 1 week ago

Hello, Johann, I can help you, but these are not real performance issues. In this issue, the relevant loop is already fast and small, and changes to it won't noticeably impact the performance, so this issue is a matter of recommended practices and avoiding future problems. I've reported two other issues that might be potential multi-threading bugs. I still need to report a real performance issue found in the hot loop, where dictionary lookups are eating part of the performance, but the expected speed-up is around 10%.

I'll leave the suggestions for the reported issues as comments. For this issue, I'd solve it by changing the order of the affected vector's dimensions while assigning the first dimension for the thread ID:

vector<vector<vector<double>>> locChi2(number_threads, vector<vector<double>>(3, vector<double>(dimzg, 1.e9)));
vector<vector<vector<int>>> locInd(number_threads, vector<vector<int>>(3, vector<int>(dimzg, -1)));

Doing only this change does not guarantee that there will be zero false sharing because the different 1-D vectors of length dimzg might be aligned on 16-byte boundaries when using the default dynamic memory allocator, which is smaller than the 64-byte cache line, but this already moves away from the previous situation where there's 100% guarantee of false sharing because the values for different threads are 8-byte contiguous values by design. Another possible change would be to flatten this 3-D vector into a large 1-D vector, but given this partial solution, I wouldn't go any further to optimize an already low cost loop.

lephare-photoz / lephare

Fast (last) vector dimension being used by different threads might cause false cache sharing #198