grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Various race conditions reported by TSAN #199

Closed. Civil closed this issue 7 years ago.

Civil commented 8 years ago

https://gist.github.com/Civil/e4aaffb76be7e216cc4f259e162b70b9 - TSAN report, with around 40 unique places where races happen.

Also please note that everything TSAN reports here is a genuine bug, e.g. doing writes while holding only a read lock, mixing atomic and non-atomic writes to the same value from different threads at the same time, etc.

So to be sure that everything is correct, it is absolutely necessary to fix ALL of them.

grobian commented 8 years ago

Right, so the gcc atomic instructions are races? Most of them look like false positives, which the empirical evidence also suggests.

dgryski commented 8 years ago

Any value accessed from multiple threads needs to be protected with a mutex or to always be accessed with atomic intrinsics.

TSAN does not have any false positives. If it detects a race, then there really was a race: two threads tried to access the same piece of memory, and at least one was a write.

https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
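
For illustration, here is a minimal standalone program (not from carbon-c-relay) containing exactly that kind of race: two threads, the same memory, at least one write. Built with gcc -fsanitize=thread -g, TSAN reports the conflicting accesses to counter:

/* race.c -- a minimal data race; build with:
 *   gcc -fsanitize=thread -g race.c -lpthread && ./a.out */
#include <pthread.h>
#include <stdio.h>

static int counter;              /* shared, unprotected */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        counter++;               /* plain read-modify-write: a race */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);   /* second writer */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%d\n", counter);     /* usually less than 200000 */
    return 0;
}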

Civil commented 8 years ago

Also please note that it takes some time to actually understand the traces. Most of the reports about "gcc atomics accessing the value" are there because the same value is accessed in a non-atomic way elsewhere.

grobian commented 8 years ago

Yes, but unless I misunderstand what it warns about, in this case it doesn't matter that it races, because the code uses optimistic concurrency. If a counter increment fails, meh. If an availability check fails when it could have succeeded, meh, better luck next time.
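
As an aside, that best-effort spirit can be kept while still satisfying TSAN. A minimal sketch, assuming C11 atomics are acceptable; the counter name is hypothetical, not an actual carbon-c-relay symbol:

#include <stdatomic.h>

/* hypothetical stats counter */
static atomic_ulong metrics_seen;

/* Relaxed ordering adds no memory barriers beyond the atomicity
 * of the add itself, so this stays cheap while being defined
 * behavior that TSAN accepts. */
static inline void count_metric(void) {
    atomic_fetch_add_explicit(&metrics_seen, 1, memory_order_relaxed);
}

A relaxed fetch-and-add costs little more than a plain increment, and unlike the plain one it never loses an update.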

dgryski commented 8 years ago

The compiler generates correct code assuming there is no concurrent access to the values. Higher optimization levels or newer compilers can easily spill random values to memory as long as everything is back where it should be when the basic block is finished.

For example,

int done;

void func() {
    while (!done) {
        /* work that never touches `done` */
    }
}

Assuming there is no access to done inside the while loop, the compiler is justified in replacing while(!done) with while(1), since that is the behaviour with no concurrent access.
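
The usual fix is to tell the compiler the flag is shared, e.g. with C11 atomics (a minimal sketch; gcc's __atomic builtins work the same way):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool done;

void func(void) {
    /* an atomic load may not be hoisted out of the loop: the
     * compiler must re-read `done` on every iteration */
    while (!atomic_load_explicit(&done, memory_order_acquire)) {
        /* work that never touches `done` */
    }
}

void stop(void) {
    atomic_store_explicit(&done, true, memory_order_release);
}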

Out-of-order execution, memory flushes, and silent memory corruption can lead to even more obscure issues.

TSAN warnings need to be fixed.

Civil commented 8 years ago

Also, most of the warnings are about one thread setting a variable with CAS while a second thread sets it without CAS, e.g.: https://github.com/grobian/carbon-c-relay/blob/master/dispatcher.c#L579 https://github.com/grobian/carbon-c-relay/blob/master/dispatcher.c#L474

This results in undefined behavior: the code can work correctly with one compiler version, but break in unpredictable ways after even a minor version bump.
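
The pattern being described looks roughly like the following sketch; the struct and field names are illustrative, since the exact dispatcher.c lines aren't quoted here:

/* Simplified reconstruction of the mixed-access pattern;
 * not the actual dispatcher.c code. */
typedef struct {
    int takenby;   /* 0 = free, otherwise the owning thread's id */
} connection;

void handle(connection *conn, int self_id) {
    /* claim: atomic compare-and-swap (gcc __sync builtin) */
    if (__sync_bool_compare_and_swap(&conn->takenby, 0, self_id)) {
        /* ... work on the connection ... */

        /* release: a PLAIN store. This is the mixed access TSAN
         * flags; it races with the CAS issued by other threads. */
        conn->takenby = 0;
    }
}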

grobian commented 8 years ago

Explain to me how that can cause an issue, and I'll fix it. The only thing that needs to be prevented there is multiple threads assigning the same connection to themselves, hence the atomic compare-and-swap, because I didn't want to take a pthread lock. So please enlighten me: how, in this scenario, can a thread releasing its own connection cause a race condition?

Civil commented 8 years ago
  1. It's undefined behavior from the compiler's point of view. You can't be sure the compiler won't optimize out some of your code.
  2. You can't be sure the compiler will produce proper cache invalidation. Imagine one thread thinking the value is 0 while another thinks it's 1.
  3. When one thread does a CAS and another does a plain assignment (as in my examples; this code provably runs in different threads), the result is undefined. See the sketch after this list.
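
A sketch of the remedy for point 3, reusing the illustrative connection type from the earlier sketch: once any access to a field is atomic, every access must be, including the release:

void handle_fixed(connection *conn, int self_id) {
    if (__sync_bool_compare_and_swap(&conn->takenby, 0, self_id)) {
        /* ... work on the connection ... */

        /* release through an atomic store too, so every access
         * to takenby is atomic and the data race is gone */
        __atomic_store_n(&conn->takenby, 0, __ATOMIC_RELEASE);
    }
}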

Without fixing this you can never be sure that the aggregation code produces correct output and that your data is mathematically correct, which is rather good to have.

If you don't trust Damian or me, I'd suggest opening a discussion on the gcc mailing list, for example, about how correct your assumptions are at this moment.

grobian commented 8 years ago

As far as I understand, we're talking about a selection mechanism that decides whether a thread handles a certain connection or not. The aggregation code doesn't use it, but uses full pthread locks instead, so I don't yet understand why you think the aggregator produces mathematically invalid results.

Civil commented 8 years ago

You have various undefined behaviors during writes and reads all over the aggregator pipeline, from parsing up to expiry and dispatching. Because of the nature of these issues, you can't guarantee that the data will reach the aggregator at all. That's why, given some input, you can't guarantee that the output will be the same every time, or that it will be correct at all. Behavior may also change when you migrate from one compiler version to another.

That's why there is a very high chance that the current aggregations are mathematically incorrect, at least in some cases.

I do understand what you were saying about optimistic concurrency, but the TSAN output says you are doing it wrong in some places: if you access a variable with atomic operations, you should do so everywhere; if you use atomics in one function and non-atomic accesses in another, the result is unpredictable.

And again, if you don't trust me or Damian, I'd suggest starting a discussion on the gcc mailing list, for example: provide them examples of the code and ask whether it's correct.

grobian commented 8 years ago

It's not about trusting people, it's about showing accurate examples of how it can go wrong. So far, both you and Damian have been talking about how much can go wrong, but without showing any relevant example (the while loop isn't relevant in this case!) that proves or demonstrates it does.

You talk about repeatability of aggregations, but you seem to forget that, by the nature of the streaming scenario, the aggregations never will be repeatable. Yet if you picked clock times carefully, you might be able to produce a unit-like test that results in the same aggregation values over and over again. That would be a nice start to further build on your argument that the mathematics behind the aggregations are incorrect.

On a side note, back in the day when I was still working on this, I found that the results of the aggregator unfortunately suffer severely from incomplete inputs, something that's inherent to the whole streaming approach. For small/controlled scenarios the count field can give a confidence level here, but for larger/uncontrolled aggregations this is no longer feasible to check.

grobian commented 8 years ago

I think we're down a great deal right now, but it's not fully done.

grobian commented 8 years ago

I get clean runs now, but I'm not guaranteeing there are none left.

dgryski commented 8 years ago

Thanks. Will try this out in our environment Monday or Tuesday and let you know.

grobian commented 8 years ago

I can guarantee that the aggregator got a lot slower because I had to apply a larger lock.

grobian commented 7 years ago

I think I have addressed all TSAN issues.