ta0kira / zeolite

Zeolite is a statically-typed, general-purpose programming language.
Apache License 2.0
18 stars 0 forks source link

Memory leak induced by race condition related to `BoxedValue`. #186

Closed ta0kira closed 2 years ago

ta0kira commented 2 years ago

tests/leak-check/LeakTest forever shows that there's a slow memory leak that didn't exist in v0.18.1.0. The rate seems to be about 1 object per ~300 iterations, which means that the leak is unlikely to actually occur in programs that aren't specifically designed to cause race conditions.

The issue seems to not be present as of 51f75bd087f60b640893f37c17da0097d590588f (single allocation for BoxedValue counters and object) but present as of e058732b249d8b5e786eac89ece07af9455da067 (switch from Var_self to VAR_SELF.)

It's almost certainly a race condition because valgrind doesn't detect any memory leaks.

ta0kira commented 2 years ago

Calling union_.value_.as_pointer_->object_->~TypeValue(); in WeakValue::Cleanup doesn't seem to fix the problem. The issue is therefore probably due to BoxedValue::FromPointer getting called with a value that isn't the most-derived class.

Looks like AnonymousOrder is at least part of the problem, since it uses VAR_SELF but isn't the most-derived class.

ta0kira commented 2 years ago

Still a problem after 3bd6837c0fc4e65c00068129834aaa8ecfd18537.

ta0kira commented 2 years ago

This actually was also a problem in 6f4a65d56c5f8add68ac88f6c7df21dbb7224e7a.

ta0kira commented 2 years ago

Looks like it was also an issue in 0.18.0.0.

ta0kira commented 2 years ago

One possibility is that multiple threads are accessing the same BoxedValue rvalue, rather than creating a new copy of the reference.

ta0kira commented 2 years ago

Capturing variables by value and excluding this and Var_self in ExtValue_ProcessThread::Call_start doesn't seem to change anything. (Making changes to 0.18.0.0 for testing.)

ta0kira commented 2 years ago

After adding reference validation to LeakTest:

tests/leak-check/LeakTest: Failed condition: Leaked LeakTest checkedValue at 0x1e64680 (S: 0 W: 2)
  From LeakTest.doIteration at line 71 column 5 of "tests/leak-check/leak-check.0rx"
  From LeakTest.runWith at line 35 column 7 of "tests/leak-check/leak-check.0rx"
  From LeakTest.run at line 19 column 7 of "tests/leak-check/leak-check.0rx"
  From main

One weak reference is from checkedValue and the other is (probably) from BoxedValue skipping the cleanup step.

ta0kira commented 2 years ago

So far all of the errors from SimulateRefs have this exact order of operations:

Error in final state: object not cleaned up after last reference
Final state: S: 0 W: 1 AC: 0 WC: 0
  weak1[ToStrongEnter]: S: 1 W: 3 AC: 0 WC: 0
  shared1[DropSharedEnter]: S: 4097 W: 3 AC: 0 WC: 0
  weak2[ToStrongEnter]: S: 4096 W: 3 AC: 0 WC: 0
  weak1[ToStrongAlive]: S: 8192 W: 3 AC: 0 WC: 0
  weak1[ToStrongDiscardWeak]: S: 4097 W: 3 AC: 0 WC: 0
  weak1[DropSharedEnter]: S: 4097 W: 2 AC: 0 WC: 0
  weak2[ToStrongDead]: S: 4096 W: 2 AC: 0 WC: 0
  weak2[ToStrongCheckWeak]: S: 0 W: 2 AC: 0 WC: 0

The obvious problem seems to be that cleanup on the strong reference starts before the lock is subtracted for the unsuccessful second strong operation. (I don't think that would have ever occurred to me.)

ta0kira commented 2 years ago

The solution was to use a proper lock for the two places that the fake lock was being used, and actually make it a spinlock.

Therefore, it's sufficient to just have the counters be atomic for all other uses, and use a spinlock for both cases above.


I went with the fake lock initially because I'd assumed that a spinlock would need to be used everywhere, rather than just in those two places. Using a spinlock everywhere made performance worse than just using std::shared_ptr, and the fake lock fixed that performance problem.