net-lisias-ksp / KSP-Recall

Recall for KSP blunders, screw ups and borks.
GNU General Public License v2.0

Unity's spinlocks are bullying the Garbage Collector, and Refunding is not helping on the situation. #21

Closed Lisias closed 3 years ago

Lisias commented 3 years ago

Launch this craft with Recall installed.

See hell breaking loose over the poor Garbage Collector's head.

(image: bully-scholarship-edition-thumb04)

(There's also a memory leak happening on 3rd parties, completely unrelated to Recall)

Lisias commented 3 years ago

Process Sample (macOS) of the 261 in progress:

Sample of KSP.txt

Note the many threads at 100% CPU on semaphore_wait_trap (they did spinlocks on a bytecode virtual machine? Really???), and the GC thread also being chewed at 100% on GC_header_cache_miss and GC_generic_malloc.

Lisias commented 3 years ago

This issue is not a bug on Refunding. The underlying problem is a terrible decision by Unity to use spinlocks instead of mutexes on memory-related critical sections of code.

This also explains a lot of different problems KSP has been suffering from for a long time.

See these posts on forum:

Lisias commented 3 years ago

Renaming the issue to properly address the situation.

KSP-Recall's Refunding is, indeed, abusing the GC a bit - but this is not what's driving the process into a deadlock. At worst, I would be causing some stuttering while launching the craft.

But since fixing Unity is beyond the scope of KSP-Recall, I will have to tackle this somehow on Refunding's side, perhaps hindering Refunding itself a bit (it will depend on whether I manage to keep the critical code inside the FixedUpdate, or will have to move it to the Update and risk it not being called).

mgalyean commented 3 years ago

Is the spinlock at the mono level and perhaps tunable via mono environment variables? Ideally there would be a way to tell it to use mutexes instead of spinlocks for the GC at least, but this is very unlikely to be doable. There are a lot of GC env var options but I don't recall seeing anything specific to how the locking works skimming the vars
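For what it's worth, bdwgc does read a handful of environment variables (documented in its README), but as far as I can tell none of them switch the locking strategy. A hypothetical experiment, assuming the game binary is launched directly, would only let you observe or constrain the collector, not change its locks:

```shell
# Hypothetical experiment only: these bdwgc knobs tune the collector but
# (as far as documented) cannot swap its spinlocks for mutexes.
# GC_NPROCS caps how many CPUs the collector assumes it may use;
# GC_PRINT_STATS logs collection activity so the thrashing is visible.
# "./KSP" stands in for however the game binary is actually launched.
GC_NPROCS=1 GC_PRINT_STATS=1 ./KSP
```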

Lisias commented 3 years ago

> Is the spinlock at the mono level and perhaps tunable via mono environment variables? Ideally there would be a way to tell it to use mutexes instead of spinlocks for the GC at least, but this is very unlikely to be doable. There are a lot of GC env var options but I don't recall seeing anything specific to how the locking works skimming the vars

Nope, they are on the C code:

    1279 Thread_2729384: Finalizer
    + 1279 thread_start  (in libsystem_pthread.dylib) + 13  [0x7fff73d0040d]
    +   1279 _pthread_start  (in libsystem_pthread.dylib) + 66  [0x7fff73d04249]
    +     1279 _pthread_body  (in libsystem_pthread.dylib) + 126  [0x7fff73d012eb]
    +       1279 GC_start_routine  (in libmonobdwgc-2.0.dylib) + 28  [0x115be9987]
    +         1279 GC_inner_start_routine  (in libmonobdwgc-2.0.dylib) + 90  [0x115be99f2]
    +           1279 start_wrapper  (in libmonobdwgc-2.0.dylib) + 652  [0x115b75b1b]
    +             1279 finalizer_thread  (in libmonobdwgc-2.0.dylib) + 671  [0x115baee76]
    +               1279 semaphore_wait_trap  (in libsystem_kernel.dylib) + 10  [0x7fff73c42256]

https://github.com/Unity-Technologies/bdwgc/tree/6e8db07c8d2e45fe60016a95c343912e038e5e9d

But even if we had a way to limit this using some configuration option, we would still have to co-exist with the rest of the Operating System.

The only (and I really mean only) place where spinlocks are safe to use is on embedded systems where you have FULL control of every piece of the hardware you are using, and so can define in advance how many threads the whole system will have.

mgalyean commented 3 years ago

So the only option would be a purpose made c(++?) library on an OS specific basis with a spinlock API that actually did mutexes under the hood...and the KSP env would have to be setup so mono could only see that flavor of the library. And if the library is compiled into the runtime then there is nothing that can be done without binary editing/redirecting in the runtime.

Lisias commented 3 years ago

> So the only option would be a purpose made c(++?) library on an OS specific basis with a spinlock API that actually did mutexes under the hood...and the KSP env would have to be setup so mono could only see that flavor of the library. And if the library is compiled into the runtime then there is nothing that can be done without binary editing/redirecting in the runtime.

One size does not fit all.

You need a specialised library for every OS, exploiting its strengths and avoiding its weaknesses.

Lisias commented 3 years ago

I reworked the code to prevent removing a Refunding resource once it's created. It solved the deadlocks; the process gets only 13 to 15% of CPU load (i7, 8 threads -- max CPU load is 800%).

However, the memory continued to go through the roof. With that 1,000-part craft on the OP, my machine just can't launch it - the process gets to 32GB of memory allocated, with 30GB of compressed memory and about 2GB of "Real Memory" allocated.

Technically, this solved the issue: by avoiding destroying the Resource, I prevented the need for GC and, so, the spinlocks didn't fire. But the memory continued to blow through the roof. :(

Code on commit https://github.com/net-lisias-ksp/KSP-Recall/commit/9b74455ab4216249a3283b3412e3a460664b947d

Lisias commented 3 years ago

The changes on commit https://github.com/net-lisias-ksp/KSP-Recall/commit/9b74455ab4216249a3283b3412e3a460664b947d do the following:

1) Create the resource on demand.
2) this.part.Resources.Add it.
3) Eventually this.part.Resources.Remove it when appropriate, to prevent the fake resource from screwing up parts on which some add'ons (or even KSP) don't expect any resource. Stackable parts also get the fake resource removed whether it's needed or not, as I didn't manage to support them properly.
4) The resource is this.part.Resources.Add'ed back when there is an expectation that the craft could be recovered.

The fake resource is not destroyed anymore - it is kept in a private variable and constantly reused.

So, Refunding is not responsible for the memory leak - it's only the trigger. The memory leak is happening on the this.part.Resources.Add call!!!

IT'S A KSP BUG. Yeah, another one.

This is royally screwing up the hack - I just can't shove the fake resource into every part via Module Manager, as this would screw up Add'Ons that rely on some parts having NO resources. Stackable parts are just one of the problems I need to prevent.

But if I add them at runtime, lots and lots of memory is allocated for no apparent reason - 93% of the memory ends up compressed; only 7% of the process memory is really in use at any given moment. This strongly hints at a pretty nasty memory leak in the Resources.Add code.

Lisias commented 3 years ago

Flagging this as "Not my fault".

Refunding was not as efficient as possible, but the code was working fine (and given my time constraints at that time, I'm not ashamed of it).

Refunding is now as efficient as possible (besides wasting some memory, as the fake resource is now only destroyed when the part is destroyed), which kinda "fixes" the problem described in this issue.

However, the memory leak persists. It will be tackled (or not...) in issue #23.

Lisias commented 3 years ago

Ping @mgalyean .

Lisias commented 3 years ago

Closing this, as the code was successful in preventing the spinlock madness from being triggered. (The madness is still there to be triggered by other innocent code, but there's only so much I can do about it.)