thi-ng / umbrella

⛱ Broadly scoped ecosystem & mono-repository of 198 TypeScript projects (and ~175 examples) for general purpose, functional, data driven development
https://thi.ng
Apache License 2.0

[malloc] Multi-threaded allocations #443

Open daneren2005 opened 6 months ago

daneren2005 commented 6 months ago

I wanted to use malloc for a multi-threaded game. Creating a MemPool over a SharedArrayBuffer and handing it to a worker thread works great. I wanted to not only be able to use the memory in multiple workers, but also to alloc and free memory in those threads. I looked through the code in https://github.com/thi-ng/umbrella/blob/develop/packages/malloc/src/pool.ts and don't see any uses of Atomics. I imagine this means that allocs from multiple threads are going to corrupt the memory pool fairly quickly.

I thought about just adding a basic lock myself before allocating or freeing memory, but even that isn't really going to solve the problem unless MemPool loads its values with atomics. It would still be possible for thread A to take a spot in memory and for thread B to take the same spot, because thread B still has the old value in its CPU cache.

I think we can safely free memory from multiple threads. If thread A marks a spot in memory as free, the worst case is that thread B won't know that spot is free and won't reuse it until the cache is updated.

Is there a way we can make allocs thread safe?

jtenner commented 5 months ago

Second this! I would love to have a shared memory pool between workers.

postspectacular commented 4 months ago

@daneren2005 @jtenner - sorry, completely forgot to answer this so far... the main reason for the lack of atomics is that I had no need for them when this package was created, but then later Spectre also happened and SharedArrayBuffers (SABs) were defunct for a while...

I guess the biggest question to answer here is whether a lock-free approach is desired, because just using atomics/mutexes would imply short waits if there's contention for allocation/freeing. The former would require a complete rethink and much more internal state management, but the latter (mutexes) would be much easier to implement. Another possibility to consider here is assigning a separate, isolated region of the SAB to each worker/process, thereby sidestepping any synchronization requirements, but that might not work for your use cases...
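For illustration, the per-worker-region idea could be sketched like this (`partitionBuffer` and the `Region` shape are made-up names for this sketch, not part of @thi.ng/malloc):

```typescript
// Sketch: carve one SharedArrayBuffer into isolated per-worker regions.
// Region boundaries are rounded down to `align` bytes (8 here, a typical
// allocator alignment). Each worker constructs its own allocator limited
// to its [start, end) range, so alloc/free needs no cross-thread sync.
export interface Region {
    start: number;
    end: number;
}

export function partitionBuffer(
    totalSize: number,
    numWorkers: number,
    align = 8
): Region[] {
    // round the per-worker size down to the alignment boundary
    const size = Math.floor(totalSize / numWorkers / align) * align;
    const regions: Region[] = [];
    for (let i = 0; i < numWorkers; i++) {
        regions.push({ start: i * size, end: (i + 1) * size });
    }
    return regions;
}
```

Each worker `i` could then (if I'm reading the MemPool options right) create its pool as `new MemPool({ buf: sab, start: regions[i].start, end: regions[i].end })` and alloc/free freely within its own region.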

jtenner commented 4 months ago

For my personal use case, it would be nice to allocate a single buffer and have a bunch of workers cooperate together with a lock free allocator.

However, I did consider your approach of having a single SharedArrayBuffer for each worker, and letting each worker manage its own memory. It would mean that referencing memory across workers would require some kind of addressing convention:

The most interesting idea I came up with would be a 6-bit integer worker id + a 26-bit address into that worker's memory (packed into a 32-bit pointer) to help the threads communicate with each other. This method of communication is definitely viable!
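As a sketch, that 6+26-bit packing might look like this (all names here are illustrative):

```typescript
// 6-bit worker id + 26-bit address, packed into one 32-bit pointer.
// 6 bits => up to 64 workers; 26 bits => up to 64MB addressable per worker.
const WORKER_ID_BITS = 6;
const ADDRESS_BITS = 32 - WORKER_ID_BITS; // 26
const ADDRESS_MASK = (1 << ADDRESS_BITS) - 1;

export function packPointer(workerId: number, address: number): number {
    // >>> 0 keeps the packed value unsigned (<< yields a signed 32-bit int)
    return ((workerId << ADDRESS_BITS) | (address & ADDRESS_MASK)) >>> 0;
}

export function unpackPointer(ptr: number) {
    return {
        workerId: ptr >>> ADDRESS_BITS,
        address: ptr & ADDRESS_MASK
    };
}
```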

However, a lock-free approach would simplify the API surface area for end developers and would result in a lot less work for performance critical processes. The primary use case I'm thinking about is a WebSocket server on bun piping messages to a scheduler. Requiring the socket worker to perform free operations could potentially get in the way of receiving and sending new messages. Again, this is solved by giving each worker its own arena, so maybe this request is a big ask.

Either way, I appreciate the quick response, and honestly wouldn't mind trying to get my hands dirty trying to port a memory allocator with a lot of guidance to help this project out.

daneren2005 commented 4 months ago

I haven't open sourced it yet because it's not ready (I still need to make a bunch of the classes I've built on top of this thread safe), but I ended up just copying the MemPool code into my own repo. I'm trying to create a framework where a game entity is backed by a memory location in a SharedArrayBuffer. I created a gist of it (I renamed it to MemoryBuffer because my pool is made up of multiple of them): https://gist.github.com/daneren2005/e035b940429b649f500ca37cecdccc7b.

The lock implementation used is extremely simple:

```typescript
const UNLOCKED = 0;
const LOCKED = 1;
export function lock(data: Int32Array, index: number = 0) {
    // Retry until we are the one who set this from UNLOCKED to LOCKED
    while(Atomics.compareExchange(data, index, UNLOCKED, LOCKED) !== UNLOCKED) {
        if('WorkerGlobalScope' in self) {
            // in a worker we can block until notified
            Atomics.wait(data, index, LOCKED);
        } else {
            // TODO: Spin-locks suck.... (Atomics.wait() throws on the main thread)
        }
    }
}
export function unlock(data: Int32Array, index: number = 0) {
    if(Atomics.compareExchange(data, index, LOCKED, UNLOCKED) !== LOCKED) {
        console.warn('We are unlocking when it was not locked!');
    }

    Atomics.notify(data, index);
}

export const SIMPLE_LOCK_ALLOCATE_COUNT = 1;
```
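A lock like this can serialize all calls into a shared, otherwise non-thread-safe allocator. Here is a minimal, portable sketch (pure CAS spin with no Atomics.wait, so it also runs on the main thread; `withLock` is an illustrative helper, not from the gist):

```typescript
// Guard a critical section with an atomic lock. A real version would use
// Atomics.wait in workers (as in the snippet above) instead of spinning.
const UNLOCKED = 0;
const LOCKED = 1;

export function withLock<T>(lockState: Int32Array, fn: () => T): T {
    // spin until our compareExchange wins the UNLOCKED -> LOCKED transition
    while (Atomics.compareExchange(lockState, 0, UNLOCKED, LOCKED) !== UNLOCKED) {
        // busy-wait
    }
    try {
        return fn();
    } finally {
        // release even if fn() throws
        Atomics.store(lockState, 0, UNLOCKED);
    }
}
```

Each thread would then do something like `const addr = withLock(lockState, () => pool.malloc(64));`, assuming `pool` is an allocator constructed over the same SharedArrayBuffer and `lockState` is an Int32Array view shared by all threads.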

And I ended up implementing something super simple like jtenner mentioned, where I store 12 bits for the buffer index and 20 bits for the byte offset within that buffer (each buffer being its own MemPool of up to 1MB):

```typescript
// bottom 12 bits (4096 buffers) for bufferPosition
// top 20 bits (1MB) for bufferByteOffset
const BYTE_OFFSET_BIT_COUNT = 20;
const POSITION_BIT_COUNT = 32 - BYTE_OFFSET_BIT_COUNT;
const MAX_BYTE_OFFSET_LENGTH = Math.pow(2, BYTE_OFFSET_BIT_COUNT);
const MAX_POSITION_LENGTH = Math.pow(2, POSITION_BIT_COUNT);

export function loadPointer(data: Uint32Array, index: number = 0) {
    return getPointer(Atomics.load(data, index));
}

export function storePointer(data: Uint32Array, index: number = 0, bufferPosition: number, bufferByteOffset: number) {
    Atomics.store(data, index, createPointer(bufferPosition, bufferByteOffset));
}

export function getPointer(value: number) {
    return {
        // mask off the low POSITION_BIT_COUNT (12) bits
        bufferPosition: value & (MAX_POSITION_LENGTH - 1),
        bufferByteOffset: value >>> POSITION_BIT_COUNT
    };
}
export function createPointer(bufferPosition: number, bufferByteOffset: number) {
    // >>> 0 keeps the packed value unsigned (<< yields a signed 32-bit int)
    return (bufferPosition + (bufferByteOffset << POSITION_BIT_COUNT)) >>> 0;
}

export { BYTE_OFFSET_BIT_COUNT, POSITION_BIT_COUNT, MAX_BYTE_OFFSET_LENGTH, MAX_POSITION_LENGTH };
```

I figured 1MB buffers were good, but you could obviously tweak this to be bigger if you would rather have fewer, larger buffers.

Back to the original question postspectacular asked: I think a lock-free implementation would be much better. I am not skilled enough to even attempt that, so I just went with a simple lock and Atomics since that was straightforward. You have to be EXTREMELY careful not to trigger locks in the main thread though. I found that even something simple like creating a bullet entity from the main thread would occasionally visibly lock the UI with the spinlock method I used, but that wasn't a hard problem to work around. My 2 cents is that it would be good to add basic Atomics/locks to your implementation first, and then if you felt like it, making a lock-free version later would be great.
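For what it's worth, the main-thread jank described above could also be mitigated with a non-blocking variant of the lock that lets the caller defer instead of spinning (`tryLock` is an illustrative addition, not part of the gist):

```typescript
const UNLOCKED = 0;
const LOCKED = 1;

// Non-blocking acquire: returns true if we got the lock, false if it was
// contended, so the main thread can retry later instead of janking the UI.
export function tryLock(data: Int32Array, index: number = 0): boolean {
    return Atomics.compareExchange(data, index, UNLOCKED, LOCKED) === UNLOCKED;
}

// main-thread usage sketch (unlock as defined in the gist above):
// function allocEntity() {
//     if (tryLock(lockState)) {
//         try { /* pool.malloc(...) */ } finally { unlock(lockState); }
//     } else {
//         requestAnimationFrame(allocEntity); // retry next frame
//     }
// }
```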