Note: I am not proposing any changes to https://github.com/mmtk/mmtk-core/pull/643. The algorithm works, and its style is consistent with other parts of mmtk-core. I am talking about a more general issue in mmtk-core.

TL;DR: We have Chunk(Address), Block(Address), Line(Address) and Page(Address). They implement Copy, and has pointer semantics, but does not own the region they point to. This approach is not idiomatic in Rust. It does not prevent two pointers to point to the same region mutably, and the programmers need to add hard-to-reason-about locks even when they have &mut access. We may need to introduce a set of owning types that own those regions.

Background

Rust ownership and references

Rust has&T, &mut T and T.

&mut T is an exclusive (unique) borrowing reference, and it grants read-write access.

By enforcing the borrowing rules, Rust programmers are confident that if a variable holds a &mut T, it is the only variable that refers to the T, and it has read-write access to it. No other &mut T, not even &T, could coexist. For this reason, Rust often advertise as having "fearless concurrency". If you respect the borrowing rules, there can be no data race.

Region types

The Region trait represents an aligned region of memory of a certain size. Instances include Chunk(Address), Block(Address), Line(Address) and Page(Address). The PR https://github.com/mmtk/mmtk-core/pull/643 also introduces a new type Block(NonZeroUsize). All of them are wrappers of addresses.

Region instances implement Copy. They can be copied just like C pointers. And they can be constructed from addresses. In this way, given an address range, we can iterate through all aligned regions in that range. Each region is constructed from the address.

// In mmtk::policy::immix::chunk::Chunk::sweep
for block in self.blocks().filter(/* ... */) {
    // A new Block instance is created in each iteration.
    if !block.sweep(/* ... */) { // Accessing block mutably
        // ...
    }
}

As we can see, in each iteration, we create a Block instance from an address, and access the block mutably (sweep).

This means any thread can gain read-write access to a Block without borrowing or getting ownership, and dangers await.

The problem

The mysterious lock

When reviewing the PR https://github.com/mmtk/mmtk-core/pull/643, I noticed an interesting piece of code.

impl Block {
    // ... other methods here
    #[inline(always)]
    pub fn attempt_release<VM: VMBinding>(self, space: &MarkSweepSpace<VM>) -> bool {
        // ...
                    let block_list: *mut BlockList = loop {
                        let list: *mut BlockList = self.load_block_list();
                        (*list).lock();
                        if list == self.load_block_list() {
                            break list;
                        }
                        (*list).unlock();
                    };
                    (*block_list).remove(self);
                    (*block_list).unlock();
        // ...
    }
    // ... other methods here.
}

This code snippet tries to remove a Block from a BlockList (a doubly-linked list), but another thread may have moved the Block to another BlockList concurrently. So it has to try to lock the BlockList in a loop until it successfully locked the BlockList and the Block is still in that BlockList.

This piece of code puzzled me in two ways.

If we already have a *mut T, why there is still a race? Why don't we have unique access to it? (OK. It is pointer, not reference. But it is still counter-intuitive to Rust programmers.)
What operation could the other thread be performing? If it is racing, it is racing against something.

To answer question (2), I found the two places that tries to acquire the same lock. One (the shown snippet) is sweeping chunks, and the other is releasing mutators.

During the Release stage, the MarkSweep collector does two things concurrently.

a. For all chunks, find all blocks in it that are unmarked, and release them from their BlockList (the shown snippet). b. For all mutators, transfer all blocks from the available_blocks BlockList and the consumed_blocks BlockList into the unswept_blocks BlockList.

When doing both concurrently, each block may be reached from two different places, namely through chunks and through mutators. If not synchronised properly, it may leave the doubly-linked list in an inconsistent state. Therefore, a spinlock is added to the BlockList struct, and used to prevent this race. The lock is obtained using the lock() method, as shown in the snippet above.

The lock solved the racing problem here. But I am curious why this problem still occurs in Rust. Shouldn't Rust protect us from this kind of race?

Why does this happen?

The root cause is that Block implements Copy and can be created from addresses. It has pointer semantics. It points to a Block, but does not represent ownership. So all the ownership-transfer and borrow-checking mechanisms stopped working in this use case. Any thread can gain mutable access by simply creating a Block instance.

The SweepChunk work packet uses RegionIterator<Block> over the Chunk to gain access. It bypassed the borrow checker, allowing two threads to hold &mut BlockList simultaneously.

But why we don't need locks in other places?

If the Chunk and Block types are inherently unsafe, I would expect the PR https://github.com/mmtk/mmtk-core/pull/643 has lock/unlock operations everywhere. But it's not the case. BlockList is usually accessed without locking.

A BlockList represents a list of Block of certain size. It has several mutable methods: remove(), push(), pop(), append() and reset().

On the fast path of alloc(), the mutator tries to pop() a block from the BlockList in the available_blocks array owned by the allocator which is owned by the current mutator. Since the mutator is calling pop() on a BlockList it already owns, it shall not cause data race.
If that fails, the mutator will try to recycle local blocks (if not doing eager sweeping). It will pop() a Block from its local unswept_blocks list, sweep it, and retry allocating from that block. It is still accessing blocks and block lists it owns, so there is still no data race.
If that fails, too, the mutator will enter the slow path, and try to acquire blocks from the global pool. This time, it accesses MarkSweepSpace::abandoned which is a Mutex<AbandonedBlockLists>. This forces all mutators to acquire a lock before accessing the lists within, and the lock prevents data race.
If that still fails, it will ask the PageResource for fresh pages. PageResource uses Mutex internally to prevent data race.

Observed from this analysis, we see our MiMalloc implementation is actually a good example of using ownership to prevent data race. The fast path accesses owned data so there can't be data race; the slow path accesses global data protected by Mutex so it prevents data race; once the blocks are transferred from the global pool to the local BlockList, subsequent allocations from that block will be local, and can't have data race, either.

Unfortunately, the ownership model is only in the developers' head, but not reflected in the code. Block implements Copy, so once assigned, the original owner may still keep a Block instance that represents the same block. In Rust, this shouldn't happen. Once the ownership is transferred, the original owner no longer has access to it. This ensures the global pool cannot give the same Block to two different mutators.

Throughout the code, Block is treated as a pointer. Therefore, whether to use &self or &mut self is only a matter of whether it mutates the block.0 field of the Block(NonZeroUsize) instance. We can see that Block::sweep(&self) actually takes a &self as parameter, even though it writes into the block.

I would not disapprove the MiMalloc PR for this, because this is common in MMTk core. The immix::Block is the same. All of its methods take &self parameter, including immix::Block::sweep(&self).

So what kind of `Block` type should we have?

We need something that represents the ownership of a block. For this, the Block type should not implement Clone or Copy. MarkSweepSpace::acquire_block should be the only place where Block instances can be created, and it creates Block from the address returned from PageResource when allocating "fresh" blocks. Block instances are destroyed in Block::attempt_release(self) (:thumbup: to the owning self parameter), when its underlying pages are released back to the PageResource. When transferring a Block from one BlockList to another, or from the global pool to the mutator, Rust's ownership model will ensure no two mutators share the same Block, and even no two BlockList can hold the same Block at the same time.

What about sweeping all blocks in a Chunk? From Rust's point of view, the mutator owns the block, because the MarkSweepSpace creates Block from the address returned from PageResource, and transfers it to the mutator. Mutators, however, can abandon blocks so they are transferred to MarkSweepSpace. If each Block is owned by either a mutator or MarkSweepSpace, I would argue that it is completely unnecessary to sweep blocks by iterating chunks. Instead, we can simply iterate through each block in each mutator and the MarkSweepSpace, see if it is marked, and remove it from its BlockList. That's all the blocks out there. This can be parallelised by creating one work packet for each mutator, and one more for the global MarkSweepSpace. And we don't need any lock on the BlockList, because each BlockList is also owned by exactly one mutator (or the MarkSweepSpace) and the race is impossible.

Known issues

Is this approach applicable to other GC algorithms, like Immix?

mmtk / mmtk-core

Owning region types #696

Background

Rust ownership and references

Region types

The problem

The mysterious lock

Why does this happen?

But why we don't need locks in other places?

So what kind of `Block` type should we have?

Known issues

mmtk / mmtk-core

Owning region types #696

Background

Rust ownership and references

Region types

The problem

The mysterious lock

Why does this happen?

But why we don't need locks in other places?

So what kind of Block type should we have?

Known issues

So what kind of `Block` type should we have?