rust-embedded / cortex-m

Low level access to Cortex-M processors

[discussion] Peripheral singletons are not multi-core friendly #149

Open japaric opened 5 years ago

japaric commented 5 years ago

Background

Private Peripheral Bus (PPB)

All the peripherals in the cortex_m::Peripherals struct sit on the Private Peripheral Bus and have addresses of the form 0xE00x_xxxx. Most of these peripherals are interfaces to internal resources -- there's one instance of those resources per core. Examples of internal PPB peripherals are the NVIC, MPU, SYST (system timer), ITM, etc. The rest of the peripherals are interfaces to external resources, meaning that all cores access the same resources through these interfaces. An example of an external resource is the TPIU.

The bottom line here is that interacting with a singleton like CPUID on one core is different from interacting with it from another core, even though they are supposed to be the same singleton. For example, the CPUID.base.read() operation can return different values depending on which core executes it.
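
For illustration, a minimal sketch of what that difference looks like through the existing cortex_m::peripheral::CPUID API; the per-core values in the comment assume a hypothetical Cortex-M4F + Cortex-M0+ device:

use cortex_m::peripheral::CPUID;

fn part_number(cpuid: &CPUID) -> u32 {
    // CPUID.base is architecturally defined; PARTNO lives in bits [15:4].
    // On the assumed Cortex-M4F + Cortex-M0+ device this returns 0xC24 on one
    // core and 0xC60 on the other, even though both cores went through the
    // "same" CPUID singleton.
    (cpuid.base.read() >> 4) & 0xFFF
}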

Multi-core API

In RTFM land we are exploring multi-core applications in two modes: homogeneous mode and heterogeneous mode. Of course, it's very early days, so we still don't know what the ecosystem will adopt, but these APIs let us show the problems with the peripheral singletons.

Homogeneous

In this mode a binary crate is compiled using a single (compilation) target and a single ELF image is produced. The image contains the entry points for all cores and static variables have global visibility (visible to all cores) by default.

#![no_std]
#![no_main]

use core::sync::atomic::AtomicBool;

// visible to all cores
static X: AtomicBool = AtomicBool::new(false);

// core #0 user entry point
#[no_mangle]
unsafe extern "C" fn main_0() -> ! {
    // ..
}

// core #1 user entry point
#[no_mangle]
unsafe extern "C" fn main_1() -> ! {
    // ..
}

If this program is compiled for the thumbv8m.main-none-eabi target, then the resulting image can be flashed on, for example, a 2x Cortex-M33 device.

Heterogeneous

In this mode a binary crate is compiled for multiple (compilation) targets, so multiple ELF images are produced. There's one image for each core and each image contains the entry point and static variables for that core. A special linker section named .shared is used to share static variables between cores, that is, to make them visible to all cores. One opts into this .shared section using the #[shared] attribute; by default a static variable is visible only to the core where it is defined.

#![no_std]
#![no_main]

use core::sync::atomic::AtomicBool;

// visible to all cores
#[shared]
static X: AtomicBool = AtomicBool::new(false);

// visible only to core #0
#[cfg(core = "0")]
static Y: AtomicBool = AtomicBool::new(false);

// each core has a *copy* of this static variable
static Z: AtomicBool = AtomicBool::new(false);

// core #0 user entry point
#[cfg(core = "0")]
#[no_mangle]
unsafe extern "C" fn main() -> ! {
    // ..
}

// core #1 user entry point
#[cfg(core = "1")]
#[no_mangle]
unsafe extern "C" fn main() -> ! {
    // ..
}

If this program is compiled for the thumbv7em-none-eabihf and thumbv6m-none-eabi targets, then the resulting images can be flashed on a Cortex-M4F + Cortex-M0+ device, for example.

Issues

A. Send is wrong

By definition, Send means that it's (memory) safe to transfer ownership of a resource from one thread / core to another. In the case of these peripheral singletons, transferring them from one core to another is wrong because that changes the meaning of the value.

// homogeneous mode

use cortex_m::peripheral::DWT;

// used as a channel
static X: spin::Mutex<Option<DWT>> = spin::Mutex::new(None);

#[no_mangle]
fn main_0() -> ! {
    let p: cortex_m::Peripherals = ..;

    // this refers to core #0's DWT
    let dwt = p.DWT;
    *X.lock() = Some(dwt);

    // ..
}

#[no_mangle]
fn main_1() -> ! {
    loop {
        if let Some(x) = X.lock().take() {
            // now this refers to core _#1_'s DWT
            let dwt: DWT = x;
        }
    }

    // ..
}

The other issue with Send is that it makes it possible to break the singleton invariant: one core can send another instance of e.g. DWT to a core that already has one such instance.
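
A sketch of that second problem, here in heterogeneous mode where each image gets its own copy of the take guard so both take calls succeed (X reuses the #[shared] attribute introduced above):

// heterogeneous mode

use cortex_m::peripheral::DWT;

// used as a channel, visible to both cores
#[shared]
static X: spin::Mutex<Option<DWT>> = spin::Mutex::new(None);

#[cfg(core = "0")]
#[no_mangle]
unsafe extern "C" fn main() -> ! {
    // core #0's guard is a per-core copy, so this succeeds
    let p = cortex_m::Peripherals::take().unwrap();
    *X.lock() = Some(p.DWT);

    // ..
}

#[cfg(core = "1")]
#[no_mangle]
unsafe extern "C" fn main() -> ! {
    // core #1's guard is also a per-core copy, so this succeeds too ...
    let p = cortex_m::Peripherals::take().unwrap();
    let dwt_a: DWT = p.DWT;

    // ... and core #1 can additionally receive the instance sent by core #0,
    // ending up with two live `DWT` values on one core
    let dwt_b: DWT = loop {
        if let Some(dwt) = X.lock().take() {
            break dwt;
        }
    };

    // ..
}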

B. take() is unsound

The current implementation of cortex_m::Peripherals::take looks like this:

static mut CORE_PERIPHERALS: bool = false;

impl Peripherals {
    pub fn take() -> Option<Self> {
        interrupt::free(|_| {
            if unsafe { CORE_PERIPHERALS } {
                None
            } else {
                // NOTE `steal` sets `CORE_PERIPHERALS` to `true`
                Some(unsafe { Peripherals::steal() })
            }
        })
    }
}

This is unsound in homogeneous multi-core mode because interrupt::free doesn't synchronize multi-core access to static mut variables; it only synchronizes accesses from the same core.

Using AtomicBool::compare_and_swap or a similar CAS API would make this memory safe across cores, but that would not work on ARMv6-M because atomic CAS operations don't exist on thumbv6m-none-eabi.
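
For reference, a sketch of what a CAS-based guard could look like; it would be memory safe across cores, but it would not compile for thumbv6m-none-eabi precisely because of the missing atomics:

use core::sync::atomic::{AtomicBool, Ordering};

// an atomic guard instead of `static mut CORE_PERIPHERALS: bool`
static CORE_PERIPHERALS: AtomicBool = AtomicBool::new(false);

impl Peripherals {
    pub fn take() -> Option<Self> {
        // this compare-and-swap lowers to LDREX / STREX, which ARMv6-M lacks
        if CORE_PERIPHERALS.compare_and_swap(false, true, Ordering::AcqRel) {
            None
        } else {
            // NOTE `steal` would need to set the atomic flag as well; elided here
            Some(unsafe { Peripherals::steal() })
        }
    }
}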

C. take() is wrong

The following program panics in homogeneous multi-core mode but should work.

#[no_mangle]
unsafe extern "C" fn main_0() -> ! {
    let p = cortex_m::Peripherals::take().unwrap();
    let now = p.DWT.cyccnt.read();

    // ..
}

#[no_mangle]
unsafe extern "C" fn main_1() -> ! {
    let p = cortex_m::Peripherals::take().unwrap();
    let now = p.DWT.cyccnt.read();

    // ..
}

This panics because both Peripherals::take calls access the same guard variable. However, it is OK for each core to access its own DWT peripheral / cycle counter instance.

Potential countermeasures

!Send

To avoid issue (A) we could remove the Send implementation from all the peripheral singletons. The downside of this approach is that it would also forbid sending a peripheral singleton from main or an interrupt handler to another execution context on the same core.
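
For example, the common pattern of moving a peripheral from main into a core-local Mutex so that an interrupt handler on the same core can use it would stop compiling, because cortex_m::interrupt::Mutex<T> is only Sync when T: Send. A sketch:

use core::cell::RefCell;
use cortex_m::interrupt::{self, Mutex};
use cortex_m::peripheral::SYST;

// `Mutex<T>: Sync` requires `T: Send`, so this static would be rejected if
// `SYST` were made `!Send`
static SYSTICK: Mutex<RefCell<Option<SYST>>> = Mutex::new(RefCell::new(None));

fn init(syst: SYST) {
    // move the peripheral out of `main` so the SysTick handler can reach it
    interrupt::free(|cs| {
        SYSTICK.borrow(cs).replace(Some(syst));
    });
}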

LocalSend

Another alternative to avoid (A) is to remove the Send implementation from the singletons and instead implement a new LocalSend trait that means "safe to send between execution contexts running on the same core". Then frameworks like RTFM can require LocalSend for message passing within one core and Send for cross-core message passing.

The downside of this approach is that it requires using the nightly channel because auto traits, which are required to bridge Send and LocalSend, are unstable.

// crate: local-send
pub unsafe auto trait LocalSend {}

// all cross-core Send-safe types are also core-local Send safe
unsafe impl<T> LocalSend for T where T: Send {}

// crate: cortex-m
pub struct DWT {
    _not_send: PhantomData<*mut ()>,
}

impl !Send for DWT {}
unsafe impl LocalSend for DWT {}
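
A framework could then pick the bound per channel. A sketch with made-up RTFM-like spawn functions (spawn_local and spawn_remote are hypothetical names, not an existing API):

// crate: some RTFM-like framework (hypothetical API)

/// Message passing between tasks running on the *same* core
pub fn spawn_local<T: LocalSend>(msg: T) {
    // ..
}

/// Message passing between tasks running on *different* cores
pub fn spawn_remote<T: Send>(msg: T) {
    // ..
}

// crate: app
// `spawn_local(dwt)` compiles because `DWT: LocalSend`
// `spawn_remote(dwt)` is rejected because `DWT: !Send`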

Core-local take

One way to deal with (B) and (C) is to have one guard static variable per core. Assuming a dual-core system, take would be written as follows:

// crate: cortex-m
// for core #0
static mut PERIPHERALS0: bool = false;
// for core #1
static mut PERIPHERALS1: bool = false;

impl Peripherals {
    pub fn take() -> Option<Self> {
        interrupt::free(|_| unsafe {
            let guard = if core_id() == 0 {
                &mut PERIPHERALS0
            } else {
                &mut PERIPHERALS1
            };

            if *guard {
                None
            } else {
                *guard = true;
                Some(Peripherals { .. })
            }
        })
    }
}

fn core_id() -> u8 {
    // returns `0` on core #0
    // returns `1` on core #1
}

// crate: app
#[no_mangle]
unsafe extern "C" fn main_0() -> ! {
    let p = Peripherals::take().unwrap();
    // ..
}

#[no_mangle]
unsafe extern "C" fn main_1() -> ! {
    let p = Peripherals::take().unwrap();
    // ..
}

The problem is implementing core_id in homogeneous mode. AFAICT, there's no Cortex-M memory mapped register that returns a "core id" (cf. RISC-V mhartid); nor is there a processor register that can be used to hold a "core id" (cf. RISC-V registers: x3 (global pointer) and x4 (thread pointer)) -- though one could use the usually unused PSP (Process Stack Pointer) register for this purpose.
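
A sketch of that PSP trick, assuming each core's start-up code stores its core number in PSP before reaching main and that nothing in the program otherwise uses the process stack (pre_main is a made-up hook name):

use cortex_m::register::psp;

// made-up start-up hook, run once per core before its `main_N`
unsafe fn pre_main(core: u32) {
    psp::write(core);
}

fn core_id() -> u8 {
    // read back the value the start-up code stashed in PSP
    psp::read() as u8
}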

Global singletons

A completely different approach to peripheral access that avoids the three aforementioned issues is a global singleton API. For example:

// crate: cortex-m
use bare_metal::CriticalSection;

// A global singleton
pub struct DWT;

// the actual peripheral
pub struct DWT_ {
    _not_send_or_sync: PhantomData<*mut ()>,
    pub cyccnt: CYCCNT,
    // .. other registers ..
}

impl DWT_ {
    // NOTE private
    unsafe fn new() -> Self {
        ..
    }
}

impl DWT {
    /// Grants temporary, synchronized access to the DWT peripheral
    pub fn borrow(cs: &CriticalSection, f: impl FnOnce(&DWT_)) {
        unsafe { f(&DWT_::new()) }
    }

    /// Grants temporary, unsynchronized access to the DWT peripheral
    pub unsafe fn borrow_unchecked(f: impl FnOnce(&DWT_)) {
        f(&DWT_::new())
    }
}

// crate: app
#[no_mangle]
unsafe extern "C" fn main_0() -> ! {
    interrupt::free(|cs| {
        DWT::borrow(cs, |dwt| {
            // accesses its own DWT
            dwt.cyccnt.write(0);
        });
    });

    // ..
}

#[no_mangle]
unsafe extern "C" fn main_1() -> ! {
    interrupt::free(|cs| {
        DWT::borrow(cs, |dwt| {
            // accesses its own DWT
            dwt.cyccnt.write(0);
        });
    });

    // ..
}

The downside of this approach is that, because there's no ownership, it's hard to build abstractions on top of peripherals. One could add a panicky API to seal peripherals:

impl DWT {
    /// Seals this peripheral; all future calls to `borrow` will panic
    ///
    /// This function panics if it's called twice
    pub fn seal(cs: &CriticalSection) {
        // ..
    }
}

Ownership could be emulated by first sealing the peripheral and then having the abstraction access the peripheral exclusively through the borrow_unchecked API.

pub struct Timer {
    // not Send because this semantically owns `SYST_`, which is also not `Send`
    _not_send: PhantomData<*mut ()>,
}

impl Timer {
    pub fn new() -> Self {
        // this operation effectively turns this type into an owned singleton
        interrupt::free(|cs| SYST::seal(cs));

        Self { _not_send: PhantomData }
    }

    pub fn set_timeout(&mut self, dur: Duration) {
        unsafe {
            SYST::borrow_unchecked(|syst| {
                // ..
            })
        }
    }
}
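
Usage then looks like owning a singleton, with the uniqueness invariant enforced at runtime rather than by the compiler (a sketch; Duration is assumed to be core::time::Duration):

use core::time::Duration;

fn example() {
    // seals SYST; a second `Timer::new()` anywhere would panic inside `SYST::seal`
    let mut timer = Timer::new();
    timer.set_timeout(Duration::from_millis(1));
}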

Internal vs external resources

In the case of external resources like the TPIU I think we want to keep the existing owned singleton / take API, with the semantics that only one core can take these peripherals, because more than one core should not access these resources in an unsynchronized fashion.

thejpster commented 5 years ago

AFAICT, there's no Cortex-M memory mapped register that returns a "core id"

Yeah, on the homogeneous dual Cortex-M4 part of my Beagleboard X15 demo, I found the TI example using the Cortex-M4 Peripheral ID 0 register to determine which core was running. See https://github.com/cambridgeconsultants/rust-beagleboardx15-demo/blob/master/bare-metal/ipu-demo/src/main.rs#L435. Unfortunately, it's vendor-defined what they put in there :/ Perhaps it's something we can bring out in the chip crates, rather than something we can solve for all Cortex-M cores.
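
For reference, a sketch of that kind of check; 0xE00F_FFE0 is the standard ROM table Peripheral ID 0 address, but whether it actually distinguishes the cores, and which values a given vendor puts there, are device-specific assumptions:

fn core_id() -> u8 {
    // ROM table Peripheral ID 0; its contents are vendor-defined
    const ROM_PID0: *const u32 = 0xE00F_FFE0 as *const u32;

    let pid0 = unsafe { core::ptr::read_volatile(ROM_PID0) };

    // placeholder mapping: a real chip crate would match on the values the
    // vendor actually programmed for each core
    if pid0 & 0x1 == 0 { 0 } else { 1 }
}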

The 'core local take' would work if the lock variable were in a core-specific address space, like the peripherals themselves, right? So, like a thread-local variable. I think the AM5728 had a block of RAM for each core, each mapped to the same address range, but I don't know if that's a universal thing or just an oddity of this implementation. You'd also need to ensure each core initialised its own memory correctly.

jonas-schievink commented 4 years ago

I think the current implementations of Send are already correct. In standard Rust, Send models transferring data between threads within the same process (which have access to all of its resources), which is akin to moving between interrupt handlers running on the same core, or between cores that share a single set of resources (that is, all memory and peripherals).

Since most (all?) multi-core MCUs have core-local peripherals or memory, transferring objects between the cores is more similar to cross-process communication in standard Rust and should not be modeled with Send but with a custom trait.