project-chip / rs-matter

Rust implementation of the Matter protocol. Status: Experimental
Apache License 2.0

In-place initialization of the `Matter` object, the buffers and subscriptions #198

Open ivmarkov opened 3 weeks ago

ivmarkov commented 3 weeks ago

(UPDATE: the section "Why is this PR still a draft?" now reflects recent progress.)

What follows below is a "mini-RFC" of sorts, explaining why we need this change, how it is implemented, what the next steps are, and so on.

### What is the problem?

Initialize the `Matter` object, as well as other large objects (like the IM handler's `PooledBuffers`), while avoiding the possibility of stack blow-ups.

### Status Quo

This problem is already solved actually!

The reason we can initialize the `Matter` structure (and the other structures, but from now on I'll talk only about `Matter`, as everything applicable to `Matter` is also valid for the other large non-`Future` structures we have, like `PooledBuffers` and `Subscriptions`) without risking stack memory blow-ups is that it has a `const` constructor. In other words, `Matter::new(...)` can be called from a `const` context.

### How does that help?
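In the simplest case the status quo looks roughly like the sketch below (constructor arguments elided; `ConstStaticCell` is the one from the `static_cell` crate):

```rust
use static_cell::ConstStaticCell;

// Both `Matter::new` and `ConstStaticCell::new` are `const`, so the complete
// object layout is computed at compile time and placed in the `data` section.
static MATTER: ConstStaticCell<Matter> = ConstStaticCell::new(Matter::new(/* ... */));

// At runtime we only ever hand out a `&'static mut Matter` - no stack copy is made.
let matter: &'static mut Matter = MATTER.take();
```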

What the above does is reserve a place for the `Matter` object in the `data` section of the executable; upon program startup, the startup code then initializes the `Matter` structure (and the other structures in the `data` section) with its value. This is possible because `Matter::new` is `const` and so is `ConstStaticCell::new`. Therefore, the object layout of `ConstStaticCell<Matter>` is generated at compile time, saved (in the form of a byte sequence) in the `data` section of the ELF, and then - upon program startup - the whole `data` section is copied from the ELF (or from flash) into RAM by the startup code using a simple `memcpy` (or a `memcpy`-like) routine.

The same trick can also be used to initialize a heap-allocated `Matter` (where an allocator is available) without going through the stack, by storing the `const`-constructed value in a `const` and copying it into the allocation:

```rust
// Constructor arguments elided; the important part is that this is a `const`
// (i.e. the complete `Matter` layout is computed at compile time).
const MATTER: Matter = Matter::new(/* ... */);

let boxed_matter = unsafe {
    let mut boxed_matter: Box<MaybeUninit<Matter>> = Box::new_uninit();

    // This is optimized by the compiler into a `memcpy`, similar to how the `data`
    // section is initialized by the startup code, because we are copying a `const`
    // value into the `*mut Matter` raw pointer.
    // Be careful to inline this code and not hide the `MATTER` const behind a
    // function though, as this optimization might not trigger then.
    boxed_matter.as_mut_ptr().write(MATTER);

    boxed_matter.assume_init()
};

// `boxed_matter` is now initialized, without any stack blow-ups ...
```


### Why are we solving (again) an already solved problem then?

Using `const` initializers comes with one disadvantage: **flash size**. Since the default value of the `Matter` object is _not_ all 0s (and - if you can trust me - it is very, very difficult to come up with a default `Matter` object state which is all 0s and/or `MaybeUninit` (*)), the `Matter` object is placed in the `data` section instead of in `bss`, which means the initial layout of the object has to be burned into flash.

Depending on the size of all buffers, the number of sessions, the number of supported fabrics and so on, the total size of the `Matter` stack is anywhere between 30KB and 70KB.

Now, this is nothing for embedded Linux scenarios, and is probably not that much for Espressif MCUs which have 4MB of flash. But it might matter for other MCUs which generally have less flash.

(*) Coming up with an "all-zeroes" default object state in Rust is very, very difficult.

Stuff which is a useful default and which is "all-zeroes":
- `false` in `bool`
- 0 in `u8`, `u32`, `u16` etc
- `None` in `Option<NonZeroU8>`, `Option<NonZeroU16>`, etc

Stuff which is a useful default and which is NOT "all-zeroes":
- `""` in `&str` (`&str` is a fat pointer and is NOT modeled - like in C++ - with `(nullptr, 0)`, but with `(alignof(&str), 0)` i.e. it ends up having a binary value of `0x0000000100000000` (on 32bit archs)
- `b""` (empty slice) in `&[u8]` - ditto as for `&str`
- `None` for `Option<Foo>` where `Option<Foo>` is not special-cased by the compiler as in e.g. `Option<NonZeroU8>`. Especially with `Option<Foo>` where `Foo` is an enum, the compiler often plays tricks where it models `None` as an extra value in `Foo`'s enum _tag_, and this extra value in the tag is almost never 0 (see the small illustration below)
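A tiny self-contained illustration of the difference, relying only on guarantees the language actually makes (the `NonZeroU8` niche, and the fact that references are never null):

```rust
use core::mem::size_of;
use core::num::NonZeroU8;

fn main() {
    // `Option<NonZeroU8>` is special-cased: `None` occupies the 0 niche, so the
    // whole `Option` is one byte and an all-zeroes pattern is a valid `None`.
    assert_eq!(size_of::<Option<NonZeroU8>>(), size_of::<u8>());

    // An empty `&str` is a (non-null pointer, 0) fat pointer. References are
    // never null, so its byte representation can never be all-zeroes.
    let s: &str = "";
    assert!(!s.as_ptr().is_null());
    assert_eq!(s.len(), 0);
}
```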

### _The_ More Important Reason why we are solving it again

Putting aside flash size concerns, we have an existing, unsolved case of stack blow-ups (and of excessive stack usage at runtime):
_We very often allocate large objects on-stack and then move them into their final destination (which is usually a `heapless::Vec` or an array inside the `Matter` object)._

Cases where we do this:
- `Fabric` (1.7KB) - first allocated on-stack then moved in the vector by `vec.push(fabric)`
- `Session` (~ 1KB currently, can be reduced) - ditto
- `ACL` - ditto
- A lot of other places, like temporary arrays of ~ 1KB inside the Secure Channel implementation

But let's concentrate on `Fabric` and `Session` for now.
In order to avoid their temporary stack allocation (and the subsequent move into the vec via `push`), we need to implement a "placement-new" of sorts for `heapless::Vec`, and initialize the `Fabric` or `Session` in-place, directly in the memory of the `heapless::Vec` instance. Something like:
```rust
vec.push_in_place(|slot: &mut MaybeUninit<Fabric>| {
    let slot_ptr = slot.as_mut_ptr();

    unsafe {
        addr_of_mut!((*slot_ptr).node_id).write(node_id); 
        addr_of_mut!((*slot_ptr).fabric_id).write(fabric_id); 
        addr_of_mut!((*slot_ptr).fabric_idx).write(fabric_idx); 
        // ... and so on
    }
});
```

Two problems with the above:
- `heapless::Vec` has no `push_in_place` method, so we would have to fork it (or maintain our own `Vec` type) to get one.
- Even then, initializing each field through raw pointers requires a lot of `unsafe` boilerplate and does not compose when the fields themselves (e.g. nested `Vec`s) need in-place initialization.

### Can we do better?

Yes. With pinned-init.

(BTW: we'll put aside the _pinned_ aspect of that crate. All our objects are `Unpin` and fortunately, pinned-init supports unpinned objects just fine - contrary to its name - without exposing the user to the notion of pinning at all.)

So what does pinned-init do? Grossly oversimplifying, it allows us to turn:

```rust
vec.push_in_place(|slot: &mut MaybeUninit<Fabric>| {
    let slot_ptr = slot.as_mut_ptr();

    unsafe {
        addr_of_mut!((*slot_ptr).node_id).write(node_id);
        addr_of_mut!((*slot_ptr).fabric_id).write(fabric_id);
        addr_of_mut!((*slot_ptr).fabric_idx).write(fabric_idx);
        // ... and so on
    }
});
```

into:

```rust
vec.push_in_place(Fabric::init(node_id, fabric_id, fabric_idx));
```

where Fabric::init() is almost like Fabric::new() except with a slightly different signature and function body:

```rust
impl Fabric {
    fn init(node_id: u64, fabric_id: u64, fab_idx: NonZeroU8) -> impl Init<Self> {
        init!(Self {
            node_id,
            fabric_id,
            fab_idx,
            // Notice the `<-` syntax here and how we compose our `init` with `Vec`'s `init`!
            icac <- rs_matter::utils::vec::Vec::init(),
        })
    }
}
```

The above syntax - thanks to the `init!` macro provided by pinned-init - allows us to use safe, readable, composable syntax to in-place initialize structures recursively! Or rather, and more correctly, to express an in-place initializer/constructor that does that. Almost like in C++!

The similarity with the regular `new() -> Self` is on purpose, of course.
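And because initializers compose, the pattern scales up: a container can expose its own `init()` that in-place initializes everything it owns. A sketch of the idea (the `FabricMgr` field names below are hypothetical, just to show the shape):

```rust
use pinned_init::{init, Init};

impl FabricMgr {
    fn init() -> impl Init<Self> {
        init!(Self {
            // The whole vector of fabrics is built directly in its final memory
            // location - no `FabricMgr` (or `Vec`) temporary ever lives on the stack.
            fabrics <- rs_matter::utils::vec::Vec::init(),
            changed: false,
        })
    }
}
```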

### Can't we apply the `const fn` trick to `Fabric` and `Session`?

Yes, we can. Assuming we implement `const fn Fabric::new() -> Self` as a parameter-less initial state of the `Fabric` object, we can try the following:

```rust
const INITIAL_FABRIC: Fabric = Fabric::new();

fabric_mgr.fabrics_vec.push(INITIAL_FABRIC)?;
let fabric = fabric_mgr.fabrics_vec.last_mut().unwrap();

// Now modify `fabric` members as we see fit
fabric.node_id = ...
```

BUT: the problem is `fabric_mgr.fabrics_vec.push(INITIAL_FABRIC)?;`. Even though `INITIAL_FABRIC` is a `const`, we have no guarantee that the compiler will (a) inline the call to `fabrics_vec.push(...)` and (b) use a `memcpy` from the `INITIAL_FABRIC` const to initialize the new fabric at the end of the vector. It might or might not do that, depending on the optimization settings.

In contrast, pinned-init always initializes in place, even at opt-level 0.
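To see why, here is a minimal sketch (not the actual rs-matter code) of what a `push_in_place` built on pinned-init's `Init` trait could look like: the initializer is handed a raw pointer to the final slot and writes directly into it, so there is no intermediate stack value for the optimizer to (maybe) elide:

```rust
use core::mem::MaybeUninit;

use pinned_init::Init;

pub struct InPlaceVec<T, const N: usize> {
    buf: [MaybeUninit<T>; N],
    len: usize,
}

impl<T, const N: usize> InPlaceVec<T, N> {
    /// Initialize a new element directly in the vector's own storage.
    pub fn push_in_place<E>(&mut self, init: impl Init<T, E>) -> Result<(), E> {
        assert!(self.len < N, "capacity exceeded");

        let slot: *mut T = self.buf[self.len].as_mut_ptr();

        // SAFETY: `slot` is valid for writes and properly aligned. Per the `Init`
        // contract, on `Ok` the slot is fully initialized (and on `Err` it is left
        // deinitialized), so we only bump `len` on success.
        unsafe { init.__init(slot) }?;

        self.len += 1;
        Ok(())
    }
}
```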

### pinned-init background

This crate (or rather, a copy of it) is used in the RfL (Rust for Linux) project, and is therefore merged in the Linux kernel mainline. So despite the low star count on GitHub, I think the approach of the crate is very solid and credible.

### Changes delta

The changes suggested here are incremental, and most importantly - additive.

Not so ideal stuff:

### Alternatives

#### Option 1: Do nothing

We can continue relying on `const fn` for the initialization of the `Matter` object (and pay with increased flash size). For `Fabric` and `Session`, we can either do the `const` trick described above (and rely on the compiler opt settings to do their job), or we can still fork `heapless::Vec` and introduce `push_in_place`; but without the convenience of `pinned_init::init!` we'll have to use a lot of `unsafe` to in-place initialize the members of the `Vec` (`Fabric` and `Session`, and possibly others in the future).

#### Option 2: `&mut`-borrow the large `heapless::Vec` instances

(This is what I tried originally.)

We can make the big `heapless::Vec` instances used in `Matter` no longer owned by `Matter`, but rather borrowed into `Matter` via `&mut` references.

I.e. `FabricMgr` would become `FabricMgr<'v>`, because it would contain a `&'v mut heapless::Vec<Fabric>` rather than owning that vector (or array) as it does now. Consequently, `Matter<'a>` would become `Matter<'a, 'v>`, `Exchange<'a>` would become `Exchange<'a, 'v>` and so on down the line. Why the existing covariant `'a` lifetime cannot be merged with the new `'v` lifetime is explained below.
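For illustration, the shape of that change would roughly be the following (type parameters, capacities and field names simplified/hypothetical):

```rust
use core::cell::RefCell;

const MAX_FABRICS: usize = 5; // hypothetical capacity

pub struct Fabric {/* ... */}

// The manager would borrow its storage instead of owning it...
pub struct FabricMgr<'v> {
    fabrics: &'v mut heapless::Vec<Fabric, MAX_FABRICS>,
}

// ...and the extra `'v` lifetime then propagates into every type that embeds it.
pub struct Matter<'a, 'v> {
    fabric_mgr: RefCell<FabricMgr<'v>>,
    dev_det: &'a str, // stand-in for the existing `'a`-borrowed data
}
```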

Advantages:

Disadvantages:

### Why is this PR still a draft?

pinned-init currently still uses `core::alloc::AllocError`, which is not in stable Rust yet (and for `no_std` this is the only nightly feature it needs).

It seems the author is open to changing the crate so that it no longer unconditionally depends on `core::alloc::AllocError`. This error type is anyway only used in the `InPlaceInit` trait, which is a decorator of `Arc`, `Rc` and `Box` (i.e. the `alloc` module) that we don't use/need, so perhaps we can just put - in pinned-init - the definition of `InPlaceInit` (and the usage of `core::alloc::AllocError`) behind the `alloc` feature flag?

UPDATE: Problem solved. The relevant PR addressing ^^^ got merged yesterday and we are now using it via the newly released pinned-init V0.0.8.

I also need to thoroughly test the in-place initialization - ideally on an MCU. That will happen next week.

ivmarkov commented 3 weeks ago

@kedars @andy31415 Not ready for merging yet, because the current version of pinned-init it depends on requires nightly, and I still need to test the changes end-to-end. Hopefully we can lift that restriction from pinned-init soon. In the meantime, it would be great if you could look into the "mini-RFC" in the PR description.

The change is completely additive and incremental, but it is still important to understand it and form an opinion on it.

ivmarkov commented 3 weeks ago

Not so important but attached two screenshots:

ivmarkov commented 5 days ago

@kedars @andreilitvin In case you are wondering why there is no activity on this PR:

The reason is that before marking it "ready for merge", I need to prototype the follow-up PR, which would (hopefully) bear the fruits of this one.

In a way, prototyping the subsequent PR would de-risk this one and prove the usefulness of the changes it introduces.

Aside from re-working how Fabrics and ACLs are initialized and built up during commissioning (an exercise primarily aimed at memory savings - particularly stack memory), I'm also working on improving the TLV framework in two major aspects: