Guidance for using on a RAM limited micro controller?

udem-dlteam / ribbit

A portable, compact and extensible Scheme implementation that is fully R4RS compliant. This includes closures, I/O, tail calls, first-class continuations and a Read Eval Print Loop (REPL). The R4RS Scheme REPL fits inside 6.5Kb !

BSD 3-Clause "New" or "Revised" License

483 stars 43 forks source link

Guidance for using on a RAM limited micro controller? #26

Open SignSpice opened 2 years ago

SignSpice commented 2 years ago

Thank you for making ribbit! This project is very cool. I am particularly interested in using this for micro controllers.

The one I'm dealing with for my current project at work has a ARM Cortex M0+ processor and 48k RAM (about 38-40k available) and 144kb ROM.

I had seen this:

Ribbit achieves a 4K footprint, but that just means the size of the Ribbit VM plus the size of the compacted compiled code. The uncompacted code is stored in RAM in the form of linked "ribs" (3 cell objects, or 6/12/24 bytes depending on the word size) and this is not very memory efficient compared to a bytecode representation. So it takes on the order of 64K RAM to run the REPL, which is about 1000 lines of Scheme code. So a rule of thumb is about 64 bytes of RAM per line of Scheme code. You can use that ratio to determine how large of a Scheme program will fit on your specific device.

So if I understand correctly, my ideal of having a REPL on the MCU over UART is out of the question for this MCU. But perhaps, if the memory needed for the VM itself is small enough, you could have the encoded instructions compiled on the computer, sent over UART, and run on the MCU?

That way even if you could not have a true REPL, you would still only need to recompile and restart the instructions for the VM, and not recompile/upload/run the entire C program.

feeley commented 2 years ago

Are you going to implement the RVM in ARM assembly or in C? The implementation of the RVM can easily fit in the ROM.

Concerning RAM memory requirements, the repl-min program (Ribbit REPL with min library) requires about 2K ribs to run. This includes all the ribs to represent the program and min library, about 1KLOC. With a smaller program you would need proportionately fewer ribs (for example a program that does not call eval or repl).

So how much space does a rib occupy? A rib is 3 references. A reference can occupy the number of bits to encode the index of a rib, plus one bit (to differentiate a small integer from a rib pointer). So if your heap needs to store 2K ribs, you need 12 bits (1+log2(2K)). This isn't a very convenient size to manipulate, so round this up to 16 bits per reference, and thus a rib occupies 48 bits (6 bytes). Given that the heap needs to hold 2K ribs, the memory requirement for the heap is 6*2K = 12 KB. So if you have 36 KB of available RAM you could have 6K ribs in your heap, more than enough to run repl-min.

However, for the memory usage to be this small you need to use a mark-sweep GC algorithm (for example the one implemented in the POSIX shell RVM). The GC that is used by the C RVM is a 2 semispace stop-and-copy GC which doubles the memory requirement, so you'll only be able to store 3K ribs in a 36 KB heap... still enough to run repl-min, but it is getting tight.

The idea you propose will save very little space because the main issue is the space taken by the rib representation of the RVM code. I don't think it will help much to compile externally and send the code through the UART. On the other hand, getting rid of the incremental compiler (needed by eval and repl) in the min library will save quite a bit of code. This would have to be tested.

Please report back how it went if you try it out!

SignSpice commented 2 years ago

Are you going to implement the RVM in ARM assembly or in C?

In C.

Thanks for the detailed explanation! I'll keep you posted on how it goes.

If it isn't terribly difficult to adapt an existing mark-sweep GC code, (perhaps this one: https://github.com/munificent/mark-sweep/blob/master/main.c ), then I may try that route first, as an actual repl is definitely the preferred approach.

I'd also be curious what it might take to make a variant of Ribbit use no GC in the style of Carp, but I imagine it would be quite a lot and more than I'd currently be able to invest into.

feeley commented 2 years ago

I looked at Carp and it seems to have a GC... so I'm not sure what you mean.

The mark-sweep code you link to is not ideal because it uses a stack to do the marking, so in the worst case it will require a marking stack that is proportional to the size of the heap. A better algorithm is the Deutsch-Schorr-Waite marking algorithm that only needs 2 bits per rib (these are "free" given each reference in a rib is stored in 16 bits and you really only need around 12 bits). That is implemented in the POSIX shell RVM and it makes for a rather concise GC implementation:

_GC()
{
eval _X$_T=$_S _Y$_T=$_P

_H=$((_H+1)) _I=$_H
while [ $_I -gt 0 ]; do eval _MR$_I=0; _I=$((_I-1)); done

_C=$_F _P=0

while true; do
  eval _I=\$_M$_C
  while [ $_I = 0 ]; do
    eval _M$_C=1
    _B=$_C
    eval _C=\$_X$_C _X$_B=$_P
    _P=$_B
    if [ $_C = ${_C#R} ]; then break; fi
    eval _I=\$_M$_C
  done
  while [ $_P != 0 ]; do
    eval _I=\$_M$_P
    if [ $_I = 1 ]; then
      eval _M$_P=2
      eval _B=\$_X$_P _X$_P=$_C
      eval _C=\$_Y$_P _Y$_P=$_B
      if [ $_C != ${_C#R} ]; then break; fi
    fi
    if [ $_I = 2 ]; then
      eval _M$_P=3
      eval _B=\$_Y$_P _Y$_P=$_C
      eval _C=\$_Z$_P _Z$_P=$_B
      if [ $_C != ${_C#R} ]; then break; fi
    fi
    if [ $_I = 3 ]; then
      eval _B=\$_Z$_P _Z$_P=$_C
      _C=$_P _P=$_B
      break
    fi
  done

  if [ $_P = 0 ]; then break; fi
done

_I=1 _P=$_H
while true; do
  eval _B=\$_MR$_I
  while [ $_B != 0 ]; do _I=$((_I+1)); eval _B=\$_MR$_I; done
  eval _B=\$_MR$_H
  while [ $_B = 0 ]; do _H=$((_H-1)); eval _B=\$_MR$_H; done
  if [ $_I -ge $_H ]; then break; fi
  eval _XR$_I=\$_XR$_H _YR$_I=\$_YR$_H _ZR$_I=\$_ZR$_H _MR$_I=1 _MR$_H=0 _XR$_H=$_I
  _I=$((_I+1)) _H=$((_H-1))
done

_I=$_H _L=$((_H*10+5000))

while [ $_I != 0 ]; do
  eval _C=\$_XR$_I
  if [ $_C != ${_C#R} ]; then
    eval _B=\$_M$_C
    if [ $_B = 0 ]; then eval _XR$_I=R\$_X$_C; fi
  fi
  eval _C=\$_YR$_I
  if [ $_C != ${_C#R} ]; then
    eval _B=\$_M$_C
    if [ $_B = 0 ]; then eval _YR$_I=R\$_X$_C; fi
  fi
  eval _C=\$_ZR$_I
  if [ $_C != ${_C#R} ]; then
    eval _B=\$_M$_C
    if [ $_B = 0 ]; then eval _ZR$_I=R\$_X$_C; fi
  fi
  _I=$((_I-1))
done

eval _S=\$_X$_T _P=\$_Y$_T
}

Note that all the uses of eval are to simulate indirection, i.e. eval _B=\$_M$_C is conceptually _B=_M[_C]

SignSpice commented 2 years ago

Carp has a barrow checker instead of GC, similar to Rust's but not as complicated. This post explains it a little more: http://blog.veitheller.de/Borrow_Checking,_The_Carp_Way.html

Ah, good to know. I really appreciate the redirection, I'm diving head first into a lot of new-to-me things at once, so it is really nice to have a clearer picture of what to focus on.

I'm trying to figure out what most of the short variable names are short for. _XR _YR _ZR -> the three segments of a rib?

SignSpice commented 2 years ago

Found an implementation of Schorr-Waite in C: schorr_waite.h:

typedef struct CEL * ptr;
typedef enum {a, p} typ; // atom or pointer
typedef enum {m, u} mrk; // marked or unmarked
typedef struct CEL { ptr P; typ T; mrk M; } cel;

extern const ptr Null;

extern ptr Memory;

unsigned is_left(ptr);

void Schorr_Waite(cel);

#include "schorr_waite.h"

unsigned is_left(ptr pointer)
  { return ((long unsigned)pointer - (long unsigned)memory) / sizeof(cel) % 2; }

void Schorr_Waite(cel Root)
  { ptr C, C_, P, P_;
    *Memory = Root;                      // *Memory <- Root
    P = Null;                            // P <- Null
    for (C = Memory;; )                  // C <- Memory
      if (C->M == u)                     // Cw = u
        if (C->T == a)                   // Cv = a
          C->M = m;                      // Cw = m
        else                             // Cv = p
          { C_ = C->P;                   // C^
            *C = (cel){ P, p, m };       // *C = [P, p, m]
            P = C;                       // P <- C
            C = C_; }                    // C <- C^
      else                               // Cw = m
        if (P != Null)                   // P ≠ Null
          if (is_left(C))                // left?(C)
            C += 1;                      // C <- C + 1
          else                           // right?(C)
            { P_ = P->P;                 // P^
              *P = (cel){ C - 1, p, m }; // *P = [C - 1, p, m]
              C = P;                     // C <- P
              P = P_; }                  // P <- P^
        else                             // P = Null
          break; }                       // stop

Hopefully this is simple to adapt for use with ribbit. I will work on this some more tomorrow morning when my mind is more fresh.

feeley commented 2 years ago

That's an interesting piece of code. It may be harder than you think to adapt because it handles cells containing a single pointer and Ribbit uses cells (ribs) containing 3 pointers. The code may be a good source of coding ideas though.

To simplify the memory management it may be good to allocate the rib slots in 3 different arrays indexed by the rib number. Also add a 4th array that is used for marking, each element is a byte. You only need 2 bits in this byte for marking using Shorr-Waite, so the 6 remaining bits can be used to attach 2 bits of information to each slot of the ribs, so each rib slot would conceptually contain 16+2 = 18 bits. That would allow having fixnums of 17 bits which cover the range -65536 to 65535 (so even unsigned 16 bit integer arithmetic would be possible). Also, you could fit a 32 bit IEEE 754 floating point number in a rib to implement single precision flonums.

SignSpice commented 2 years ago

I think I have got a bit stuck between my rather shallow understanding of C and my vague understanding of how Schorr-Waite works. I probably have quite a bit more study to do before I'll be able to get this working.

Pasting some interesting links: (This is where I found the source above) https://www.tele-task.de/lecture/video/9162/ And the related files: https://drive.google.com/drive/folders/1GB63eQzqiwPnSHV6AoonPBLGcv6X5dRb

SignSpice commented 2 years ago

Thinking "out-loud" a bit here. My initial thought was to implement the ribs my making each rib-slot a cel.

typedef struct RIB * ptr;
typedef enum {a, p} typ; // atom or pointer
typedef enum {m, u} mrk; // marked or unmarked
typedef struct CEL { ptr P; typ T; mrk M; } cel;
typedef struct RIB { cel X ; cel Y; cel Z; typ T; mrk M; } cel;

This would require a bit of reworking the algorithm and may not even work. Also, if only two bits are needed per rib, this would approach uses a full byte.

My understanding of the current algorithm is that each cell holds a single pointer, and car and cdr are defined as whether the current cell is even or odd in memory. (is_left syn. with is_even or is_car)

(I think) A pointer in any rib-slot would always points to a rib as a whole, (or the start of a rib).

Now, maybe if instead of making the definition of a rib an explicit struct, it was calculated similarly to is_left we may just need to modify the cell_is_marked branch a bit.

If cell position is evenly dividable by 3, we must be the last cel of a rib an can backtrack. Otherwise, shift to the next cel over. Maybe as simple as this?:

#include "schorr_waite.h"

int cell_pos (ptr pointer)
  { return ((long unsigned)pointer - (long unsigned)memory) / sizeof(cel) % 3; }

void Schorr_Waite(cel Root)
{ ptr C, C_, P, P_;
    *Memory = Root;                      // *Memory <- Root
    P = Null;                            // P <- Null
    for (C = Memory;; )                  // C <- Memory
      if (C->M == u)                     // Cw = u
        if (C->T == a)                   // Cv = a
          C->M = m;                      // Cw = m
        else                             // Cv = p
          { C_ = C->P;                   // C^
            *C = (cel){ P, p, m };       // *C = [P, p, m]
            P = C;                       // P <- C
            C = C_; }                    // C <- C^
      else                               // Cw = m 
        if (P != Null)                   // P ≠ Null
          if (cell_pos(C) < 0)           // rib_a_or_b?(C)
            C += 1;                      // C <- C + 1
          else                           // rib_c?(C)
            { P_ = P->P;                 // P^
              *P = (cel){ C - 1, p, m }; // *P = [C - 1, p, m]
              C = P;                     // C <- P
              P = P_; }                  // P <- P^
        else                             // P = Null
          break;} // stop

Screen Shot 2022-09-22 at 10 43 45 AM

I have tried a couple different ways to implement this by spitting up each rib-slot into its own array plus one for marking, but I have not yet been able to wrap my head around how that would work.

dzchoi commented 1 year ago

I appreciate the great work and the detailed paper as well. Thank you!

I think, the two bits required for each rib can come for free if we allocate the bits in the Tag of Scheme object rib, or the opcode of RVM instruction rib, because the Tag and the opcode do not occupy all the bits of an entire object. (In this regard, I would like to have the Tag and the opcode sit at the same position. At the first among the three?)

I am new to the the Deutsch-Schorr-Waite marking algorithm, and I am trying to understand the algorithm. In my short understanding, the algorithm seems like just marking the visited nodes (i.e. ribs) in iterative manner, without having to do the recursive DFS, which would possibly overflow the stack space (due to the nature of binary tree that ribs are structured).

So, the algorithm lets us know which ribs to deallocate (in efficient way without blowing up our stack). However, assuming garbage collection is done, I still wonder how we can allocate a new rib from the garbage-collected (hence fragmented) heap memory. How can we distinguish between active ribs and removed ribs?

Can we overwrite the first object of the removed ribs with NULL, which does not represent a "Scheme integer" nor a valid pointer? (In rvm.c, the global variable alloc always points to the next available memory to allocate, but it may no longer be of help, and do we have to linear-search for a vacant rib from the beginning of heap all the time?)

PS. I really like your idea of distinguishing between literal integers and pointers, by having all pointers aligned at even addresses, hence their LSBs are always zero, though the range of all integers are reduced in half from their normal range (0..2^31 instead of 0..2^32 in 32-bit platform). ^^