miking-lang / miking

Miking - the meta viking: a meta-language system for creating embedded languages

Proposal: An intrinsic Byte data type #562

Open johnwikman opened 2 years ago

johnwikman commented 2 years ago

Currently, there is no data type in MCore for representing a byte. String-based I/O may be enough for some applications, but the absence of binary I/O heavily impacts performance in applications that need to serialize data, such as saving/checkpointing weights during machine learning with neural networks.

The proposal is to have a new intrinsic data type Byte that takes up exactly 1 byte per element. That is, a tensor created with tensorCreateDense [n] (lam. #byte"0x00") should take up n + O(1) bytes of memory, where the O(1) represents the constant overhead of the tensor itself.
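The size guarantee above can be illustrated outside MCore. The following is a minimal Python sketch, not MCore code: it uses the standard `array` module with type code `"B"` (unsigned byte) as a stand-in for the proposed Byte tensor, and the helper name `make_byte_buffer` is hypothetical.

```python
from array import array


def make_byte_buffer(n: int) -> array:
    """Create n zero-initialized unsigned bytes.

    Stand-in for a Byte tensor of shape [n]; this is an analogy,
    not the proposed MCore intrinsic.
    """
    return array("B", bytes(n))


buf = make_byte_buffer(1024)

# Element storage only; the fixed per-object header is the O(1) part.
payload_bytes = len(buf) * buf.itemsize
```

Here each element occupies one byte of storage, so the payload grows linearly with n while the container overhead stays constant.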

A consequence is that the behavior of a byte would be defined entirely by the backend in use. I see this as favorable, since it offloads many of the underlying encoding requirements from MCore.

My immediate use case is serializing and deserializing large tensors of floats. This is currently not really feasible: every file I/O operation has to go through float2string and string2float, and I also have to parse the strings to check for delimiters, etc. In my previous attempts at loading tensors from string representations, fully parsing the produced strings took days. With a Byte type available, I could instead use more efficient writeTensor and readTensor functions:

-- Writes the rank, then each dimension size, then every element in linear order.
let writeTensor: WriteChannel -> Tensor[Float] -> () = lam ch. lam t.
  writeBytes ch (int2bytesbe (tensorRank t));
  foldl (lam. lam dimsize.
    writeBytes ch (int2bytesbe dimsize)
  ) () (tensorShape t);
  let n = tensorSize t in
  recursive let iterH = lam i.
    if eqi i n then () else (
      writeBytes ch (float2bytesbe (tensorLinearGetExn t i));
      iterH (addi i 1)
    )
  in
  iterH 0

let readTensor: ReadChannel -> Tensor[Float] = lam ch.
  -- Assumes that floats and ints each have the same serialized size regardless
  -- of value (might need to have more expressive externals here...)
  let sizeFloat = length (float2bytesbe 0.0) in
  let sizeInt = length (int2bytesbe 0) in
  let rank = bytesbe2int (readBytes ch sizeInt) in
  -- Reads one dimension size per iteration and returns the accumulated shape.
  recursive let mkshapeH = lam acc. lam i.
    if eqi i rank then
      acc
    else
      mkshapeH (snoc acc (bytesbe2int (readBytes ch sizeInt)))
               (addi i 1)
  in
  let shape = mkshapeH [] 0 in
  let t = tensorCreateDense shape (lam. 0.0) in
  let n = tensorSize t in
  -- Fills the tensor in linear order, matching the order used by writeTensor.
  recursive let fillTensorH = lam i.
    if eqi i n then () else (
      tensorLinearSetExn t i (bytesbe2float (readBytes ch sizeFloat));
      fillTensorH (addi i 1)
    )
  in
  fillTensorH 0;
  t
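
For comparison, the same on-disk layout (rank, then dimension sizes, then elements in linear order) can be sketched in Python with the standard `struct` module. This is only an illustration of the format, not the proposed MCore API. The 64-bit big-endian widths for ints and floats are assumptions, and the names `write_tensor`, `read_tensor`, and `read_exact` are hypothetical.

```python
import io
import struct
from typing import BinaryIO, List, Tuple

INT_FMT = ">q"    # assumption: 64-bit signed integer, big-endian
FLOAT_FMT = ">d"  # assumption: IEEE 754 binary64, big-endian


def write_tensor(ch: BinaryIO, shape: List[int], data: List[float]) -> None:
    """Write rank, each dimension size, then every element in linear order."""
    ch.write(struct.pack(INT_FMT, len(shape)))
    for dim in shape:
        ch.write(struct.pack(INT_FMT, dim))
    for x in data:
        ch.write(struct.pack(FLOAT_FMT, x))


def read_exact(ch: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes or fail, so truncated files are detected."""
    buf = ch.read(n)
    if len(buf) != n:
        raise EOFError("unexpected end of stream")
    return buf


def read_tensor(ch: BinaryIO) -> Tuple[List[int], List[float]]:
    """Inverse of write_tensor: returns (shape, elements in linear order)."""
    int_size = struct.calcsize(INT_FMT)
    float_size = struct.calcsize(FLOAT_FMT)
    (rank,) = struct.unpack(INT_FMT, read_exact(ch, int_size))
    shape = [struct.unpack(INT_FMT, read_exact(ch, int_size))[0]
             for _ in range(rank)]
    n = 1
    for dim in shape:
        n *= dim
    data = [struct.unpack(FLOAT_FMT, read_exact(ch, float_size))[0]
            for _ in range(n)]
    return shape, data


# Round trip through an in-memory channel.
channel = io.BytesIO()
write_tensor(channel, [2, 3], [0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
channel.seek(0)
shape, data = read_tensor(channel)
```

Because every value has a fixed width, the reader needs no delimiter parsing, which is the cost the string-based approach pays on every element.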
johnwikman commented 2 years ago

This has been discussed during meetings, and the idea would probably merge into how we handle external types. Related issue: #586

It would still be good, though, to have some way of specifying size guarantees in MCore, so that a backend cannot use an arbitrary amount of memory for a value that should be small and of fixed size. To solve this, we could introduce a Blob type which can take a number as a parameter, specifying the number of bits that it should occupy:

external type UInt8
external type Int8
external type UInt16
external type Int16
external type BigInt

ffi ocaml
  type UInt8 = Blob[8]
  type Int8 = Blob[8]
  type UInt16 = Blob[16]
  type Int16 = Blob[16]
  type BigInt = Blob
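
To make the intended guarantee concrete, here is a minimal Python sketch of what a fixed-width Blob could enforce. It is an assumption about the semantics, not part of the proposal: the table `BLOB_BITS` and the helper `encode` are hypothetical, and BigInt is left out because an unparameterized Blob has no fixed width.

```python
# Hypothetical table mirroring the ffi block: name -> (bit width, signed?)
BLOB_BITS = {
    "UInt8": (8, False),
    "Int8": (8, True),
    "UInt16": (16, False),
    "Int16": (16, True),
}


def encode(type_name: str, value: int) -> bytes:
    """Encode value in exactly bits/8 bytes (big-endian).

    Raises OverflowError if the value does not fit in the declared width.
    """
    bits, signed = BLOB_BITS[type_name]
    return value.to_bytes(bits // 8, byteorder="big", signed=signed)


one_byte = encode("UInt8", 255)
two_bytes = encode("Int16", -2)
```

The point of the sketch is that the width is part of the type, so a value of type UInt8 can never occupy more than one byte regardless of the backend.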

The syntax for Blob would then be