Open johnwikman opened 2 years ago
This has been discussed during meetings, and the idea would probably merge into how we handle external types. Related issue: #586
Though it would still be good to have some way to specify size guarantees in MCore, so that a backend cannot use arbitrary amounts of memory for a value that should be small and of fixed size. To solve this, we could introduce a Blob type which can take a number as a parameter, specifying the number of bits that it should occupy:
external type UInt8
external type Int8
external type UInt16
external type Int16
external type BigInt
ffi ocaml
type UInt8 = Blob[8]
type Int8 = Blob[8]
type UInt16 = Blob[16]
type Int16 = Blob[16]
type BigInt = Blob
The syntax for Blob would then be:
Blob[<bits>]: Fixed-size blob that cannot be resized.
Blob: Arbitrary blob of data to be managed by the backend. Can be resized if necessary.
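To make the distinction between the two forms concrete, here is a minimal OCaml sketch of how a backend might represent them. Everything in it (the `blob` type, `make_fixed`, `make_resizable`, `append`, and rounding the bit count up to whole bytes) is a hypothetical illustration and not part of MCore or the existing OCaml backend.

```ocaml
(* Hypothetical backend-side representation of the two Blob forms. *)
type blob =
  | Fixed of bytes         (* Blob[<bits>]: length is fixed at creation *)
  | Resizable of Buffer.t  (* Blob: the backend may grow it as needed *)

(* Number of whole bytes needed to hold [bits] bits (assumed rounding). *)
let bytes_for_bits (bits : int) : int = (bits + 7) / 8

(* A zero-initialised fixed-size blob, e.g. [make_fixed 16] for UInt16. *)
let make_fixed (bits : int) : blob =
  Fixed (Bytes.make (bytes_for_bits bits) '\000')

(* An empty blob that can grow; 16 is only an initial capacity hint. *)
let make_resizable () : blob = Resizable (Buffer.create 16)

let size_in_bytes (b : blob) : int =
  match b with
  | Fixed bs -> Bytes.length bs
  | Resizable buf -> Buffer.length buf

(* Appending succeeds only for the resizable form. *)
let append (b : blob) (data : string) : (unit, string) result =
  match b with
  | Fixed _ -> Error "fixed-size blob cannot be resized"
  | Resizable buf -> Buffer.add_string buf data; Ok ()
```

With this representation, `size_in_bytes (make_fixed 16)` is 2 for the lifetime of the value, which is the kind of guarantee `Blob[16]` is meant to express, while a plain `Blob` (such as BigInt) can grow.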
Currently, there is no data type in MCore for representing a byte. While it might be fine for some applications to only require String-based I/O, the absence of binary I/O heavily impacts performance in applications where serialization of data is useful, such as when saving/checkpointing weights during machine learning with neural networks.
The proposal is to have a new intrinsic data type Byte that takes up 1 byte per element. I.e. a tensor tensorCreateDense [n] (lam. #byte"0x00") should take up n + O(1) memory, where the O(1) represents constant overhead for the tensor.
Byte literals would be written #byte"0x10" or #b"0x10" (short-hand version). The only restriction would be that the value inside the quotation marks lies between 0 and 255 inclusive; the notation used to write that value is not restricted.
The behavior of a Byte (apart from storage space) would be completely implemented through externals. E.g. there would be no intrinsic addbyte a b. For OCaml, we might implement the following externals:
external int2bytesbe: Int -> [Byte] (be = Big Endian)
external float2bytesbe: Float -> [Byte] (could be IEEE754 format or something else, up to the backend to decide)
external bytesbe2int: [Byte] -> Int
external bytesbe2float: [Byte] -> Float
external readBytes ! : ReadChannel -> Int -> [Byte]
external writeBytes ! : WriteChannel -> [Byte] -> ()
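As a rough sketch of what the OCaml side of these externals could look like, the following uses only the standard library. It rests on several assumptions that the proposal does not fix: [Byte] is modelled as OCaml's `bytes`, Int and Float are both encoded as 8 bytes, floats use their IEEE 754 bit pattern, and ReadChannel/WriteChannel map to `in_channel`/`out_channel`.

```ocaml
(* Sketch only: [Byte] is modelled as OCaml [bytes]; the real backend
   representation is an assumption. Requires OCaml >= 4.08. *)

(* Encode an int as 8 bytes, most significant byte first. *)
let int2bytesbe (n : int) : bytes =
  let b = Bytes.create 8 in
  Bytes.set_int64_be b 0 (Int64.of_int n);
  b

(* Decode the first 8 bytes as a big-endian int.
   Raises Invalid_argument if [b] is shorter than 8 bytes. *)
let bytesbe2int (b : bytes) : int =
  Int64.to_int (Bytes.get_int64_be b 0)

(* Encode a float as its 64-bit IEEE 754 pattern, big endian. *)
let float2bytesbe (x : float) : bytes =
  let b = Bytes.create 8 in
  Bytes.set_int64_be b 0 (Int64.bits_of_float x);
  b

let bytesbe2float (b : bytes) : float =
  Int64.float_of_bits (Bytes.get_int64_be b 0)

(* Read exactly [n] bytes; raises End_of_file if fewer are available. *)
let readBytes (ic : in_channel) (n : int) : bytes =
  let b = Bytes.create n in
  really_input ic b 0 n;
  b

let writeBytes (oc : out_channel) (b : bytes) : unit =
  output_bytes oc b
```

For file I/O the channels would need to be opened in binary mode (`open_in_bin`, `open_out_bin`) so that no newline translation is applied on platforms that do it.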
The consequence of this would be that the behavior of a byte becomes completely defined by the backend used, which I see as favorable since it offloads a lot of the underlying encoding requirements from MCore.
The immediate use case for me is to be able to serialize & deserialize large tensors containing floats. Currently this is not really feasible, since I have to use float2string and string2float every time I do file I/O, and I also have to parse the strings to check for delimiters, etc. In previous attempts at loading tensors from string representations, fully parsing the produced strings would take days. With the Byte type available, however, I could instead use more efficient writeTensor and readTensor functions: