tryoxiss opened 8 months ago

I just stumbled across this. In your README.md you specify it's inspired by NBT, and have a brief description of the format. Anything more than that, though, I was unable to find. If the format is being designed as you go, that would be a good thing to note. If it's otherwise already complete, a more formal specification would be nice so the ser/de can be checked against it.
Thank you for this issue! There definitely isn't much documentation here other than what you have already found--I designed this a while back and haven't touched this repo in quite some time.
If I remember right, it should be fully compatible with ser/de types now and so could be complete. Since I haven't looked at it for a while, now would be a good time for me to take another look at it with fresh eyes and see if there is anything that should be fundamentally changed about the format. Then I will create a formal spec for it.
If you don't mind me asking, is this a format you might be interested in? And do you have any suggestions or questions about it?
If you don't mind me asking, is this a format you might be interested in? And do you have any suggestions or questions about it?
I certainly might! I was looking at various binary serialisation formats (especially ones implemented in Rust) to see how different people tackle the issues they have (such as length encoding, typing, etc.) because I am working on my own format (mostly because I like designing things like that. Mine was also vaguely inspired by NBT!).
I think I would need to know more about the format to really judge, but this is definitely an interesting project that I may use!
As for suggestions, I would highly suggest having a "format signature" and a version field for future improvements. A "signature" for a format is just a random but constant series of bytes that are always present at the start, to aid in format detection with a high degree of confidence. (For example, PNG starts with decimal 137 80 78 71 13 10 26 10, UTF-8 files often start with the byte-order mark EF BB BF, etc.)
A version field could just be one byte after that, which you count up on post-1.0 changes.
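A minimal sketch of what checking both could look like (the 4-byte signature here is hypothetical, loosely echoing the PNG-style mix of a high-bit byte plus ASCII; the real one would be fixed by the spec):

```rust
use std::io::{self, Read};

// Hypothetical signature and version handling, for illustration only.
const SIGNATURE: [u8; 4] = [0x89, b'm', b'b', b'n'];

fn read_header(r: &mut impl Read) -> io::Result<u8> {
    let mut sig = [0u8; 4];
    r.read_exact(&mut sig)?;
    if sig != SIGNATURE {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not an mbon file"));
    }
    // a single version byte, counted up on post-1.0 changes
    let mut version = [0u8; 1];
    r.read_exact(&mut version)?;
    Ok(version[0])
}
```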
As for questions, I do have a few! The first one is: why are you making a format? No judgement, just curious! Joy of creation, an attempt at a binary standard, for a specific niche like a game you're doing?
You say that a type's id can be more than 1 byte long; how do you know if you need to read more bytes? You also mention it encodes the length; is it using an expanding number format, or do types have a limited maximum length?
Also, looking at the grammar section in the readme, I would suggest you name types more according to Rust (u8, u16, etc.) since it's generally easier to work with, imo (you have long, int, short, etc. in ## Specification). Especially since it's a Rust format, not a Rust adaptation of an existing format, it makes sense to use that scheme.
EDIT: Char shouldn't blindly be a u8, as it often represents a dynamic-length value of exactly one UTF-8 character, which can be more than one byte.
Also, bytes feels redundant, as you can just have an array, which from the little you have seems to be a list of fixed-length homogeneous values -- exactly what bytes is, but more generic.
I feel like I am being a bit too harsh and I really don't mean to come off that way, I just like file formats and want to give some feedback on this neat format you're doing!
I feel like I am being a bit too harsh and I really don't mean to come off that way, I just like file formats and want to give some feedback on this neat format you're doing!
Don't worry about it, I really appreciate the feedback! This is the first file format that I have made and I am definitely still new with this sort of thing and any feedback is very helpful!
As for questions, I do have a few! The first one is: why are you making a format? No judgement, just curious! Joy of creation, an attempt at a binary standard, for a specific niche like a game you're doing?
At some point, I wanted to make a game inspired by Minecraft, but with no substance (imagine all features only coming from mods). The first thing I wanted to do was create a file format that could handle the needs that this game would require. The problem I wanted to solve is "how could I create a single huge world file that doesn't need to be all in memory when running the game? Something where I could easily skip around and load the parts of the file that are being used without needing to parse the whole thing." I later dropped the idea of the game, though I still think it is an interesting concept.
You say that a type's id can be more than 1 byte long; how do you know if you need to read more bytes? You also mention it encodes the length; is it using an expanding number format, or do types have a limited maximum length?
The idea is that a mark always starts with an id. This id determines how the rest of the mark should be read. When the id is `l`, you know the data type will be an i64. When the id is `a`, you know the data type will be an array, but there is more that needs to be known about the array: a nested mark, and a length. The number of bytes that need to be read in a mark is tied to the id itself. You can also get some pretty crazy marks; say you want a 10 item array of 3 item arrays of strings that are each 5 bytes long (parens for visual grouping): `(a (a (s 00 00 00 05) 00 00 00 03) 00 00 00 0a)` is a valid mark.
Every mark must be able to determine how long its data is. For a simple primitive such as an i64, we know it will always be 8 bytes long, whereas a string could be any length, so a length specifier is needed. So the mark `s 00 00 00 0a` is a string with 10 bytes of data (15 bytes in total including the mark). There is a maximum length to data types: 2^32-1, or about 4 billion.
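To make that concrete, here's a rough sketch in Rust of how a reader could size the two marks described above (the ids and the 4-byte big-endian length are taken from this discussion, not from a finished spec):

```rust
/// Given the bytes at the start of a mark, return
/// (mark length, data length), or None if the id is unknown here.
fn data_len(bytes: &[u8]) -> Option<(usize, usize)> {
    match *bytes.first()? {
        b'l' => Some((1, 8)), // i64: the payload is always 8 bytes
        b's' => {
            // string: id byte + 4-byte big-endian length specifier
            let len = u32::from_be_bytes(bytes.get(1..5)?.try_into().ok()?);
            Some((5, len as usize))
        }
        _ => None, // arrays, maps, etc. elided
    }
}
```

So for the mark `s 00 00 00 0a`, this returns `Some((5, 10))`: 15 bytes in total, as above.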
I hope this isn't too convoluted of an explanation; please let me know if anything doesn't make sense.
After looking through the spec, I'm thinking about making some changes to the type system, so that there could be types with differently sized lengths. Instead of `s`, there could be four ids, `24 25 26 27`, for a string.
```
Byte      = { '\x00'..'\xff' }
Size8     = { Byte }
Size16    = { Byte{2} }
Size32    = { Byte{4} }
Size64    = { Byte{8} }

MarkStr8  = { "\x24" ~ Size8 }  // 0x24 | 0b00
MarkStr16 = { "\x25" ~ Size16 } // 0x24 | 0b01
MarkStr32 = { "\x26" ~ Size32 } // 0x24 | 0b10
MarkStr64 = { "\x27" ~ Size64 } // 0x24 | 0b11
```
With a Str8, there can only be up to 255 bytes, while with a Str64, it can have up to 18 quintillion bytes. This would save space for small values (which is negligible, but maybe worth it when you have lots of small values). I can't decide whether this is overkill or not.
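Under this proposal the low two bits of the id select the width of the length field, so a parser could recover it with a small match; a sketch using the ids from the grammar above:

```rust
/// Width in bytes of the length field implied by a string id.
fn str_len_width(id: u8) -> Option<usize> {
    match id {
        0x24 => Some(1), // MarkStr8
        0x25 => Some(2), // MarkStr16
        0x26 => Some(4), // MarkStr32
        0x27 => Some(8), // MarkStr64
        _ => None,
    }
}
```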
Which, I'm not sure what you mean by "expanding number format" -- could you enlighten me?
Char shouldn't blindly be a u8, as it often represents a dynamic-length value of exactly one UTF-8 character, which can be more than one byte.
You're right. I believe that in the code it checks if the char can be represented by a u8, and if not, it uses a u32. I like the idea of a dedicated char type better, though.
Also, bytes feels redundant, as you can just have an array, which from the little you have seems to be a list of fixed-length homogeneous values -- exactly what bytes is, but more generic.
I guess my thinking was that there is some semantic difference where bytes would be treated like a buffer and a byte array would be treated more like an array of individual bytes--if that makes any sense. Looking back now, I see what you mean. It would probably be better not to have multiple ways to represent the same thing, so getting rid of bytes might be better.
I do think that there are certain types that are definitely redundant but make sense, as they should be treated differently, such as maps.
As for suggestions, I would highly suggest having a "format signature" and a version field for future improvements.
Definitely, I really like this idea and plan on implementing it.
I have already started to work on some documentation as well as some changes to the format. I haven't uploaded anything yet, but plan to do so on another branch once I get to a good checkpoint.
One of the bigger things I'm thinking to add is a pointer system, so that you could have arrays of strings without them all needing to be the same length. I feel like this might overcomplicate the format and make it hard to serialize and deserialize files.
Don't worry about it, I really appreciate the feedback! This is the first file format that I have made and I am definitely still new with this sort of thing and any feedback is very helpful!
:3
At some point, I wanted to make a game inspired by Minecraft, but with no substance (imagine all features only coming from mods). The first thing I wanted to do was create a file format that could handle the needs that this game would require. The problem I wanted to solve is "how could I create a single huge world file that doesn't need to be all in memory when running the game? Something where I could easily skip around and load the parts of the file that are being used without needing to parse the whole thing." I later dropped the idea of the game, though I still think it is an interesting concept.
Ooh, so basically a voxel engine? That's cool!
The idea is that a mark always starts with an id. This id determines how the rest of the mark should be read. When the id is `l`, you know the data type will be an i64. When the id is `a`, you know the data type will be an array, but there is more that needs to be known about the array: a nested mark, and a length. The number of bytes that need to be read in a mark is tied to the id itself. You can also get some pretty crazy marks; say you want a 10 item array of 3 item arrays of strings that are each 5 bytes long (parens for visual grouping): `(a (a (s 00 00 00 05) 00 00 00 03) 00 00 00 0a)` is a valid mark. Every mark must be able to determine how long its data is. For a simple primitive such as an i64, we know it will always be 8 bytes long, whereas a string could be any length, so a length specifier is needed. So the mark `s 00 00 00 0a` is a string with 10 bytes of data (15 bytes in total including the mark). There is a maximum length to data types: 2^32-1, or about 4 billion.
Hmmm, ok! I don't think I fully understand still, but that clears it up a lot. So 4GiB is the exact maximum size for a file then.
I hope this isn't too convoluted of an explanation; please let me know if anything doesn't make sense.
A bit, but it's okay; it's difficult to explain binary formats!
After looking through the spec, I'm thinking about making some changes to the type system, so that there could be types with differently sized lengths. Instead of `s`, there could be four ids, `24 25 26 27`, for a string.

```
Byte      = { '\x00'..'\xff' }
Size8     = { Byte }
Size16    = { Byte{2} }
Size32    = { Byte{4} }
Size64    = { Byte{8} }

MarkStr8  = { "\x24" ~ Size8 }  // 0x24 | 0b00
MarkStr16 = { "\x25" ~ Size16 } // 0x24 | 0b01
MarkStr32 = { "\x26" ~ Size32 } // 0x24 | 0b10
MarkStr64 = { "\x27" ~ Size64 } // 0x24 | 0b11
```

With a Str8, there can only be up to 255 bytes, while with a Str64, it can have up to 18 quintillion bytes. This would save space for small values (which is negligible, but maybe worth it when you have lots of small values). I can't decide whether this is overkill or not.
Which, I'm not sure what you mean by "expanding number format" -- could you enlighten me?
An expanding format is a number that dynamically grows in size to accommodate larger numbers. Basically, you have one byte of which the low 7 bits are part of the number, and if the leftmost bit is set you read the next byte as well, least-significant group first (to save on backtracking). So `0 0000101` = length 5, but in `1 0010000 : 0 0000101` the first byte gives 16 (and its lead 1 says to keep reading) and the second byte gives 5 × 128 = 640 (and since it leads with a zero, we don't need to read the next byte), so 656 bytes total. This allows for files/types of theoretically infinite size, using a number that expands to exactly the size needed.
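A sketch of a decoder for the scheme as described (7 value bits per byte, least-significant group first, high bit as the continue flag):

```rust
/// Returns (value, bytes consumed), or None if the input ends
/// while the continue bit is still set.
fn read_expanding_len(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    // cap at 10 bytes: 10 * 7 = 70 bits already covers a full u64
    for (i, &b) in bytes.iter().take(10).enumerate() {
        value |= u64::from(b & 0x7f) << (7 * i); // mask, shift, add
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

Decoding the two example bytes above, `read_expanding_len(&[0x90, 0x05])` gives `Some((656, 2))`.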
It is also definitely overkill if you can have strings longer than the max length of the file.
You're right. I believe that in the code it checks if the char can be represented by a u8, and if not, it uses a u32. I like the idea of a dedicated char type better, though.
A dedicated char type is definitely good!
I guess my thinking was that there is some semantic difference where bytes would be treated like a buffer and a byte array would be treated more like an array of individual bytes--if that makes any sense. Looking back now, I see what you mean. It would probably be better not to have multiple ways to represent the same thing, so getting rid of bytes might be better.
I do think that there are certain types that are definitely redundant but make sense, as they should be treated differently, such as maps.
What is map redundant with?
As for suggestions, I would highly suggest having a "format signature" and a version field for future improvements.
Definitely, I really like this idea and plan on implementing it.
I have already started to work on some documentation as well as some changes to the format. I haven't uploaded anything yet, but plan to do so on another branch once I get to a good checkpoint.
Do keep me updated!
One of the bigger things I'm thinking to add is a pointer system, so that you could have arrays of strings without them all needing to be the same length. I feel like this might overcomplicate the format and make it hard to serialize and deserialize files.
It certainly might, but if they are transparent pointers it wouldn't be more complicated for the user, just the implementer. It is entirely dependent on the format's goals whether pointers would be good or not, but keep in mind that without them you may need to either reserve unneeded data or rewrite large chunks when a minor change is made, due to using only run-length encoding.
For example, making a string one character longer would require rewriting everything after it, unless space is reserved (which may not get used).
Hmmm, ok! I don't think I fully understand still, but that clears it up a lot. So 4GiB is the exact maximum size for a file then.
Sort of; there isn't an explicit size limit for files, just types. This is one of the things that I want to clarify with my new documentation. The file itself is an implicit schema type (in the current docs, an object type). So there may be more than one root item, but the serializer/deserializer needs to know ahead of time what the contents of the file are going to be. If I wanted to save a struct to this file format, say
```rust
struct MyStruct {
    name: String,
    value: u64,
    my_map: HashMap<String, u32>,
}
```
It wouldn't make much sense to store this struct as a map since the keys are constant, so it would have a schema of `Str, U64, Map`, which could look something like `<header> (s 00 00 00 05 h e l l o) (l ff ff ff ff ff ff ff ff) (M 00 00 00 0b ((s 00 00 00 01 k) (i 00 00 00 05)))`.
An expanding format is a number that dynamically grows in size to accommodate larger numbers.
That's so cool! That reminds me of how UTF-8 deals with code points larger than 8 bits. I think this is much better than my idea of having the id determine how many bytes are in a length. Using this system, there wouldn't be any item size limits, and small values wouldn't have wasted space. There is a computational cost since we would need to do some more parsing to get the length, but it feels like an insubstantial cost.
What is map redundant with?
My thinking is that there are a few types that have the exact same mark and content aside from the id. Maps and Lists are redundant to each other in the same way a square is a rectangle, but a rectangle is not a square.
A list has a length and the data is just a sequence of items. A map has a length and the data is also just a sequence of items. There is a difference, but only in interpretation. A map needs to have an even number of items and it is parsed into key-value pairs. Though I think this example is a bit contrived.
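A quick sketch of that interpretation rule, generic over the item type since it's purely structural:

```rust
/// Reinterpret a list's items as a map: valid only when the
/// item count is even, pairing them up as (key, value).
fn as_map<T>(items: Vec<T>) -> Option<Vec<(T, T)>> {
    if items.len() % 2 != 0 {
        return None;
    }
    let mut it = items.into_iter();
    let mut pairs = Vec::with_capacity(it.len() / 2);
    while let (Some(k), Some(v)) = (it.next(), it.next()) {
        pairs.push((k, v));
    }
    Some(pairs)
}
```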
It certainly might, but if they are transparent pointers it wouldn't be more complicated for the user, just the implementer. It is entirely dependent on the format's goals whether pointers would be good or not, but keep in mind that without them you may need to either reserve unneeded data or rewrite large chunks when a minor change is made, due to using only run-length encoding.
My thinking behind pointers is to make it so that you could have dynamically sized items, such as strings, as keys in dicts. Making additions to the file is something that I have had passing thoughts on but never really paid much attention to. Having a more robust pointer system under the hood, along with some sort of pseudo-heap, could be very useful for applications that use one big file and make many changes to it. This would definitely be a feat to implement, but I think it doesn't hurt to have it in the spec for those who need the I/O performance.
As I think about it more, I am imagining having a few engines which could handle mbon depending on your needs. One would just serialize/deserialize the whole thing without much thought, such as when using serde. Another could be a read-only engine which doesn't read the whole thing into memory, just what is requested (possibly having a max cache size). And another would be a more advanced engine which can read and write and try to be more efficient with the reads/writes using the implicit pointers to its advantage.
Which, thank you for your input! I feel like this format is turning into something that is filling a niche I haven't seen before. I don't think I could have ever come up with these ideas without your help.
Sort of; there isn't an explicit size limit for files, just types. This is one of the things that I want to clarify with my new documentation. The file itself is an implicit schema type (in the current docs, an object type). So there may be more than one root item, but the serializer/deserializer needs to know ahead of time what the contents of the file are going to be. If I wanted to save a struct to this file format, say

```rust
struct MyStruct {
    name: String,
    value: u64,
    my_map: HashMap<String, u32>,
}
```
Ahh, most formats use an explicit root type, or at least an explicit length, in the format itself.
It wouldn't make much sense to store this struct as a map since the keys are constant, so it would have a schema of `Str, U64, Map`, which could look something like `<header> (s 00 00 00 05 h e l l o) (l ff ff ff ff ff ff ff ff) (M 00 00 00 0b ((s 00 00 00 01 k) (i 00 00 00 05)))`.
Yeah, it doesn't make sense to store constant data like that, with a few exceptions like format signatures.
That's so cool! That reminds me of how UTF-8 deals with code points larger than 8 bits. I think this is much better than my idea of having the id determine how many bytes are in a length. Using this system, there wouldn't be any item size limits, and small values wouldn't have wasted space. There is a computational cost since we would need to do some more parsing to get the length, but it feels like an insubstantial cost.
Yeah! That's actually what inspired me to come up with it. I haven't tested it for cost, but it would essentially be the cost of a `BIN_AND`, a conditional jump (`JNO`), and an addition per extra byte, which is very marginal given the space savings.
My thinking is that there are a few types that have the exact same mark and content aside from the id. Maps and Lists are redundant to each other in the same way a square is a rectangle, but a rectangle is not a square.
Looking at the docs, lists seem to be more or less analogous to a Rust `Vec`, and maps to a `HashMap<String, ValueEnum>`, so I am not quite sure how they are analogous to each other. I suppose a map could be a list of tuples of `(String, ValueEnum)`, though.

Looking at it again, `map` and `dict` seem to serve the same purpose. One mentions its len, but all types encode run length, right? Looking at a later answer:
A list has a length and the data is just a sequence of items. A map has a length and the data is also just a sequence of items. There is a difference, but only in interpretation. A map needs to have an even number of items and it is parsed into key-value pairs. Though I think this example is a bit contrived.
They serve different enough purposes, so I see no issue with it. It also allows for list-specific and map-specific optimisation.
My thinking behind pointers is to make it so that you could have dynamically sized items, such as strings, as keys in dicts. Making additions to the file is something that I have had passing thoughts on but never really paid much attention to. Having a more robust pointer system under the hood, along with some sort of pseudo-heap, could be very useful for applications that use one big file and make many changes to it. This would definitely be a feat to implement, but I think it doesn't hurt to have it in the spec for those who need the I/O performance.
Yeah, though it is generally best to avoid large files and to separate them out. Pointers could be somewhat simple, actually. One (possibly naive) way of doing it is just an offset in bytes from the start of the file. From there the format is self-describing, so you can backtrack a bit to ensure it's the start of an object.
As I think about it more, I am imagining having a few engines which could handle mbon depending on your needs. One would just serialize/deserialize the whole thing without much thought, such as when using serde. Another could be a read-only engine which doesn't read the whole thing into memory, just what is requested (possibly having a max cache size). And another would be a more advanced engine which can read and write and try to be more efficient with the reads/writes using the implicit pointers to its advantage.
I think it's a fine line between the serde one you mention and the efficiency one, so I think they should probably be the same under the hood with some optimisation-level flag (similar to rustc: "optimize for maximum performance", "optimise a bit for performance and a bit for size", etc.). The read-only one is definitely good though!
Which, thank you for your input! I feel like this format is turning into something that is filling a niche I haven't seen before. I don't think I could have ever come up with these ideas without your help.
No problem! I love file formats and design of all sorts, so it's been fun to bounce ideas off someone else who also likes them :3
Yeah, it doesn't make sense to store constant data like that, with a few exceptions like format signatures.
I'm not sure what you mean by that. Are you agreeing with me that it doesn't make sense to store the strings of a struct, or that the way I have structs stored is a bit funky since we need to know ahead of time what the structure is?
Thinking about it a bit more, I do feel like having a dedicated schema type isn't all that good when it really is just a list under the hood. I've decided to drop schemas/objects in favor of the struct deciding for itself how it should be serialized.
I do plan to keep the root node as an implicit list type where the size of the list is the size of the file. It just starts after the header. There will be a few hidden types that will only ever be visible to the serializer/deserializer, which I'm calling pointers, reserved, and heap.
The idea is that after the root items there could be some extra space that is reserved for pointer values. This could be used in such a way that a more advanced engine could make changes without the need to shift everything over for every change.
```
header
map:
  str("b"): list:
    - str("Hello World")
    - str("bar")
  str("a"): u8(123)
```
`b[0]` might get changed to `"foo"`. Instead of having to move everything around, some padding could be inserted.
```
header
map:
  str("b"): list:
    - str("foo")
    - reserved
    - str("bar")
  str("a"): u8(123)
```
If `b[1]` was to be changed to `"cheese"`, we could instead add data to the end of the file and put a pointer where it should have been.
```
header
map:
  str("b"): list:
    - str("foo")
    - reserved
    - ptr(b1_addr)
    - reserved
  str("a"): u8(123)
heap:
  b1_addr: str("cheese")
```
At any point, the data could be defragmented into a cleaner form. Though with this example, `"cheese"` could probably have fit in the list by using the reserved space--this would of course require a relatively complex engine to make work, but I think it is good to plan for either way. And this sort of I/O optimization shouldn't happen unless more than some number of sectors would need to be altered, such as with a small change at the beginning of a 1GB file.
I suppose [a map] could be a list of tuples of (String, ValueEnum) though.
Right, my thoughts were with the data encoding rather than the rust counterparts.
From there the format is self-describing, so you can backtrack a bit to ensure it's the start of an object.
One of the problems with this format, inherent in its nature, is that it is only forward self-describing. Looking forward through a file, I can find each and every item, but when moving backwards, you can't be sure if you are at the start of an item, in the middle of a recursive mark, or in some random binary data that looks like a mark.
I'm thinking that the engine will have a cache of all known mark locations, which takes care of that problem.
I'm not sure what you mean by that. Are you agreeing with me that it doesn't make sense to store the strings of a struct, or that the way I have structs stored is a bit funky since we need to know ahead of time what the structure is?
Oh, I'm just agreeing that storing the implicit map at the root is not a good idea.
Thinking about it a bit more, I do feel like having a dedicated schema type isn't all that good when it really is just a list under the hood. I've decided to drop schemas/objects in favor of the struct deciding for itself how it should be serialized.
What is the schema type? I agree that object seemed weird and redundant, though, as it was redundant with the already redundant list and map.
I do plan to keep the root node as an implicit list type where the size of the list is the size of the file. It just starts after the header. There will be a few hidden types that will only ever be visible to the serializer/deserializer, which I'm calling pointers, reserved, and heap.
* pointers point to a location in the file
* reserved data is just padding where the contents are ignored
* heap is a sequence of values that don't get read directly (only from pointer references)
Seems good! You would probably want some "END OF MAIN" marker for denoting where the main content ends and where the heap begins. Reserved data could just be all 0 bytes until the next definition. It would know it's reserved space since it's after the type's run length and before the next item is defined.
One suggestion I have is an `is_value_pointer()` method; for normal stuff it would be invisible, but a program to edit the file could show a symbol or something to communicate that the data isn't actually stored there.
The idea is that after the root items there could be some extra space that is reserved for pointer values. This could be used in such a way that a more advanced engine could make changes without the need to shift everything over for every change.
// -- snip --
If `b[1]` was to be changed to `"cheese"`, we could instead add data to the end of the file and put a pointer where it should have been.
Or, even better in this case, you could use the reserved space from the previous value to keep it inline without needing to shuffle anything.
At any point, the data could be defragmented into a cleaner form.
It feels like we are inventing a filesystem at this point lol.
Though with this example, `"cheese"` could probably have fit in the list by using the reserved space--this would of course require a relatively complex engine to make work, but I think it is good to plan for either way.
Ahh, you saw that too. For simple cases like this I think it would be relatively simple.
```rust
if new_value.len() == old_value.len() {
    /* slot it */
} else if new_value.len() > old_value.len() {
    // this is the only addition
    if (new_value.left_pad_size() + old_value.len()) >= new_value.len() {
        /* slot */
        return;
    }
    heap_alloc();
}
```
And this sort of I/O optimization shouldn't happen unless more than some number of sectors would need to be altered, such as with a small change at the beginning of a 1GB file.
If we need to do more searching, yeah, but if it's just looking at the value lengths and the padding it had before, I think it's something that should always be done.
From there the format is self-describing, so you can backtrack a bit to ensure it's the start of an object.
One of the problems with this format, inherent in its nature, is that it is only forward self-describing. Looking forward through a file, I can find each and every item, but when moving backwards, you can't be sure if you are at the start of an item, in the middle of a recursive mark, or in some random binary data that looks like a mark.
I think I meant backtracking, checking if it describes the next item, backtracking more, etc. That is a pretty bad approach, though, looking back at it. What could be done instead is check if it's self-describing, and if not, reject it as invalid.
I'm thinking that the engine will have a cache of all known mark locations, which takes care of that problem.
That's good for frequently used items or things that have already shown up, but it would be a lot to parse an entire file and then need to re-parse it using the cache to check that they line up.
What is the schema type? I agree that object seemed weird and redundant, though, as it was redundant with the already redundant list and map.
Sorry, when I started making the spec, I renamed object to schema. They are the exact same type, just with a different name.
Seems good! You would probably want some "END OF MAIN" marker for denoting where the main content ends and where the heap begins. Reserved data could just be all 0 bytes until the next definition. It would know it's reserved space since it's after the type's run length and before the next item is defined.
I like that idea; it could either be a single mark that separates the main data from the heap, or it could be a single heap item which could contain reserved data and pointer values.
I feel like with reserved data being all 0, if there were a large section of reserved data, we would have to iterate through it to find the next item. I think that the reserved data should be cleared to 0, but it shouldn't have to be. Say there was a large value that got moved to the heap: we would just mark the unused section as reserved so we don't have to spend time resetting the unused data.
One suggestion I have is an `is_value_pointer()` method; for normal stuff it would be invisible, but a program to edit the file could show a symbol or something to communicate that the data isn't actually stored there.
That's good; it could also be done in reverse, where a program could request that an item always be stored as a pointer.
It feels like we are inventing a filesystem at this point lol.
Not only that but also a whole memory management system hahahaha. A few times, I got ideas for managing space that would involve creating a garbage collector of sorts (Reusing pointers for duplicate values, but how to go about freeing them if nothing was pointing to them anymore).
That or we are designing a json-like database system.
If we need to do more searching, yeah, but if it's just looking at the value lengths and the padding it had before, I think it's something that should always be done.
For sure, there has been lots of research about how to more efficiently structure memory. I wonder if any of that would be helpful here. I also wouldn't be surprised if this sort of thing has come up in database design, so that would also be a good place to look.
I think I meant backtracking, checking if it describes the next item, backtracking more, etc. That is a pretty bad approach, though, looking back at it. What could be done instead is check if it's self-describing, and if not, reject it as invalid.
I'm thinking that the engine will have a cache of all known mark locations, which takes care of that problem.
That's good for frequently used items or things that have already shown up, but it would be a lot to parse an entire file and then need to re-parse it using the cache to check that they line up.
I see what you mean. I was thinking that with a cache, it would store only what it has come across rather than parsing the whole thing.
There could be a cache size limit so that if you had memory limitations, the least recently used cached items would get dropped first once the cache reached its limit. If there is no cache limit, then the whole file could eventually get cached into memory. (Say with a large game save file: rather than having a long loading screen getting everything into memory, you would instead load only what you needed; then as you play the game, it would eventually load the entire file into memory and the disk wouldn't need to be touched again.)
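A toy sketch of that eviction policy (all names here are made up; a real engine would likely reach for a purpose-built LRU structure):

```rust
use std::collections::HashMap;

/// Caches mark/data bytes by file offset, dropping the least
/// recently used entry once the limit is reached.
struct MarkCache {
    limit: usize,
    tick: u64,
    entries: HashMap<u64, (u64, Vec<u8>)>, // offset -> (last use, bytes)
}

impl MarkCache {
    fn get(&mut self, offset: u64) -> Option<&[u8]> {
        self.tick += 1;
        let tick = self.tick;
        self.entries.get_mut(&offset).map(|(last_use, bytes)| {
            *last_use = tick;
            &bytes[..]
        })
    }

    fn put(&mut self, offset: u64, bytes: Vec<u8>) {
        if self.entries.len() >= self.limit {
            // evict the entry with the oldest last-use tick
            let oldest = self.entries.iter().min_by_key(|(_, (t, _))| *t).map(|(&k, _)| k);
            if let Some(k) = oldest {
                self.entries.remove(&k);
            }
        }
        self.tick += 1;
        self.entries.insert(offset, (self.tick, bytes));
    }
}
```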
Assuming that there is only one engine allowed to touch the file (a lock system could be used to prevent other processes from touching it), we wouldn't need to compare the cache with the file nearly as often.
Side note: I'm pretty sure that most read/write operations happen at the sector level, where sectors are 512 bytes long. The kernel will cache the sector that was read from to prevent unnecessary reads from disk. I would imagine that the engine, if asked to read a specific item, could read the whole sector that contains that item and cache everything it just read to prevent unnecessary system calls.
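A sketch of that sector-aligned read (512-byte sectors as described here; a real engine would query the device's block size and loop on short reads):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

const SECTOR: u64 = 512;

/// Read the whole sector containing `offset`, returning the sector's
/// start position and its bytes so neighbours can be cached too.
fn read_sector(file: &mut File, offset: u64) -> std::io::Result<(u64, Vec<u8>)> {
    let start = offset - (offset % SECTOR);
    file.seek(SeekFrom::Start(start))?;
    let mut buf = vec![0u8; SECTOR as usize];
    let n = file.read(&mut buf)?;
    buf.truncate(n); // short read near the end of the file
    Ok((start, buf))
}
```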
Sorry, when I started making the spec, I renamed object to schema. They are the exact same type, just with a different name.
Ahh. Where is the spec?
I feel like with reserved data being all 0, if there were a large section of reserved data, we would have to iterate through it to find the next item. I think that the reserved data should be cleared to 0, but it shouldn't have to be. Say there was a large value that got moved to the heap: we would just mark the unused section as reserved so we don't have to spend time resetting the unused data.
That's probably better, yeah; it saves disk r/w.
One suggestion I have is an `is_value_pointer()` method; for normal stuff it would be invisible, but a program to edit the file could show a symbol or something to communicate that the data isn't actually stored there.
That's good; it could also be done in reverse, where a program could request that an item always be stored as a pointer.
Ooh, yeah! So a method to get whether the value is a pointer, and the ability to specifically request that data be heap- or stack-allocated rather than the parser figuring it out on its own.
It feels like we are inventing a filesystem at this point lol.
Not only that but also a whole memory management system hahahaha.
That or we are designing a json-like database system.
It really does. Hey, I'm not complaining, this is great.
A few times, I got ideas for managing space that would involve creating a garbage collector of sorts (Reusing pointers for duplicate values, but how to go about freeing them if nothing was pointing to them anymore).
I feel like garbage collectors may not be the right idea, as collection could be run only when the file is cleaned up (defragmented + garbage collected, in this case). We could also reference count it (store a number representing the number of things that reference it) at the start of each heap object. That would be faster, but take up extra space.
For sure, there has been lots of research about how to more efficiently structure memory. I wonder if any of that would be helpful here. I also wouldn't be surprised if this sort of thing has come up in database design, so that would also be a good place to look.
Oh probably! This feels very much like a database format as you mentioned so there is probably a lot of wisdom there.
I think I meant backtracking, checking if it describes the next item, backtracking more, etc. That is a pretty bad approach, though, looking back at it. What could be done instead is check if it's self-describing, and if not, reject it as invalid.
I'm thinking that the engine will have a cache of all known mark locations, which takes care of that problem.
That's good for frequently used items or things that have already shown up, but it would be a lot to parse an entire file and then need to re-parse it using the cache to check that they line up.
I see what you mean. I was thinking that with a cache, it would store only what it has come across rather than parsing the whole thing.
Ahh! Then yeah, that would definitely be good!
There could be a cache size limit so that if you had memory limitations, the least recently used cached items would get dropped first once the cache reached its limit. If there is no cache limit, then the whole file could eventually get cached into memory. (Say with a large game save file: rather than having a long loading screen getting everything into memory, you would instead load only what you needed; then as you play the game, it would eventually load the entire file into memory and the disk wouldn't need to be touched again.)
There always needs to be a limit; we don't want to consume the user's entire RAM if they make a mistake! If no limit is specified, we could just set it to some reasonable amount, say (user total RAM / 2) or (program preallocated RAM - 0.5GB (for the engine)).
Side note: I'm pretty sure that most read/write operations happen at the sector level, where sectors are 512 bytes long. The kernel will cache the sector that was read from to prevent unnecessary reads from disk. I would imagine that the engine, if asked to read a specific item, could read the whole sector that contains that item and cache everything it just read to prevent unnecessary system calls.
I think it's hardware-dependent whether it does or not. I think the Linux kernel will always request the whole sector, but some other OSes only request what they need and leave it up to the hardware to read the whole sector or not. Either way, it could definitely be a good optimisation. I would want to benchmark it though, and maybe not do it if a lot of operations are already requested.
Ahh. Where is the spec?
I've been working on it, and I just uploaded something under spec. I'm actively working on it and now that I have pushed the first iteration of it, I should be able to make changes pretty frequently.
That's probably better, yeah; it saves disk r/w.
I figured it would actually be good to go both ways. Since a reserved section is a minimum of 3 bytes, if there needed to be padding of less than that, then all `\0` would be good.
I feel like garbage collectors may not be the right idea, as collection could be run only when the file is cleaned up (defragmented + garbage collected, in this case). We could also reference count it (store a number representing the number of things that reference it) at the start of each heap object. That would be faster, but take up extra space.
Definitely. I created a new type called the rc, which is a receiver for pointers. With some other changes that I made, pointers can only point to rc's, and I'm thinking that with a single heap item, pointers can point within the heap space rather than the file, so that if the heap had to move, it could move without needing to update the pointers within the data. (Though that might throw a wrench into things with the main content being allowed to grow into reserved heap space.)
There always needs to be a limit; we don't want to consume the user's entire RAM if they make a mistake! If no limit is specified, we could just set it to some reasonable amount, say (user total RAM / 2) or (program preallocated RAM - 0.5GB (for the engine)).
For sure
I think it's hardware-dependent whether it does or not. I think the Linux kernel will always request the whole sector, but some other OSes only request what they need and leave it up to the hardware to read the whole sector or not. Either way, it could definitely be a good optimisation. I would want to benchmark it though, and maybe not do it if a lot of operations are already requested.
That's getting me thinking about some of the details of how an engine might work. I love the idea of having both an async and a sync implementation so that those who do not want to use async don't have to have tokio as a dependency; maybe_async is a good crate that allows for this sort of thing. All the business logic gets written as if it is async, but when compiled, a proc-macro will build out the real implementation depending on the feature that is selected.
Behind the scenes there could be a thread that handles read/write operations in a queue, and the user would simply make async requests for the operations that need to be done, which would put those requests into the queue and wait for them to complete. Pre-cached requests would be able to skip the queue, but still probably have to wait for a mutex.
I've been working on it, and I just uploaded something under spec. I'm actively working on it and now that I have pushed the first iteration of it, I should be able to make changes pretty frequently.
Okay! I took a look at it and noticed the signature contains "mbn" in ASCII; feels like an opportunity to have part of it say "mbon".
I figured it would actually be good to go both ways. Since a reserved section is a minimum of 3 bytes, if there needed to be padding of less than that, then all `\0` would be good.
Why would it take 3 bytes? Having it be all 0's wouldn't really help, since that could just be part of the value. It would need to know it's after and before the next bit anyway, which it would know because it's after the run length and before the next type marker (we would need to 0 or otherwise modify it if the gap does have a valid type marker).
Definitely. I created a new type called the rc, which is a receiver for pointers. With some other changes that I made, pointers can only point to rc's, and I'm thinking that with a single heap item, pointers can point within the heap space rather than the file, so that if the heap had to move, it could move without needing to update the pointers within the data. (Though that might throw a wrench into things with the main content being allowed to grow into reserved heap space.)
The thing with that, though, is that now we need to know the location of all heap pointer receivers. So whenever there is a pointer we haven't cached yet, we need to read the heap until we find it, which is slow.
Comments on the spec thus far:

* The first thing I note is that you seem to be using a specific format to denote formal grammar, which is good, but you should note which one it is, as there are many.
* After looking through, I noticed that there are 3 char types. I understand the rationale, but since UTF codepoints self-describe their length with the continue bit, I do wonder if it's necessary. It may be good even if it's not, though.
* For strings, you should note that L is the expanding format described above (assuming it is) (actually, same with most types). You should also use a consistent letter for lengths where possible; L is good and already used in most places.
* I would rename `Dict` to `Struct` or similar, as it is a fairly rigid structure.
* A lot of types are split into `Small`, (standard), and `Large` variants. Personally, I think this is a similar thing to numbers, so it should be more like `Enum8`, `Enum16`, `Enum32`, etc., as it is easier to reason about and leaves room for expansion.
* A lot of types are represented by an ASCII letter. Due to the fact that many types are statically sized and many are dynamic, using binary data may be better, with a bit pattern like `S TTTT LLL`. S is 1 if it's statically sized, T is the type, and L is the length (could be reduced to 2). So for example Small Char is 8 bits, static, so its S is `1`. Then let's assign it an id, say `0111`. Now its length: it's 8 bits, or one byte, so let's give it that length, `001`. So the full ID is `1011 1001`. The last two could even just count, so 32 would be size 3, or 2^3, so it's `011`, or 3 (for size 3, or 2^3). Just a suggestion. I think it would add nice semantics to the binary that could be used by parsers while still allowing types to be parsed the same way (i.e. match on the byte value). A sketch of packing such an id byte follows below.
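A minimal sketch of packing and unpacking such an id byte (field positions per the `S TTTT LLL` pattern above; the type number is the made-up `0111` from the example):

```rust
// S TTTT LLL, from the high bit down: a 1-bit static-size flag,
// a 4-bit type id, and a 3-bit size code.
fn pack_id(is_static: bool, type_id: u8, size_code: u8) -> u8 {
    ((is_static as u8) << 7) | ((type_id & 0b1111) << 3) | (size_code & 0b111)
}

fn unpack_id(id: u8) -> (bool, u8, u8) {
    (id & 0b1000_0000 != 0, (id >> 3) & 0b1111, id & 0b111)
}
```

With the example values, `pack_id(true, 0b0111, 0b001)` gives `0b1011_1001`, matching the full ID worked out above.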
I think it's hardware-dependent whether it does or not. I think the Linux kernel will always request the whole sector, but some other OSes only request what they need and leave it up to the hardware to read the whole sector or not. Either way, it could definitely be a good optimisation. I would want to benchmark it though, and maybe not do it if a lot of operations are already requested.
That's getting me thinking about some of the details of how an engine might work. I love the idea of having both an async and a sync implementation so that those who do not want to use async don't have to have tokio as a dependency; maybe_async is a good crate that allows for this sort of thing. All the business logic gets written as if it is async, but when compiled, a proc-macro will build out the real implementation depending on the feature that is selected.
If that's only a build dependency and adds little to no overhead, that would be great! The big worry with that is that often when people don't want to use async it's because of the overhead it comes with, and turning async code into sync code may keep much of that overhead if not done right.
Behind the scenes there could be a thread that handles read/write operations in a queue, and the user would simply make async requests for the operations that need to be done, which would put those requests into the queue and wait for them to complete. Pre-cached requests would be able to skip the queue, but still probably have to wait for a mutex.
I'm too tired to think about this meaningfully right now, but it sounds good.
Okay! I took a look at it and noticed the signature contains "mbn" in ASCII; feels like an opportunity to have part of it say "mbon".
That's fair; I was mostly trying to use the protections that PNG uses.
Why would it take 3 bytes? Having it be all 0's wouldn't really help, since that could just be part of the value. It would need to know it's after and before the next bit anyway, which it would know because it's after the run length and before the next type marker (we would need to 0 or otherwise modify it if the gap does have a valid type marker).
The reserved type has an id and a size indicator, which together are a minimum of 2 bytes. If you were to set the size indicator to 0, the reserved item's total size would be 2 bytes (I was thinking of having a minimum of 1 byte of reserved content, which would make it 3 bytes). If there was a change that reduced the size of an item by 1 byte, we would have to move everything by 1 byte, either to make space for the reserved item or to remove the padding. With an empty item, we could just set that extra padding to `\0` and it would work, since the mark for empty is `\0` and there is no content associated with it.

The idea is to use empty only when necessary; otherwise reserved is preferred.
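A hypothetical helper following that rule (the reserved id and the 1-byte size indicator are placeholders, and the gap is assumed small enough for one byte of size):

```rust
/// Fill a gap left by a shrunken or moved item: gaps too small for a
/// reserved mark become empty marks (\0); anything larger gets a
/// reserved mark covering the rest of the gap.
fn pad_gap(gap: &mut [u8]) {
    const RESERVED_MIN: usize = 3; // id + size indicator + 1 byte of content
    if gap.len() < RESERVED_MIN {
        gap.fill(0); // each \0 is a standalone empty mark
    } else {
        gap[0] = 0x52; // placeholder reserved id
        gap[1] = (gap.len() - 2) as u8; // size of the reserved content
    }
}
```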
The thing with that, though, is that now we need to know the location of all heap pointer receivers. So whenever there is a pointer we haven't cached yet, we need to read the heap until we find it, which is slow.
I don't think we would need to know the location of each receiver when the addresses are relative to the heap, just where the heap is. And with the nature of the format, it should be pretty quick to find the heap.
After thinking about it though, I think there are more benefits to having pointers be relative to the file rather than to the heap.
The first thing I note is that you seem to be using a specific format to denote formal grammar, which is good, but you should note which one it is, as there are many.
Right, I'm using the pest grammar. I will add it to the spec.
After looking through, I noticed that there are 3 char types. I understand the rationale, but since UTF codepoints self-describe their length with the continue bit, I do wonder if it's necessary. It may be good even if it's not, though.
I thought about it, but I feel like it goes against the philosophy of the type system, where I would need to encode the value of the character in the mark itself in order to know how long the character is. So a size indicator would be needed, and at that point you have a string.
For strings, you should note that L is the expanding format described above (assuming it is) (actually, same with most types). You should also use a consistent letter for lengths where possible; L is good and already used in most places.
I should explain in the spec why I use the letters I use: L is for the length in bytes and N is for the number of items. I will change the mark I in array to V to be more consistent with mark names.
I would rename `Dict` to `Struct` or similar, as it is a fairly rigid structure.
That sounds good
A lot of types are split into `Small`, (standard), and `Large` variants. Personally, I think this is a similar thing to numbers, so it should be more like `Enum8`, `Enum16`, `Enum32`, etc., as it is easier to reason about and leaves room for expansion.
Thanks, I was in a different head-space when I thought of this. This is much better than small, normal, and big.
A lot of types are represented by an ASCII letter. Due to the fact that many types are statically sized and many are dynamic, using binary data may be better, with a bit pattern like `S TTTT LLL`. S is 1 if it's statically sized, T is the type, and L is the length (could be reduced to 2). So for example Small Char is 8 bits, static, so its S is `1`. Then let's assign it an id, say `0111`. Now its length: it's 8 bits, or one byte, so let's give it that length, `001`. So the full ID is `1011 1001`. The last two could even just count, so 32 would be size 3, or 2^3, so it's `011`, or 3 (for size 3, or 2^3). Just a suggestion. I think it would add nice semantics to the binary that could be used by parsers while still allowing types to be parsed the same way (i.e. match on the byte value).
I briefly thought about something like this earlier, but decided against it for compatibility (though that doesn't really matter right now) and because having ASCII denote the types is kind of fun. There are getting to be enough types in the spec now that it is a bit difficult to give them ids, and I really like the idea of being able to have a sub-type sort of thing going on. This idea of yours feels very clean and well thought out.
If that's only a build dependency and adds little to no overhead, that would be great! The big worry with that is that often when people don't want to use async it's because of the overhead it comes with, and turning async code into sync code may keep much of that overhead if not done right.
Sorry, I didn't explain it very well. The crate is designed so that you can have wrappers around async-specific logic, letting you keep one set of business logic: if built as async, it uses the async wrappers, and if sync, the sync wrappers.
Say I was doing something with HTTP requests: I could set it up to use reqwest if built with async, but ureq if built with sync. I don't have to maintain two copies of the code, and there is no hidden runtime that is needed or built.
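A rough sketch of that pattern (the function names and the reqwest/ureq pairing follow the example above; treat the exact attribute usage as an assumption to check against maybe_async's docs):

```rust
use std::error::Error;

// Hypothetical wrappers: the same call surface, two backends.
#[maybe_async::async_impl]
async fn http_get(url: &str) -> Result<String, Box<dyn Error>> {
    Ok(reqwest::get(url).await?.text().await?)
}

#[maybe_async::sync_impl]
fn http_get(url: &str) -> Result<String, Box<dyn Error>> {
    Ok(ureq::get(url).call()?.into_string()?)
}

// Business logic written once, as async; compiled to sync (with the
// .awaits stripped) when maybe-async's `is_sync` feature is enabled.
#[maybe_async::maybe_async]
async fn load_spec(url: &str) -> Result<String, Box<dyn Error>> {
    let body = http_get(url).await?;
    Ok(body)
}
```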
I'm too tired to think about this meaningfully right now, but it sounds good.
Don't forget to sleep and take a break when you need one :)
Okay! I took a look at it and noticed the signature contains "mbn" in ASCII; feels like an opportunity to have part of it say "mbon".
That's fair; I was mostly trying to use the protections that PNG uses.
I don't know how much the protections actually help; it's good in theory though, and should be kept.
The reserved type has an id and a size indicator, which together are a minimum of 2 bytes. If you were to set the size indicator to 0, the reserved item's total size would be 2 bytes (I was thinking of having a minimum of 1 byte of reserved content, which would make it 3 bytes). If there was a change that reduced the size of an item by 1 byte, we would have to move everything by 1 byte, either to make space for the reserved item or to remove the padding. With an empty item, we could just set that extra padding to `\0` and it would work, since the mark for empty is `\0` and there is no content associated with it. The idea is to use empty only when necessary; otherwise reserved is preferred.
Okay! Works and makes sense!
The thing with that, though, is that now we need to know the location of all heap pointer receivers. So whenever there is a pointer we haven't cached yet, we need to read the heap until we find it, which is slow.
I don't think we would need to know the location of each receiver when the addresses are relative to the heap, just where the heap is. And with the nature of the format, it should be pretty quick to find the heap.
Oh, if they are relative to the start of the heap, that works, yeah.
After thinking about it though, I think there are more benefits to having pointers be relative to the file rather than to the heap.
There definitely are, such as the ability to have multiple heaps mixed with other data. Either approach works, though the pros and cons should be weighed carefully.
After looking through, I noticed that there are 3 char types. I understand the rationale, but since UTF codepoints self-describe their length with the continue bit, I do wonder if it's necessary. It may be good even if it's not, though.
I thought about it, but I feel like it goes against the philosophy of the type system, where I would need to encode the value of the character in the mark itself in order to know how long the character is. So a size indicator would be needed, and at that point you have a string.
Fair enough. It seemed a bit redundant, but I do think I still partially agree.
For strings, you should note that L is the expanding format described above (assuming it is) (actually, same with most types). You should also use a consistent letter for lengths where possible; L is good and already used in most places.
I should explain in the spec why I use the letters I use: L is for the length in bytes and N is for the number of items. I will change the mark I in array to V to be more consistent with mark names.
Ahh okay, that makes sense!
Thanks, I was in a different head-space when I thought of this. This is much better than small, normal, and big.
Makes sense, two brains will come up with better stuff than one after all.
A lot of types are represented by an ASCII letter. Due to the fact that many types are statically sized and many are dynamic, using binary data may be better, with a bit pattern like `S TTTT LLL`. S is 1 if it's statically sized, T is the type, and L is the length (could be reduced to 2). So for example Small Char is 8 bits, static, so its S is `1`. Then let's assign it an id, say `0111`. Now its length: it's 8 bits, or one byte, so let's give it that length, `001`. So the full ID is `1011 1001`. The last two could even just count, so 32 would be size 3, or 2^3, so it's `011`, or 3 (for size 3, or 2^3). Just a suggestion. I think it would add nice semantics to the binary that could be used by parsers while still allowing types to be parsed the same way (i.e. match on the byte value).
I briefly thought about something like this earlier, but decided against it for compatibility (though that doesn't really matter right now) and because having ASCII denote the types is kind of fun. There are getting to be enough types in the spec now that it is a bit difficult to give them ids, and I really like the idea of being able to have a sub-type sort of thing going on. This idea of yours feels very clean and well thought out.
Yeah, I spent a lot of time figuring out how to do that for my format and thought it would be good here. It makes it nice and easy to assign types :3.
If that's only a build dependency and adds little to no overhead, that would be great! The big worry with that is that often when people don't want to use async it's because of the overhead it comes with, and turning async code into sync code may keep much of that overhead if not done right.
Sorry, I didn't explain it very well. The crate is designed so that you can have wrappers around async-specific logic, letting you keep one set of business logic: if built as async, it uses the async wrappers, and if sync, the sync wrappers.
Say I was doing something with HTTP requests: I could set it up to use reqwest if built with async, but ureq if built with sync. I don't have to maintain two copies of the code, and there is no hidden runtime that is needed or built.
Ahh okay, that's pretty great actually. How did I not know about this???
I'm too tired to think about this meaningfully right now, but it sounds good.
Don't forget to sleep and take a break when you need one :)
Thanks, I needed one :sweat: . Still do, but I felt like replying now anyway.
I haven't replied in a while, sorry :sweat: . Also, reading over this, I made so many dumb spelling mistakes... aaaa!
I don't know how much the protections actually help; it's good in theory though, and should be kept.
True, I think those protections are better for old systems of transport which could change characters thinking it was supposed to be text.
Ahh okay, that's pretty great actually. How did I not know about this???
I just discovered it recently too through theprimeagen's Async Rust Is The Bane Of My Existence | Prime Reacts which reacts to this article The bane of my existence: Supporting both async and sync code in Rust.
I haven't replied in a while, sorry 😓 . Also, reading over this, I made so many dumb spelling mistakes... aaaa!
Don't worry about it! I've needed to step away for a bit as well to give myself a breather.
I'm going off on a bit of a tangent, but in a linguistics class I took once, we talked a lot about descriptive language. Our language is defined by how we use it, not by how it should be used. The whole purpose of language is to be able to convey ideas between people, and if we can do that--even if it's a bit misspelled--then the language is still doing its job.
I'm pretty happy with what is done so far. I feel like there is a bit more that I will need to put into the spec before I could consider it to be done, but we're at the point where I can start working on an implementation and fill in some of the unknowns as I go.
I've started working on code that implements this new specification. Thinking about how I will actually implement the engine is pretty daunting, but I think I can get it to work. I'll probably add a kanban project or something here to better plan out and keep track of the implementation.
True, I think those protections are better for old systems of transport which could change characters thinking it was supposed to be text.
Designing for bad systems probably isn't the best. If a system changes it to be text, then something has gone deeply wrong, as that could easily affect the content. Error correction systems should never affect the contents of the message.
I just discovered it recently too through theprimeagen's Async Rust Is The Bane Of My Existence | Prime Reacts which reacts to this article The bane of my existence: Supporting both async and sync code in Rust.
Ahhh, cool!
I haven't replied in a while, sorry 😓 . Also, reading over this, I made so many dumb spelling mistakes... aaaa!
Don't worry about it! I've needed to step away for a bit as well to give myself a breather.
I'm going off on a bit of a tangent, but in a linguistics class I took once, we talked a lot about descriptive language. Our language is defined by how we use it, not by how it should be used. The whole purpose of language is to be able to convey ideas between people, and if we can do that--even if it's a bit misspelled--then the language is still doing its job.
True, and I agree entirely. I also just worry about coming off like a fool.
I'm pretty happy with what is done so far. I feel like there is a bit more that I will need to put into the spec before I could consider it to be done, but we're at the point where I can start working on an implementation and fill in some of the unknowns as I go.
Yeah! It's looking pretty good so far.
I've started working on code that implements this new specification. Thinking about how I will actually implement the engine is pretty daunting, but I think I can get it to work. I'll probably add a kanban project or something here to better plan out and keep track of the implementation.
Okay! Let me know if there's anything I could do to help out (writing tests, docs, etc.). I'm not the best with low-level stuff or parsers, but I hope I can do something.
Designing for bad systems probably isn't the best. If a system changes it to be text, then something has gone deeply wrong, as that could easily affect the content. Error correction systems should never affect the contents of the message.
Totally agree. If something were to go horribly wrong, which is extremely unlikely, then the signature would become invalid and the file wouldn't be read.
True, and I agree entirely. I also just worry about coming off like a fool.
My writing has never been the best, and I commonly worry that what I write isn't going to make sense or that what I'm trying to convey isn't actually what is being said. It really means a lot to me, what you said in f188e2e about having precision and clarity.
Okay! Let me know if there's anything I could do to help out (writing tests, docs, etc.). I'm not the best with low-level stuff or parsers, but I hope I can do something.
That would be great! I should get my code and plans organized first, but once I'm ready, I will let you know.
I've pushed the code I've written so far to the rewrite branch. It's definitely not in a usable state yet, but has the ideas of how I want to implement the spec. I also uploaded the docs to my website under mbon.benjaminja.info.
I've published a project called rewrite which should help keep track of the progress of the rewrite. I'm still figuring out how I want to organize it though.
I've also invited you to have write access to both the repository and the project.
I'm not that great at planning data structures--I have a tendency to overcomplicate things, so any feedback would be greatly appreciated.
True, and I agree entirely. I also just worry about coming off like a fool.
My writing has never been the best, and I commonly worry that what I write isn't going to make sense or that what I'm trying to convey isn't actually what is being said. It really means a lot to me, what you said in f188e2e about having precision and clarity.
There have definitely been a few times like that, but when asked for clarity you have always explained things well and cleared up misconceptions, which really is what matters.
Okay! Let me know if there's anything I could do to help out (writing tests, docs, etc.). I'm not the best with low-level stuff or parsers, but I hope I can do something.
That would be great! I should get my code and plans organized first, but once I'm ready, I will let you know.
Okay, thanks!