Consider including magic number in bao encoded files

oconnor663 / bao

an implementation of BLAKE3 verified streaming

Consider including magic number in bao encoded files #23

Closed casey closed 5 years ago

casey commented 5 years ago

It might be good to include a fixed, recognizable sequence of bytes at the beginning of bao encodings, for example the four ASCII bytes "BaO!". The weird capitalization is chosen to minimize the chance of a conflict with a text file that starts with the same bytes. Using a 4- or 8-byte magic number with one or more non-ASCII/non-UTF-8 bytes could reduce ambiguity further.

There are a few benefits to including a magic number that I can think of:

Tools like file can help users more easily identify bao encodings
You can change the magic number to indicate a new revision of the format, like starting with BaO1 and changing it to BaO2 (This can also be accomplished with a version byte.)
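
A minimal Python sketch of the versioned magic number idea. The "BaO1"/"BaO2" prefixes are only the hypothetical values from this comment, and `parse_magic` is a made-up helper; the actual Bao encoding has no magic number.

```python
# Hypothetical magic prefix and version bytes; the real Bao encoding has neither.
MAGIC_PREFIX = b"BaO"
KNOWN_VERSIONS = {b"1": 1, b"2": 2}


def parse_magic(data: bytes):
    """Return the format version if data starts with a known magic, else None."""
    if len(data) < 4 or not data.startswith(MAGIC_PREFIX):
        return None
    # The fourth byte selects the revision; unknown revisions yield None.
    return KNOWN_VERSIONS.get(data[3:4])
```

A reader that gets `None` back can refuse the file early instead of misparsing a newer revision.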

Downsides that I can think of:

4 or 8 extra bytes at the beginning of the file
More easily distinguishable from random bytes (although verifying the length and root hash would also distinguish a bao encoded file from random bytes)
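
To illustrate the point about verifying the length, here is a rough Python sketch of a size plausibility check. It rests on assumptions that are not stated in this thread: a combined encoding that starts with an 8-byte little-endian content length, 1024-byte chunks, and 64-byte parent nodes. It does not verify the root hash, which a real check would also need.

```python
import struct

# Assumed layout constants (not stated in this thread): an 8-byte little-endian
# content length header, 1024-byte chunks, and 64-byte parent nodes.
HEADER_SIZE = 8
CHUNK_SIZE = 1024
PARENT_SIZE = 64


def expected_encoded_size(content_len: int) -> int:
    """Size of a combined encoding of content_len bytes, under the assumptions above."""
    # An empty input still counts as one (empty) chunk.
    chunks = max(1, -(-content_len // CHUNK_SIZE))
    # A binary tree with `chunks` leaves has `chunks - 1` parent nodes.
    return HEADER_SIZE + PARENT_SIZE * (chunks - 1) + content_len


def length_is_plausible(data: bytes) -> bool:
    """True if the length claimed in the header matches the total size of data."""
    if len(data) < HEADER_SIZE:
        return False
    (content_len,) = struct.unpack("<Q", data[:HEADER_SIZE])
    return expected_encoded_size(content_len) == len(data)
```

Random bytes will almost never pass this check, which is why a magic number adds little extra distinguishability on top of it.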

oconnor663 commented 5 years ago

I tend to think of Bao encodings kind of like a NaCl secretbox or something like that. It's a very minimal format, intended for other applications to build on top of. It's pretty unlikely that a regular user will get their hands on a .bao file in practice, just like it's unlikely that they'll get their hands on a secretbox or a TLS certificate. Instead these things live inside of databases or network protocols, where it's not the format's job to tag itself.

casey commented 5 years ago

I think that's fair. It might be worth thinking of .bao files as being slightly different from the bao encoding. That is, the files that the bao command line tool produces and consumes could have a magic number in addition to containing bao encoded data.
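
Roughly what I have in mind, as a Python sketch. `FILE_MAGIC`, `wrap`, and `unwrap` are hypothetical names for illustration, not anything the bao tool or library provides.

```python
# Hypothetical 4-byte magic for files written by a command line tool; not part of Bao.
FILE_MAGIC = b"BaO!"


def wrap(encoding: bytes) -> bytes:
    """Prefix raw encoded bytes with the file magic."""
    return FILE_MAGIC + encoding


def unwrap(file_bytes: bytes) -> bytes:
    """Strip the file magic, rejecting input that does not carry it."""
    if not file_bytes.startswith(FILE_MAGIC):
        raise ValueError("missing file magic")
    return file_bytes[len(FILE_MAGIC):]
```

The library would keep operating on the unwrapped bytes only, so the caller has to know which of the two forms it is holding.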

oconnor663 commented 5 years ago

I wouldn't want the library code to be in a position where sometimes it needs to strip a magic number and sometimes it doesn't. I think that would lead to more confusion rather than less, when callers get that wrong.

It sounds like you have a scenario in mind where end users are going to be getting their hands on encoded files. Can you tell me what scenario you're picturing?

casey commented 5 years ago

I've been thinking about a lot of the issues that librarians and other people involved in digital preservation face, and a common problem is simply not being able to identify file formats.

I'm imagining a case where an application, archive, or dataset containing bao encoded files gets into the hands of someone who's tasked with preserving, identifying, or archiving it, and that person isn't able to identify the files based on the data they contain. (For example, with siegfried, a tool for archivists that identifies files based on their contents, using a few databases of file signatures.) If bao encoded files contained a magic number, then identification and extraction would be much easier.
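
As a very simplified Python sketch of signature-based identification: the PNG, PDF, and gzip signatures below are the well-known ones, while the bao entry is the hypothetical magic from earlier in this thread. Real tools use much richer signature databases than a prefix table like this.

```python
# Well-known file signatures, plus one hypothetical entry for bao.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"\x1f\x8b", "gzip data"),
    (b"BaO!", "bao encoding (hypothetical magic)"),
]


def identify(data: bytes) -> str:
    """Return the name of the first signature that data starts with."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"
```

Without the last entry, a bao encoded file falls through to "unknown", which is the situation an archivist would be stuck in.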

The other case I'm thinking of is when an archivist decides to use bao encoded files as a storage format. A magic number would make it more likely that the data could eventually be extracted when needed, if supporting metadata or external databases describing the contents of stored files are lost but the files themselves are intact.

oconnor663 commented 5 years ago

My current thinking is that 1) the encoding itself is obsessed with efficiency and shouldn't add extra bytes, and 2) the bao encode command needs to produce files that are compatible with the library. For those reasons I'm going to close this one.