Closed casey closed 5 years ago
I tend to think of Bao encodings kind of like a NaCl secretbox or something like that. It's a very minimal format, intended for other applications to build on top of. It's pretty unlikely that a regular user will get their hands on a .bao file in practice, just like it's unlikely that they'll get their hands on a secretbox or a TLS certificate. Instead these things live inside of databases or network protocols, where it's not the format's job to tag itself.
I think that's fair. It might be worth thinking of .bao files as being slightly different from the bao encoding, i.e. the files that the bao command line tool produces and consumes have a magic number in addition to containing bao encoded data.
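To make that split concrete, here is a minimal sketch of what a file-level wrapper could look like. It assumes the 4-byte BaO! example from this thread as the magic number; the function names wrap_for_file and unwrap_from_file are hypothetical, and nothing here reflects what the actual bao tool or library does.

```python
# Hypothetical 4-byte magic number, taken from the "BaO!" example in this
# thread. The real bao tool does not write or expect it.
MAGIC = b"BaO!"


def wrap_for_file(encoded: bytes) -> bytes:
    """Prepend the magic number to raw Bao-encoded bytes (file layer only)."""
    return MAGIC + encoded


def unwrap_from_file(data: bytes) -> bytes:
    """Check for the magic number and strip it, returning the raw encoding."""
    if not data.startswith(MAGIC):
        raise ValueError("missing magic number: not a wrapped .bao file")
    return data[len(MAGIC):]
```

In this sketch the raw encoding is left untouched, and only code that reads or writes files on disk would ever call the two helpers.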
I wouldn't want the library code to be in a position where sometimes it needs to strip a magic number and sometimes it doesn't. I think that would lead to more confusion rather than less, when callers get that wrong.
It sounds like you have a scenario in mind where end users are going to be getting their hands on encoded files. Can you tell me what scenario you're picturing?
I've been thinking about a lot of the issues that librarians and other people involved in digital preservation face, and a common problem is simply not being able to identify file formats.
I'm imagining a case where an application, archive, or dataset containing bao encoded files gets into the hands of someone who's tasked with preserving, identifying, or archiving it, and who isn't able to identify the files based on the data they contain. (For example, with a tool like siegfried, which archivists use to identify files based on their contents, using a few databases of file signatures.) If bao encoded files contained a magic number, then identification and extraction would be much easier.
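As a rough illustration of content-based identification, the sketch below matches the leading bytes of a file against a small signature table. The PNG, PDF, and gzip signatures are the well-known ones; the Bao entry is hypothetical and uses the BaO! example from this thread. This is only simple prefix matching, not a description of how siegfried works internally.

```python
from typing import Optional

# Leading-byte signatures. The first three are well-known; the last one is
# hypothetical and only exists if Bao files were to adopt a magic number.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"\x1f\x8b", "gzip data"),
    (b"BaO!", "Bao encoding (hypothetical magic number)"),
]


def identify(data: bytes) -> Optional[str]:
    """Return a format name if the data starts with a known signature."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    # No signature matched: the format cannot be identified from content.
    return None
```

Data with no recognizable prefix falls through to None, which is the situation an archivist would be in with an untagged encoding.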
The other case I'm thinking of is when an archivist decides to use bao encoded files as a storage format. A magic number would make it more likely that data could eventually be extracted when needed, if supporting metadata or external databases describing the contents of stored files are lost but the files themselves are intact.
My current thinking is that 1) the encoding itself is obsessed with efficiency and shouldn't add extra bytes, and 2) the bao encode command needs to produce files that are compatible with the library. For those reasons I'm going to close this one.
It might be good to include a fixed, recognizable sequence of bytes at the beginning of bao encodings. For example, the ASCII bytes BaO!
The weird capitalization is chosen to minimize the possibility of conflicts with a text file starting with the same bytes. Using a 4 or 8 byte magic number with one or more non-ASCII/non-UTF-8 bytes could further reduce ambiguity.
There are a few benefits to including a magic number that I can think of:
- file can help users more easily identify bao encodings
- [...] BaO1 and changing it to BaO2 (This can also be accomplished with a version byte.)
Downsides that I can think of: