killercup / static-filez

Build compressed archives for static files and serve them over HTTP
Apache License 2.0

Define the archive format #9

Open killercup opened 5 years ago

killercup commented 5 years ago

Let's define the format of our archives.

Current state

A binary file that is actually just concatenated gzip blobs.

Features:

  1. Entries can be extracted as plain gzip files
  2. Appending new entries is trivial

Prior art

What I learned: GZIP members

While reading the WARC spec I found this interesting section:

As specified in 2.2 of the GZIP specification (see [RFC 1952]), a valid GZIP file consists of any number of GZIP “members”, each independently compressed.

Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files.

External indexes of WARC file content may then be used to record each record’s starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.

I did not know this about gzip! If I'm reading this correctly, it means that we can, in theory, use files compatible with tar (or WARC), with the additional requirement that each file is a new gzip member (so that we can continue to get slices from our index file that point to valid gzip files we can serve).
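The multi-member property is easy to demonstrate with Python's stdlib `gzip` module (used here purely for illustration; static-filez itself is Rust):

```python
import gzip

# Compress two "files" independently; each is a complete gzip member.
a = gzip.compress(b"contents of index.html")
b = gzip.compress(b"contents of main.css")

# Concatenating the members still yields one valid gzip file ...
blob = a + b
assert gzip.decompress(blob) == b"contents of index.html" + b"contents of main.css"

# ... and an index of (offset, length) pairs lets us slice out each
# member as a standalone gzip file we can serve as-is.
index = {"index.html": (0, len(a)), "main.css": (len(a), len(b))}
off, n = index["main.css"]
assert gzip.decompress(blob[off:off + n]) == b"contents of main.css"
```

This is exactly the property WARC exploits: the whole file is valid gzip, and every per-record subrange is too.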

Options

cc @QuietMisdreavus

QuietMisdreavus commented 5 years ago

It's worth noting that rustdoc doesn't just want to append to an archive, but also to update files that already exist in the archive...

For context: When Cargo runs a cargo doc command, it invokes rustdoc multiple times on the same output directory, once for each dependency. This allows it to update a handful of shared files - the search index, the new source files index, the shared CSS/JS/font resources - so that the whole dependency tree can act like a single unit. The important piece here is that we need to be able to read in the existing search index (for example), add in the records for the crate being documented, and save it back into the archive.

If I understand the current format correctly (note: I have not done any actual reading on it), this could be as trivial as removing the file from the current archive, modifying it in memory, then saving it at the end and updating the index appropriately. But if static-filez moves to a format where the files are more interleaved, that will be more difficult. (It sounds like that's not going to happen, but it's worth noting.)

killercup commented 5 years ago

A quick way to "support" this is to just append the overwritten files and have the index point at the last version only.
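That append-only scheme could look roughly like this (a hypothetical Python sketch with made-up names; the real index format is exactly what this issue is deciding):

```python
import gzip

class Archive:
    """Append-only archive: 'updating' a file appends a new gzip member
    and repoints the index, leaving the stale member unreferenced."""

    def __init__(self):
        self.blob = b""
        self.index = {}  # path -> (offset, length)

    def put(self, path, data):
        member = gzip.compress(data)
        # Last write wins: the index only ever points at the newest member.
        self.index[path] = (len(self.blob), len(member))
        self.blob += member

    def get(self, path):
        off, n = self.index[path]
        return gzip.decompress(self.blob[off:off + n])

ar = Archive()
ar.put("search-index.js", b"v1")
ar.put("search-index.js", b"v2")  # rustdoc's second run overwrites
assert ar.get("search-index.js") == b"v2"
```

The cost is dead space from superseded members, which a periodic rewrite/compaction step could reclaim.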


lnicola commented 5 years ago

Why not ZIP? .tgz is a pretty poor format for random access, and would probably require an external index.

killercup commented 5 years ago

Does a zip archive allow us to get files out of it as individual gzip streams so we can send them without extracting and re-compressing?

lnicola commented 5 years ago

The format itself should allow you to get a deflated stream directly out of the archive. You can test this with zip foo.zip foo.txt and zlib-flate -compress < foo.txt > foo.d, then by looking at both with a hex editor. foo.zip will have an extra header and footer, but the compressed data is identical. I don't know if the zip crate allows that, but it sounds like useful functionality that could reasonably be added to it.
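The same experiment can be reproduced with Python's stdlib (again, illustration only): parse the zip local file header to slice out the raw deflate stream, then inflate it with raw zlib, bypassing the zip reader entirely.

```python
import io, struct, zipfile, zlib

payload = b"hello zip " * 100
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("foo.txt", payload)

data = buf.getvalue()
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    info = zf.getinfo("foo.txt")

# Local file header: 30 fixed bytes, then filename and extra field.
fn_len, extra_len = struct.unpack(
    "<HH", data[info.header_offset + 26:info.header_offset + 30])
start = info.header_offset + 30 + fn_len + extra_len
raw = data[start:start + info.compress_size]

# The slice is a bare deflate stream: inflate with raw zlib (wbits=-15).
assert zlib.decompress(raw, -15) == payload
```

Note the raw stream has no zlib/gzip wrapper, so serving it directly would need a header prepended (or `Content-Encoding` handling that accepts raw deflate).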

Wrt. browser support, both Firefox and Chrome send Accept-Encoding: gzip, deflate.

In any case, I don't expect a documentation browser to get thousands of requests per second.

lnicola commented 5 years ago

Yes, it can:

lnicola commented 5 years ago

Another thing to consider is that if you're just browsing the docs on your own computer, you might as well send the files to the browser without compression. And if you want to host your crate's documentation somewhere, static file hosting is probably more accessible than a VPS or something else that can run code.

I'm not sure what other use cases you're thinking of. Being able to serve compressed content might ultimately be a nice feature, but wouldn't really matter.

killercup commented 5 years ago

Interesting. My main concern with this crate is making a very efficient way to store and serve compressed data, and while the motivation is the use with rustdoc, ideally it doesn't end there. So, when we choose a new archive format, I wouldn't want it to have worse performance than the ad-hoc solution we have right now; it should only add compatibility -- either with existing applications or with future versions/features of this crate or rustdoc.


lnicola commented 5 years ago

Fair enough, there's nothing wrong with wanting it to be as fast as possible.

Nemo157 commented 4 years ago

It would be nice to support alternative compression formats; brotli and zstd would both be useful, as they compress HTML better than gzip. Maybe the index could record a global or per-file format, and maybe even support multiple formats so the server can negotiate which one to serve.
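A per-file multi-format index with negotiation could be sketched like this (hypothetical structure and names, nothing here is settled):

```python
# Hypothetical index: each path maps encodings to (offset, length) slices
# in the archive, so the server can pick whichever one the client accepts.
index = {
    "index.html": {
        "br":   (0, 1200),     # brotli member
        "gzip": (1200, 1900),  # gzip member
    },
}

# Server preference order: denser formats first.
PREFERENCE = ["br", "zstd", "gzip"]

def negotiate(path, accept_encoding):
    """Pick the best encoding both sides support, per Accept-Encoding."""
    offered = {e.strip().split(";")[0] for e in accept_encoding.split(",")}
    formats = index[path]
    for enc in PREFERENCE:
        if enc in formats and enc in offered:
            return enc, formats[enc]
    return None, None  # fall back to decompressing server-side

enc, slice_ = negotiate("index.html", "gzip, deflate, br")
assert enc == "br" and slice_ == (0, 1200)
```

Storing the same file in several formats trades archive size for the ability to always serve a pre-compressed slice directly.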