mediachain / aleph

א: The mediachain universe manipulation engine

Add `mcclient archive` command to generate archive tarball #185

Closed · yusefnapora closed this 7 years ago

yusefnapora commented 7 years ago

I figured it would be easier to just make the tarball in JS using the tar-stream module rather than writing shell or Python scripts, since we already have the code to extract and request the associated objects.

This adds an `mcclient archive <queryString>` command that writes a gzipped tarball to stdout (or you can give an `--output|-o` flag). The tarball contains a `stmt/<statementId>` entry for each statement and a `data/<objectId>` entry for each data object. The statements are stringified JSON objects, but we could easily switch to protobufs instead.

Happy to tweak the archive format in the morning (multiple statements per entry is probably a good idea).
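A minimal sketch of how such an archive could be assembled with tar-stream (the `stmt.id` field and the pre-fetched `objects` map are assumptions for illustration, not the actual mcclient code):

```js
// sketch only: assumes statements for the query have already been fetched and
// `objects` is a Map of objectId -> Buffer holding the associated data objects.
const fs = require('fs')
const zlib = require('zlib')
const tar = require('tar-stream')

function writeArchive (statements, objects, outputPath) {
  const pack = tar.pack()
  const out = outputPath ? fs.createWriteStream(outputPath) : process.stdout

  // gzipped tarball goes to stdout unless --output / -o was given
  pack.pipe(zlib.createGzip()).pipe(out)

  // one stmt/<statementId> entry per statement, stringified JSON
  for (const stmt of statements) {
    pack.entry({ name: `stmt/${stmt.id}` }, JSON.stringify(stmt))
  }

  // one data/<objectId> entry per associated data object
  for (const [objectId, buf] of objects) {
    pack.entry({ name: `data/${objectId}` }, buf)
  }

  pack.finalize()
}
```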

vyzo commented 7 years ago

I think it might be a mistake to have each statement in its own file. Firstly, it costs space (those ~500 bytes of per-entry tar header are 2x overhead), and it also makes the load process slow, as statements need to be read from the file one by one (which doesn't take advantage of the streaming nature of the import API).

I think the better approach is to batch statements together as JSON, in one (or more) ndjson files under `stmt/`.
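A rough sketch of what batched ndjson entries might look like, assuming the same tar-stream `pack` object as in the sketch above (the batch size and entry naming are arbitrary placeholders):

```js
// sketch: batch statements into NDJSON entries under stmt/ rather than
// writing one tar entry per statement; batch size is arbitrary here.
function addStatementBatches (pack, statements, batchSize = 1000) {
  for (let i = 0; i < statements.length; i += batchSize) {
    const batch = statements.slice(i, i + batchSize)
    const ndjson = batch.map(s => JSON.stringify(s)).join('\n') + '\n'
    pack.entry({ name: `stmt/batch-${i / batchSize}.ndjson` }, ndjson)
  }
}
```

A loader can then stream each entry line by line instead of unpacking thousands of tiny tar entries.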

yusefnapora commented 7 years ago

Also, again, if we're willing to increase the archive size somewhat, the dead simplest thing to ingest at load time would be to use collections of ndjson batches for both statements and objects, where the object ndjson is of the form `{data: "base64-encoded-object"}`. Then we could just feed the ndjson directly into the concat API with curl or mcclient or whatever.

With gzipping, the extra overhead from the JSON might not be too bad, although we'd take a hit from base64, of course.
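For the object side, that would be roughly the following (a sketch; `objects` is assumed to be a Map of id to raw object bytes, and the entry name and ingest URL in the usage note are placeholders, not the real concat routes):

```js
// sketch: encode data objects as NDJSON lines of the form {"data": "<base64>"}
// so an archive entry can be streamed straight into the ingest API at load time.
function objectsToNdjson (objects) {
  const lines = []
  for (const buf of objects.values()) {
    lines.push(JSON.stringify({ data: buf.toString('base64') }))
  }
  return lines.join('\n') + '\n'
}
```

At load time an entry could then be piped more or less verbatim, e.g. `tar -xzOf archive.tgz data/objects.ndjson | curl --data-binary @- <concat ingest URL>`.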

parkan commented 7 years ago

@vyzo

> Firstly, it costs space (those ~500 bytes of per-entry tar header are 2x overhead), and it also makes the load process slow, as statements need to be read from the file one by one (which doesn't take advantage of the streaming nature of the import API).

OK, that's fair, though I wouldn't call this "its own file" -- it's one big file with large-ish record delimiters, not separate files. The advantage is potentially higher recoverability/seekability due to the presence of headers, but since we're treating this as all-or-nothing I can live with big ol' ndjson.

@yusefnapora

> Also, again, if we're willing to increase the archive size somewhat, the dead simplest thing to ingest at load time would be to use collections of ndjson batches for both statements and objects, where the object ndjson is of the form `{data: "base64-encoded-object"}`

Might as well just store a base64 object per line then? We don't really need the extra markup.
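A sketch of what that would reduce to, with one raw base64 line per object and no JSON wrapper to strip at load time:

```js
// sketch: one raw base64-encoded object per line, no JSON framing
const encodeObjects = bufs =>
  bufs.map(b => b.toString('base64')).join('\n') + '\n'

const decodeObjects = text =>
  text.trim().split('\n').map(line => Buffer.from(line, 'base64'))
```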