I think it might be a mistake to have each statement in its own file. First, it costs space (those 500 bytes are 2x overhead), and it also makes the load process slow, as statements need to be read from file one by one (which doesn't take advantage of the streaming nature of the import API).
I think the better approach is to batch statements together in JSON, in one (or more) ndjson files in stmt/
Also, again if we're willing to increase the archive size somewhat, the dead simplest thing to ingest at load time would be to use collections of ndjson batches for both statements and objects, where the object ndjson is of the form {data: "base64-encoded-object"}. Then we could just feed the ndjson directly into the concat API with curl or mcclient or whatever.
With gzipping, the extra overhead from the JSON might not be too bad, although we'd take a hit from base64, of course.
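To make the proposal concrete, here is a rough sketch (TypeScript, not actual mcclient code) of producing those ndjson lines; the `data` field name is the one suggested above, everything else is illustrative:

```typescript
// Illustrative sketch of the proposed ndjson encodings; not actual mcclient code.

// One JSON-serialized statement per line, many statements per batch file.
function statementsToNdjson (statements: object[]): string {
  return statements.map(s => JSON.stringify(s)).join('\n') + '\n'
}

// One data object per line, payload base64-encoded under a `data` key,
// so a whole batch can be streamed to the import API in one request.
function objectsToNdjson (objects: Buffer[]): string {
  return objects
    .map(o => JSON.stringify({ data: o.toString('base64') }))
    .join('\n') + '\n'
}
```

A batch produced this way could then be piped straight into the import endpoint with curl or mcclient, as described above.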
@vyzo
First, it costs space (those 500 bytes are 2x overhead), and it also makes the load process slow, as statements need to be read from file one by one (which doesn't take advantage of the streaming nature of the import API)
OK, that's fair, though I wouldn't call this "its own file" -- it's a big file with large-ish record delimiters, not separate files. The advantage is potentially higher recoverability/seeking due to the presence of headers, but since we're treating this as all-or-nothing I can live with big ol' ndjson.
@yusefnapora
Also, again if we're willing to increase the archive size somewhat, the dead simplest thing to ingest at load time would be to use collections of ndjson batches for both statements and objects, where the object ndjson is of the form {data: "base64-encoded-object"}
Might as well just store a base64 object per line then? We don't really need the extra markup.
I figured that it would be easier to just make the tarball in JS using the tar-stream module vs writing some shell or python scripts, since we already have the code to extract and request the associated objects.
This adds an `mcclient archive <queryString>` command that will write a gzipped tarball to stdout (or you can give a `--output|-o` flag). In the tarball will be a `stmt/<statementId>` entry for each statement, and a `data/<objectId>` entry for each data object. The statements are stringified JSON objects, but could easily do protobufs instead.
Happy to tweak the archive format in the morning (multiple statements per entry is probably a good idea).
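For reference, a minimal sketch of what a tar-stream-based writer could look like; the entry names follow the `stmt/<statementId>` and `data/<objectId>` layout described above, but the statement/object shapes and the function itself are assumptions, not the actual mcclient implementation:

```typescript
// Minimal sketch of an archive writer using tar-stream; illustrative only.
import { createWriteStream } from 'fs'
import { createGzip } from 'zlib'
import * as tar from 'tar-stream'

function writeArchive (statements: Array<{ id: string }>,
                       objects: Map<string, Buffer>,
                       outPath: string): void {
  const pack = tar.pack()

  // one stmt/<statementId> entry per statement, stringified JSON
  for (const stmt of statements) {
    pack.entry({ name: `stmt/${stmt.id}` }, JSON.stringify(stmt))
  }

  // one data/<objectId> entry per data object, raw bytes
  for (const [objectId, bytes] of objects) {
    pack.entry({ name: `data/${objectId}` }, bytes)
  }

  pack.finalize()

  // gzip the tar stream and write it out (the real command writes to stdout by default)
  pack.pipe(createGzip()).pipe(createWriteStream(outPath))
}
```

Switching to multiple statements per entry would just mean writing one `stmt/` entry per ndjson batch instead of per statement.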