mediachain / concat


Shared datastore #33

Closed · parkan closed this issue 7 years ago

parkan commented 7 years ago

I've been thinking about how best to create interop between the components in our stack:

Can we use a shared memory-mapped db like LMDB? Imagine a shared "object store" (in IPFS parlance): a KV store that fulfills the go-datastore interface and contains the bare statements, while IPFS, concat, and the indexer each keep their own meta-metadata (provider records, statement envelopes, vectors) referring to them. This is potentially not entirely compatible with reconciliation/bag-of-statements/wire format, and I need to think more about that, but overall I think this is a worthwhile question.
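A rough sketch of what such a shared object store could look like, assuming the github.com/bmatsuo/lmdb-go bindings (LMDB supports multi-process access with a single serialized writer, which is what makes the sharing plausible). The go-datastore interface has changed across versions, so this mirrors its Get/Put shape rather than any specific release:

```go
package objectstore

import "github.com/bmatsuo/lmdb-go/lmdb"

// Store is a shared, memory-mapped KV for bare statements. IPFS, concat,
// and the indexer would each keep their own meta-metadata elsewhere,
// referring to objects stored here by key.
type Store struct {
	env *lmdb.Env
	dbi lmdb.DBI
}

func Open(path string) (*Store, error) {
	env, err := lmdb.NewEnv()
	if err != nil {
		return nil, err
	}
	// 64 GiB map size; tune per deployment.
	if err := env.SetMapSize(1 << 36); err != nil {
		return nil, err
	}
	if err := env.Open(path, 0, 0644); err != nil {
		return nil, err
	}
	s := &Store{env: env}
	err = env.Update(func(txn *lmdb.Txn) (err error) {
		s.dbi, err = txn.OpenRoot(0)
		return err
	})
	if err != nil {
		env.Close()
		return nil, err
	}
	return s, nil
}

// Put stores a bare statement (or metadata object) under its key.
func (s *Store) Put(key, value []byte) error {
	return s.env.Update(func(txn *lmdb.Txn) error {
		return txn.Put(s.dbi, key, value, 0)
	})
}

// Get retrieves an object. The returned slice from LMDB is only valid
// inside the transaction, so we copy it out of the mmap.
func (s *Store) Get(key []byte) ([]byte, error) {
	var out []byte
	err := s.env.View(func(txn *lmdb.Txn) error {
		v, err := txn.Get(s.dbi, key)
		if err != nil {
			return err
		}
		out = append([]byte(nil), v...)
		return nil
	})
	return out, err
}
```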

parkan commented 7 years ago

@vyzo let's discuss this tomorrow. I think this is a relatively important connecting piece, but I'm willing to hear counterarguments.

vyzo commented 7 years ago

@parkan batch reading is already supported via a query, so there's no need for a separate read interface. We can easily add a batch write interface.

vyzo commented 7 years ago

Also, I don't think we should store bare statements outside the db; there is no need for that. We will need a KV store for the actual metadata objects, though.

vyzo commented 7 years ago

Nonetheless, the database backing the concat node can be shared when using an RDBMS server.

vyzo commented 7 years ago

#34 adds some better facilities for batch reads/writes to the db: batch writes through the /publish endpoint, plus counter and ordering support in MCQL.
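A hypothetical sketch of batching against a local node, assuming the default control port (9002) and that /publish accepts several newline-delimited statement bodies per namespace; the actual wire format is defined by concat's API, so treat this as a shape, not a spec. Batch reads go through a single MCQL query rather than a separate interface:

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"net/http"
)

const node = "http://localhost:9002"

// publishBatch POSTs several statement bodies in one request instead of
// one round trip per statement.
func publishBatch(namespace string, bodies []string) error {
	var buf bytes.Buffer
	for _, b := range bodies {
		buf.WriteString(b)
		buf.WriteByte('\n')
	}
	resp, err := http.Post(node+"/publish/"+namespace, "application/json", &buf)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("publish failed: %s", resp.Status)
	}
	return nil
}

// queryAll reads a whole namespace in one shot via MCQL.
func queryAll(namespace string) (string, error) {
	q := fmt.Sprintf("SELECT * FROM %s", namespace)
	resp, err := http.Post(node+"/query", "text/plain", bytes.NewBufferString(q))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	out, err := ioutil.ReadAll(resp.Body)
	return string(out), err
}
```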

parkan commented 7 years ago

OK, we should discuss the relationship between statements and actual metadata then; it sounds like the latter is what needs to be shared with the indexer.

parkan commented 7 years ago

OK, given that we've decided to move forward with RocksDB, we can't have multiple writers, so this approach is not possible. (RocksDB allows only one process to open the db read-write; other processes are limited to read-only views.)
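A minimal illustration of the single-writer constraint, assuming the github.com/tecbot/gorocksdb bindings (concat's own bindings may differ; this just shows the access modes). A second process cannot open the db read-write while concat holds the lock, but a read-only open can work alongside the writer, seeing a static view that may go stale or fail if compaction removes files:

```go
package main

import "github.com/tecbot/gorocksdb"

func main() {
	opts := gorocksdb.NewDefaultOptions()

	// This would fail with a lock error while the concat node (the sole
	// writer) has the db open:
	//   db, err := gorocksdb.OpenDb(opts, "/path/to/concat.rocks")

	// A read-only open is permitted alongside the writer:
	db, err := gorocksdb.OpenDbForReadOnly(opts, "/path/to/concat.rocks", false)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	ro := gorocksdb.NewDefaultReadOptions()
	defer ro.Destroy()
	val, err := db.Get(ro, []byte("some-statement-id"))
	if err == nil {
		defer val.Free()
		_ = val.Data() // copy before Free if the bytes must outlive the Slice
	}
}
```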

vyzo commented 7 years ago

Perhaps we could have a datastore implementation that uses the IPFS datastore through the API, but that's unworkable for large datasets.

The symmetric approach is for concat to act as a datastore for IPFS through the API, with hash-based retrieval. We can't implement the whole crazy go-datastore interface, however; then again, does IPFS really use all that crazy stuff with hierarchical keys and queries?
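A sketch of that symmetric direction: wrapping concat's hash-based retrieval behind the small, flat subset of a datastore surface that block-oriented consumers actually exercise, leaving out hierarchical keys and rich queries. The /data/get path and key scheme here are assumptions, not concat's documented API:

```go
package shim

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

// FlatStore is the flat, hash-keyed subset of the go-datastore surface
// that block storage needs: retrieval by content hash, nothing more.
type FlatStore interface {
	Get(hash string) ([]byte, error)
	Has(hash string) (bool, error)
}

// ConcatStore satisfies FlatStore over concat's HTTP API.
type ConcatStore struct {
	Base string // e.g. "http://localhost:9002" (hypothetical)
}

func (c ConcatStore) Get(hash string) ([]byte, error) {
	resp, err := http.Get(c.Base + "/data/get/" + hash)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("get %s: %s", hash, resp.Status)
	}
	return ioutil.ReadAll(resp.Body)
}

func (c ConcatStore) Has(hash string) (bool, error) {
	resp, err := http.Head(c.Base + "/data/get/" + hash)
	if err != nil {
		return false, err
	}
	resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}
```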

Also of note: we can temporarily shut down the concat node and load RocksDB directly for bulk ingestion. RocksDB has a special config optimized for bulk writes, which disables compaction and stores everything as level 0. When you are done, you manually run compaction and reopen the db for general-purpose access.
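A sketch of that offline flow, again assuming the github.com/tecbot/gorocksdb bindings: stop the node, open the db with RocksDB's bulk-load profile, write in large batches, compact manually, then hand the db back to concat:

```go
package main

import "github.com/tecbot/gorocksdb"

func bulkLoad(path string, records map[string][]byte) error {
	opts := gorocksdb.NewDefaultOptions()
	// Disables auto-compaction and tunes for raw write throughput;
	// writes accumulate as level-0 files.
	opts.PrepareForBulkLoad()
	db, err := gorocksdb.OpenDb(opts, path)
	if err != nil {
		return err
	}
	defer db.Close()

	wo := gorocksdb.NewDefaultWriteOptions()
	defer wo.Destroy()

	// Flush in batches of ~10k entries rather than per-key writes.
	wb := gorocksdb.NewWriteBatch()
	defer wb.Destroy()
	for k, v := range records {
		wb.Put([]byte(k), v)
		if wb.Count() >= 10000 {
			if err := db.Write(wo, wb); err != nil {
				return err
			}
			wb.Clear()
		}
	}
	if err := db.Write(wo, wb); err != nil {
		return err
	}

	// Manual compaction over the full key range before reopening the
	// db for general-purpose access.
	db.CompactRange(gorocksdb.Range{Start: nil, Limit: nil})
	return nil
}
```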

parkan commented 7 years ago

Punting on this; we may in fact not do it directly at all. Current thinking is potentially focused on creating and ingesting SST files instead: https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files
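A rough sketch of that SST-file route, assuming the github.com/tecbot/gorocksdb bindings expose SSTFileWriter and IngestExternalFile (names may differ across binding versions): write sorted key/value pairs into standalone .sst files offline, then ingest them into the live db without going through the memtable or WAL:

```go
package main

import "github.com/tecbot/gorocksdb"

// buildAndIngest writes pre-sorted key/value pairs to an SST file and
// pulls it into the open db. "sorted" must be in ascending key order;
// the writer rejects out-of-order keys.
func buildAndIngest(db *gorocksdb.DB, sstPath string, sorted [][2][]byte) error {
	opts := gorocksdb.NewDefaultOptions()
	envOpts := gorocksdb.NewDefaultEnvOptions()

	w := gorocksdb.NewSSTFileWriter(envOpts, opts)
	defer w.Destroy()
	if err := w.Open(sstPath); err != nil {
		return err
	}
	for _, kv := range sorted {
		if err := w.Add(kv[0], kv[1]); err != nil {
			return err
		}
	}
	if err := w.Finish(); err != nil {
		return err
	}

	// Atomically ingest the finished file into the db.
	io := gorocksdb.NewDefaultIngestExternalFileOptions()
	defer io.Destroy()
	return db.IngestExternalFile([]string{sstPath}, io)
}
```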