mediachain / apps

Discussion and documentation of apps built on Mediachain

Interoperability of the datasets published in mcnet with ipfsnet? #13

Open rht opened 7 years ago

rht commented 7 years ago

I tried to redo http://docs.mediachain.io/tutorial/moma (and serialize it into a script) to see if some parts can be substituted with the existing ipfs toolset (this is part of my experimental effort to serialize pinned datasets in ipfsnet into the ipfs/archives repo).

If the schema is still published to mcnet, but the dataset itself is published to ipfsnet, would it still be possible to perform the SQL query as in the example? If yes, I think the 'mc as the db' solution could be reused in other parallel efforts, e.g. https://github.com/ga4gh/cgtd for a distributed genomics db.
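For concreteness, here is a rough sketch of the split I have in mind, with the dataset blob added to ipfsnet and only a reference kept on the mcnet side. The `Statement` struct is just a made-up stand-in, not mc's actual statement format, and the file/namespace names are only examples:

```go
// Sketch only: add the dataset file to the local ipfs node and keep an
// ipfs reference to it in a mediachain-style statement. The Statement
// type here is a made-up stand-in, not mc's actual statement format.
package main

import (
	"fmt"
	"os"

	shell "github.com/ipfs/go-ipfs-api"
)

// Statement is a placeholder for an mcnet statement whose object lives
// in ipfsnet rather than in mc's own datastore.
type Statement struct {
	Namespace string // e.g. the tutorial's namespace
	Object    string // ipfs hash of the dataset, instead of a local key
}

func main() {
	f, err := os.Open("moma-artworks.ndjson") // example dataset file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sh := shell.NewShell("localhost:5001") // local ipfs daemon API
	hash, err := sh.Add(f)                 // publish the dataset to ipfsnet
	if err != nil {
		panic(err)
	}

	st := Statement{Namespace: "museums.moma.artworks", Object: hash}
	fmt.Printf("would publish to mcnet: %+v\n", st)
}
```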

The search engine indexer could be repurposed as well, with the vectorization method adapted to the context of the data (images, text, cancer data, climate data, etc.).
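A toy sketch of that pluggable-vectorizer idea (all names here are made up for illustration):

```go
// Toy sketch only: the indexer sees feature vectors, and the
// vectorization step is swapped per data type. All names are made up.
package indexer

// Vectorizer turns a raw object into a feature vector for the index.
type Vectorizer interface {
	Vectorize(raw []byte) ([]float64, error)
}

// TextVectorizer stands in for a text embedding method.
type TextVectorizer struct{}

func (TextVectorizer) Vectorize(raw []byte) ([]float64, error) {
	// e.g. bag-of-words or a learned embedding over the decoded text
	return []float64{float64(len(raw))}, nil // placeholder
}

// ImageVectorizer stands in for an image feature extractor.
type ImageVectorizer struct{}

func (ImageVectorizer) Vectorize(raw []byte) ([]float64, error) {
	// e.g. perceptual hash or CNN features over the decoded image
	return []float64{float64(len(raw))}, nil // placeholder
}

// Pick chooses a vectorizer based on the dataset's context.
func Pick(kind string) Vectorizer {
	if kind == "image" {
		return ImageVectorizer{}
	}
	return TextVectorizer{}
}
```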

parkan commented 7 years ago

@rht a couple of points

With those in mind, can you talk a little bit more about how you envision the integration? Sounds like something is definitely possible here.

rht commented 7 years ago

(datastore) I see; this looks like the reverse of the situation with Ethereum's swarm, where devp2p is used for the DHT but the datastore (and data exchange format) can be more readily dedup-ed/one-to-one-mapped. It was easier in the geth case since leveldb's interface maps almost one-to-one onto go-datastore (all that is needed is to cast interface{} into []byte). On the perf side, a while ago (early last year, I think) I tested swapping geth's leveldb for the flatfs datastore that ipfs uses by default (and which, afaict, only ipfs uses so far), and got faster I/O throughput across various chunk sizes. I am going to rerun the benchmark with more test cases and report the results by this weekend.
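To make the leveldb/go-datastore point concrete, here is a minimal sketch of such a wrapper, assuming the go-datastore interface of the time where Put/Get take interface{} values (Query/Has/Delete and batching omitted):

```go
// Minimal sketch: wrapping goleveldb behind the go-datastore interface,
// where the only real work is the []byte type assertion.
package ldbds

import (
	"errors"

	ds "github.com/ipfs/go-datastore"
	"github.com/syndtr/goleveldb/leveldb"
)

type Datastore struct {
	db *leveldb.DB
}

func NewDatastore(path string) (*Datastore, error) {
	db, err := leveldb.OpenFile(path, nil)
	if err != nil {
		return nil, err
	}
	return &Datastore{db: db}, nil
}

// Put stores value under key; the interface{} value must already be []byte.
func (d *Datastore) Put(key ds.Key, value interface{}) error {
	b, ok := value.([]byte)
	if !ok {
		return errors.New("leveldb datastore: value must be []byte")
	}
	return d.db.Put([]byte(key.String()), b, nil)
}

// Get returns the stored bytes as an interface{} value.
func (d *Datastore) Get(key ds.Key) (interface{}, error) {
	return d.db.Get([]byte(key.String()), nil)
}
```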

(data/chunk exchange format) This is what needs to be standardized so that mc's StatementDB could query a dataset that has already been distributed in ipfs. The first checkpoint would be to query the cgtd data. For the test case, a schema could be generated from cgtd's example metadata. Perhaps @rcurrie could provide more info.

parkan commented 7 years ago

@rht our datastore is backed by RocksDB at the moment, FWIW.

It would be entirely plausible for data objects to be retrieved from a non-local datastore; we haven't quite worked out how we want to think about this just yet, but having both "naked objects" (without statements referring to them) and statements without local objects is something we'd like to support. This may require rewriting object references into CIDs.
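Roughly, "rewriting object references into CIDs" would mean something like the sketch below; the codec and hash choices here are arbitrary, not settled:

```go
// Illustrative only: derive a CID for a serialized data object so a
// statement can point at it regardless of which datastore or network
// the object lives in. Codec and hash choices here are arbitrary.
package refs

import (
	cid "github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

// ObjectRef returns a CIDv1 string for the raw object bytes.
func ObjectRef(obj []byte) (string, error) {
	h, err := mh.Sum(obj, mh.SHA2_256, -1) // sha2-256 over the bytes
	if err != nil {
		return "", err
	}
	return cid.NewCidV1(cid.DagCBOR, h).String(), nil
}
```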

rcurrie commented 7 years ago

@rht cgtd's schema is pretty light - a submission is a list of files and a list of fields with values. What the fields are and what their values are, or even what type of files are submitted, is unspecified. Think TCP/IP - the focus is on how the bits (in this case cancer genomic data) get moved around, not what's actually in them. Actually using the bits is left up to curation servers, applications, or consumers.
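Concretely, a submission boils down to roughly this shape (field names here are indicative, not a formal schema):

```go
// Indicative sketch of the submission shape described above: a list of
// files plus free-form fields. Names are illustrative, not a formal
// cgtd schema.
package cgtd

import "encoding/json"

// File is one submitted file, addressed by its ipfs multihash.
type File struct {
	Name string `json:"name"`
	Hash string `json:"multihash"`
}

// Submission is a bag of files and unconstrained key/value fields;
// what the fields mean is left to curation servers or applications.
type Submission struct {
	Files  []File            `json:"files"`
	Fields map[string]string `json:"fields"`
}

// Example marshals a minimal submission to JSON.
func Example() ([]byte, error) {
	s := Submission{
		Files:  []File{{Name: "variants.vcf", Hash: "Qm..."}},
		Fields: map[string]string{"center": "ucsc", "sample_id": "TCGA-XX-YYYY"},
	}
	return json.Marshal(&s)
}
```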