mediachain / oldchain-client

[DEPRECATED] old mediachain client experiments
MIT License

refactor getty into getty + writer #16

Closed · parkan closed this issue 8 years ago

parkan commented 8 years ago

We should separate the getty-specific code into a module and make writer generic, so that it can accept JSON directly

yusefnapora commented 8 years ago

so just to clarify before I dive in any deeper...

the writer module accepts raw data and then calls out to a translation module (e.g. getty)? and the getty module's job is to transform the raw data into a collection of mediachain records (Entity, Artefact, etc) and hand them back to the writer to actually send to the transactor...

Is that basically how you're thinking of this?

parkan commented 8 years ago

I think the "write" function accepts tuples of "raw" (text blob) + "mediachain metaschema" (json) metadata, which is written to the transactor. Getty ideally takes a raw blob and outputs the json as a pure function (but maybe not rn due to the author preprocessing)

So presumably the writer has two entrypoints: one takes raw + translator name, the other takes raw + processed.
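
Something like this, roughly (python sketch; `Writer`, `write_raw`, `write_translated`, and the `transactor.insert` call are placeholder names for illustration, not the actual client API):

```python
# rough sketch of the two entrypoints; names here are placeholders, not the real client API
import json


class Writer(object):
    def __init__(self, transactor, translators):
        self.transactor = transactor
        self.translators = translators  # e.g. {"getty": getty_translate}

    def write_translated(self, raw_blob, translated):
        # entrypoint 1: caller hands us raw + already-processed metaschema json
        record = {"raw": raw_blob, "translated": translated}
        return self.transactor.insert(record)

    def write_raw(self, raw_blob, translator_name):
        # entrypoint 2: raw + translator name; run the (pure) translator ourselves
        translate = self.translators[translator_name]
        translated = translate(json.loads(raw_blob))
        return self.write_translated(raw_blob, translated)
```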

yusefnapora commented 8 years ago

okay cool, that sounds good

autoencoder commented 8 years ago

FYI, there's a collection of dataset iterators here too, if that's of any use.

yusefnapora commented 8 years ago

@parkan do you have any thoughts about how to model relationships in the output of the translation module? Meaning for things like creating an entity + artefact + creation cell from one raw blob...

One way would be to have the translator basically create the Entity, etc objects directly, and return them in a dictionary with temporary ids...

something like

{
  "artefact_0": {
    "type": "artefact",
    "meta": { ... }
  },
  "entity_0": {
    "type": "entity",
    "meta": { ... }
  },
  "cell_0": {
    "type": "artefactCreatedBy",
    "ref": "artefact_0",
    "entity": "entity_0",
    "meta": { ... }
  }
}

That gets a little tricky though... since the writer would have to basically make a DAG out of that and insert the entities from the top to get valid refs to use for the dependent objects. So in the example above, artefact_0 and entity_0 would have to be inserted before cell_0
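
Concretely, the writer would need to do something like this with that dict (just a sketch; the field names and the `transactor.insert` call are assumptions for illustration):

```python
# sketch: insert translator output in dependency order, mapping temp ids -> real refs
def insert_in_order(records, transactor):
    """records: a dict like {"artefact_0": {...}, "entity_0": {...}, "cell_0": {...}}."""
    refs = {}                      # temp id -> canonical ref returned by the transactor
    pending = dict(records)
    while pending:
        progressed = False
        for temp_id, record in list(pending.items()):
            # temp-id references live in fields like "ref" / "entity"
            deps = [v for k, v in record.items()
                    if k in ("ref", "entity") and v in records]
            if all(d in refs for d in deps):
                resolved = {k: refs.get(v, v) if k in ("ref", "entity") else v
                            for k, v in record.items()}
                refs[temp_id] = transactor.insert(resolved)
                del pending[temp_id]
                progressed = True
        if not progressed:
            raise ValueError("circular temp-id references in translator output")
    return refs
```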

Or we could do something like the "hub and spokes" model we had with l-space, where the artefact is considered the center, and we have a few relationship types defined.

In which case the translator would return something like

{
  "artefact": { ... },
  "related": {
    "artefactCreatedBy": {
       "type": "entity",
       "name": "fooman bar"
    }
  }
}

or something... the actual format would probably differ.

parkan commented 8 years ago

Right, this is the issue we wrestled with in L-SPACE. Let me think about it.

yusefnapora commented 8 years ago

cool. I started a branch here: https://github.com/mediachain/mediachain-client/tree/yn-getty-detangle

but it needs a lot of work still. I'll aim to have a PR open tomorrow.

parkan commented 8 years ago

Thinking about it again I arrive at more or less the same conclusion. I still believe that the translator per se should be stateless and pure and emit a "best effort" representation w/o hitting any indexes.

I think having a r=2 DAG (hub and spokes) is probably a nice complexity/expressiveness tradeoff; if we really need to extract 3+ layers of relationships from a schema we can add a preprocessing step.

So we'd have like (assuming embedded/named authors):

{
  "artefact": { ... },
  "related": {
    "artefactCreatedBy": {
       "type": "entity",
       "object": { id: 1234, name: "fooman bars" }
    }
  }
}

and if the author is only specified by id, we get a stripped down version:

{
  "artefact": { ... },
  "related": {
    "artefactCreatedBy": {
       "type": "entity",
       "object": { id: 1234 } // just id
    }
  }
}

Then the next step tries to upgrade these objects in a stateful way. Basically, I want to decouple translation and lookups as much as possible so there's a simple way to express the former.
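
So the flow would be pure translate first, then a separate stateful pass that tries to upgrade the bare references. Sketch only; `indexer.lookup_entity` is a made-up call standing in for whatever lookup we end up with:

```python
# sketch: stateful "upgrade" pass, kept separate from the pure translator
def upgrade_related(translated, indexer):
    created_by = translated.get("related", {}).get("artefactCreatedBy")
    if created_by:
        obj = created_by.get("object", {})
        if "id" in obj and "name" not in obj:
            # stripped-down case: only an id was specified, so hit the index
            known = indexer.lookup_entity(obj["id"])  # hypothetical indexer call
            if known is not None:
                created_by["object"] = known
    return translated
```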

I suspect that this will be a bit easier to realize in python where we have much more arbitrarily shaped data.

parkan commented 8 years ago

(protip: using js instead of json as the quote type fixes highlighting)

yusefnapora commented 8 years ago

yeah, I think that's a good way to go. having the translator be a pure function makes things a lot simpler, and the writer is necessarily stateful anyway, since it has to interact with the transactor, etc., and its whole purpose is producing side effects

parkan commented 8 years ago

I think the getty package will eventually migrate to a translator project, also

yusefnapora commented 8 years ago

Making pretty good progress on this with #17, but it still needs a bit of cleanup and testing. The big missing piece is artist deduplication; the original getty importer did that locally by scanning through the whole local dataset and inserting a single Entity for each unique artist name.

That strategy really only works if you're importing an entire large dataset upfront into an empty blockchain. If there are Entities already in the system, we should be querying against those too.

The local dedup is also a bit harder to fit into this new modular design, although it could be done by having the Writer hang onto the translated output from the whole dataset and "collapsing" similar entities before sending the whole batch off to the transactor.
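
Roughly, the Writer-side collapse would look something like this (sketch only; assumes translator output shaped like the hub-and-spokes examples above, with made-up field access):

```python
# sketch: batch-local dedup by collapsing entities with the same normalized name
# before anything is sent to the transactor
def collapse_entities(translated_batch):
    seen = {}   # normalized artist name -> the single entity dict we keep
    for translated in translated_batch:
        rel = translated.get("related", {}).get("artefactCreatedBy")
        if not rel:
            continue
        obj = rel.get("object", {})
        name = obj.get("name", "").strip().lower()
        if not name:
            continue
        if name in seen:
            rel["object"] = seen[name]   # point at the shared entity
        else:
            seen[name] = obj
    return translated_batch
```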

I can set that up for now, but I'd like to sort out the long-term strategy. At the moment I basically see two main approaches.

  1. for each artist parsed from the getty json, the writer calls out to an indexer (which has global knowledge of the blockchain) and asks for artists matching the parsed name. If none exist, it inserts a new one.
  2. the writer always inserts an Entity and doesn't try to do any deduplication. When the whole batch is finished, it issues a request to the indexer to go through and dedup everything in one batch, and the indexer writes new cells out to "collapse" the duplicates, by linking deprecated Entities to a single "chosen one".

Both approaches have issues...

The first way involves a lot of back-and-forth between the writer and the indexer. Also it means that we need to wait for the indexer to update between each insert, because the next insert might depend on the previous one.

The second way leads to a bunch of immediately deprecated Entities in the blockchain, and we need to figure out how that deprecation / deduplication works, implement reader support, etc. This is something we need to figure out anyway, because this problem is bound to come up again in other contexts (e.g. same person exists in two datasets and we need to combine their chains).

@parkan, @tg - any thoughts on this?

vyzo commented 8 years ago

Alternatively, you can deterministically produce the entity (without a timestamp) and use the reference returned, either from the first insert or from the duplicate-insert error.

yusefnapora commented 8 years ago

yeah, that's a great point vyzo. if we put the timestamp, etc into a cell instead of the main canonical, we could make the canonical hash deterministic but keep the timestamp

yusefnapora commented 8 years ago

That seems like the simplest way forward; the indexer will still be needed for more sophisticated deduplication (across datasets, etc). But for identical records from the same translator, we can just get the ref from the duplicate insert error and not clutter up the blockchain with a bunch of records that only differ in timestamp
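
In code, that would be something like the following (sketch only; `DuplicateInsertError` and `transactor.insert` are stand-ins for whatever the transactor client actually raises and returns):

```python
# sketch: deterministic canonical (no timestamp), recover the ref on duplicate inserts
class DuplicateInsertError(Exception):
    # stand-in for whatever the transactor raises when the canonical already exists
    def __init__(self, existing_ref):
        super(DuplicateInsertError, self).__init__(existing_ref)
        self.existing_ref = existing_ref


def insert_entity(transactor, entity_meta):
    canonical = {"type": "entity", "meta": entity_meta}  # deliberately no timestamp
    try:
        return transactor.insert(canonical)
    except DuplicateInsertError as err:
        return err.existing_ref  # ref of the entity that was already inserted
```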

parkan commented 8 years ago

You mean putting timestamp into an update?

yusefnapora commented 8 years ago

I'm not sure tbh... it seems like it would need its own cell type; something that indicates that this canonical was the result of a translation step at a particular point in time. Then if two translation passes result in an identical canonical, you'd end up with two "translatedBy" (or whatever) cells, but that seems better than two canonicals for the same raw metadata + translator.
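
i.e. the shape would be something like this (completely speculative; the cell name and every field here are made up just to illustrate):

```python
# hypothetical "translatedBy" cell; field names are made up for illustration
translated_by_cell = {
    "type": "translatedBy",
    "ref": "<ref of the translated canonical>",
    "translator": "getty",                 # which translator produced it
    "timestamp": "2016-05-10T12:00:00Z",   # when the translation pass ran
}
```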

denisnazarov commented 8 years ago

have we decided on a path forward?

yusefnapora commented 8 years ago

Not really. I've been side-stepping the issue in #17 by not adding a timestamp at all, which results in deterministic translation output. Then I've just been catching the duplicate insert errors for identical Entities.

I'm not sure what to do about timestamping, tbh. I kind of feel like having a client-generated timestamp is weird, since we have no way of controlling or verifying that their clock is set to a sane value. It seems like timestamps in the journal Insert and Update events make more sense, since at least there's a closed set of transactors.

parkan commented 8 years ago

I agree that timestamping should be either generated or validated by trusted system components (stale or future writes not accepted into quorum)

Let's defer on this for the moment.