Closed parkan closed 8 years ago
so just to clarify before I dive in any deeper...
the writer module accepts raw data and then calls out to a translation module (e.g. getty)? and the getty module's job is to transform the raw data into a collection of mediachain records (Entity, Artefact, etc) and hand them back to the writer to actually send to the transactor...
Is that basically how you're thinking of this?
I think the "write" function accepts tuples of "raw" (text blob) + "mediachain metaschema" (json) metadata, which are written to the transactor. Getty ideally takes a raw blob and outputs the json as a pure function (but maybe not rn due to the author preprocessing)
So presumably writer has two entrypoints, one takes raw + translator name, one takes raw + processed.
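To make the two entrypoints concrete, here's a minimal sketch in python. Everything here is hypothetical (the registry, the function names, the stubbed transactor); it just shows the raw + translator-name path delegating to the raw + processed path:

```python
TRANSACTOR_LOG = []  # stand-in for the real transactor connection

TRANSLATORS = {
    # hypothetical registry; a real getty translator would live here
    "identity": lambda raw: {"artefact": {"meta": raw}},
}

def write_processed(raw, metaschema):
    """Entry point 2: caller supplies already-translated metadata;
    the (raw, metaschema) tuple goes to the transactor."""
    TRANSACTOR_LOG.append((raw, metaschema))
    return metaschema

def write_raw(raw, translator_name):
    """Entry point 1: look up a translator by name, then delegate."""
    translator = TRANSLATORS[translator_name]
    return write_processed(raw, translator(raw))
```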
okay cool, that sounds good
FYI, there's a collection of dataset iterators here too, if that's of any use.
@parkan do you have any thoughts about how to model relationships in the output of the translation module? Meaning for things like creating an entity + artefact + creation cell from one raw blob...
One way would be to have the translator basically create the Entity, etc objects directly, and return them in a dictionary with temporary ids...
something like
{
  "artefact_0": {
    "type": "artefact",
    "meta": { ... }
  },
  "entity_0": {
    "type": "entity",
    "meta": { ... }
  },
  "cell_0": {
    "type": "artefactCreatedBy",
    "ref": "artefact_0",
    "entity": "entity_0",
    "meta": { ... }
  }
}
That gets a little tricky though... since the writer would have to basically make a DAG out of that and insert the entities from the top to get valid refs to use for the dependent objects. So in the example above, artefact_0 and entity_0 would have to be inserted before cell_0
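The "make a DAG out of that" step can be sketched with a topological sort over the temporary ids. This is just an illustration of the idea, assuming translator output shaped like the example above (the `ref`/`entity` field names are taken from that example, not from any real schema):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def insertion_order(records):
    """Order translator output so each record is inserted before
    anything that references it by temporary id."""
    REF_FIELDS = ("ref", "entity")  # fields assumed to hold temp ids
    graph = {}
    for temp_id, record in records.items():
        # map each record to the temp ids it depends on
        graph[temp_id] = {record[f] for f in REF_FIELDS if f in record}
    return list(TopologicalSorter(graph).static_order())

records = {
    "cell_0": {"type": "artefactCreatedBy", "ref": "artefact_0",
               "entity": "entity_0", "meta": {}},
    "artefact_0": {"type": "artefact", "meta": {}},
    "entity_0": {"type": "entity", "meta": {}},
}
order = insertion_order(records)
# artefact_0 and entity_0 sort before cell_0, so their real refs
# exist by the time the cell is inserted
```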
Or we could do something like the "hub and spokes" model we had with l-space, where the artefact is considered the center, and we have a few relationship types defined.
In which case the translator would return something like
{
  "artefact": { ... },
  "related": {
    "artefactCreatedBy": {
      "type": "entity",
      "name": "fooman bar"
    }
  }
}
or something... the actual format would probably differ.
Right, this is the issue we wrestled with in L-SPACE. Let me think about it.
cool. I started a branch here: https://github.com/mediachain/mediachain-client/tree/yn-getty-detangle
but it needs a lot of work still. I'll aim to have a PR open tomorrow.
Thinking about it again I arrive at more or less the same conclusion. I still believe that the translator per se should be stateless and pure and emit a "best effort" representation w/o hitting any indexes.
I think having a r=2 DAG (hub and spokes) is probably a nice complexity/expressiveness tradeoff; if we really need to extract 3+ layers of relationships from a schema we can add a preprocessing step.
So we'd have like (assuming embedded/named authors):
{
  "artefact": { ... },
  "related": {
    "artefactCreatedBy": {
      "type": "entity",
      "object": { id: 1234, name: "fooman bars" }
    }
  }
}
and if the author is only specified by id, we get a stripped down version:
{
  "artefact": { ... },
  "related": {
    "artefactCreatedBy": {
      "type": "entity",
      "object": { id: 1234 } // just id
    }
  }
}
Then the next step tries to upgrade these objects in a stateful way. Basically, I want to decouple translation and lookups as much as possible so there's a simple way to express the former.
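A stateless, pure translator along those lines might look like this sketch — no index lookups, just a best-effort reshaping of whatever fields the raw blob happens to carry (the `artist`/`title` field names are assumptions about the raw input, not the actual getty schema):

```python
def translate(raw):
    """Hypothetical pure translator: raw blob in, hub-and-spokes
    dict out. No side effects, no lookups against any index."""
    related = {}
    artist = raw.get("artist") or {}
    if artist:
        # best effort: keep whatever identifying fields are present,
        # so an id-only author yields the stripped-down form above
        obj = {k: artist[k] for k in ("id", "name") if k in artist}
        related["artefactCreatedBy"] = {"type": "entity", "object": obj}
    return {
        "artefact": {"title": raw.get("title")},
        "related": related,
    }

out = translate({"title": "Some Photo", "artist": {"id": 1234}})
# author specified only by id -> object carries just the id
```

Because it's pure, the same raw blob always yields the same output, which is what makes the later stateful "upgrade" step cleanly separable.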
I suspect that this will be a bit easier to realize in python where we have much more arbitrarily shaped data.
(protip: using `js` instead of `json` as the quote type fixes highlighting)
yeah, I think that's a good way to go. Having the translator be a pure function makes things a lot simpler, and the writer is necessarily stateful anyway, since it has to interact with the transactor, etc., and its whole purpose is producing side effects
I think the `getty` package will eventually migrate to a `translator` project, also
Making pretty good progress on this with #17, but it still needs a bit of cleanup and testing. The big missing piece is artist deduplication; the original getty importer did that locally by scanning through the whole local dataset and inserting a single `Entity` for each unique artist name.
That strategy really only works if you're importing an entire large dataset upfront into an empty blockchain. If there are Entities already in the system, we should be querying against those too.
The local dedup is also a bit harder to fit into this new modular design, although it could be done by having the Writer hang onto the translated output from the whole dataset and "collapsing" similar entities before sending the whole batch off to the transactor.
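That "collapse before sending" idea could look something like this sketch, where the Writer buffers translated output and merges author objects that share a name. Naive name equality stands in for real entity resolution here, and the dict shapes follow the hub-and-spokes examples above:

```python
def collapse_entities(translated_batch):
    """Writer-side local dedup: make every artefact whose author has
    the same normalized name share one entity object, and return the
    unique entities alongside the rewritten batch."""
    entities = {}  # normalized name -> the shared entity object
    for item in translated_batch:
        rel = item.get("related", {}).get("artefactCreatedBy")
        if rel and "name" in rel.get("object", {}):
            key = rel["object"]["name"].strip().lower()
            shared = entities.setdefault(key, rel["object"])
            rel["object"] = shared  # point at the single shared entity
    return translated_batch, list(entities.values())

batch = [
    {"artefact": {"title": "a"},
     "related": {"artefactCreatedBy": {"type": "entity",
                                       "object": {"name": "Fooman Bars"}}}},
    {"artefact": {"title": "b"},
     "related": {"artefactCreatedBy": {"type": "entity",
                                       "object": {"name": "fooman bars"}}}},
]
batch, unique = collapse_entities(batch)
# both artefacts now reference one entity; only one Entity gets inserted
```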
I can set that up for now, but I'd like to sort out the long-term strategy. At the moment I basically see two main approaches: query the indexer for existing Entities before each insert, or insert everything and deprecate/deduplicate the redundant Entities after the fact.
Both approaches have issues...
The first way involves a lot of back-and-forth between the writer and the indexer. Also it means that we need to wait for the indexer to update between each insert, because the next insert might depend on the previous one.
The second way leads to a bunch of immediately deprecated Entities in the blockchain, and we need to figure out how that deprecation / deduplication works, implement reader support, etc. This is something we need to figure out anyway, because this problem is bound to come up again in other contexts (e.g. same person exists in two datasets and we need to combine their chains).
@parkan, @tg - any thoughts on this?
Alternatively, you can deterministically produce the entity (without a timestamp) and use the reference returned, either on first insert or the one returned from duplicate inserts.
yeah, that's a great point vyzo. if we put the timestamp, etc into a cell instead of the main canonical, we could make the canonical hash deterministic but keep the timestamp
That seems like the simplest way forward; the indexer will still be needed for more sophisticated deduplication (across datasets, etc). But for identical records from the same translator, we can just get the ref from the duplicate insert error and not clutter up the blockchain with a bunch of records that only differ in timestamp
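A minimal sketch of that flow, with an in-memory dict standing in for the transactor and a made-up `DuplicateInsertError` (the real transactor's error type and API will differ):

```python
import hashlib
import json

class DuplicateInsertError(Exception):
    """Hypothetical duplicate-insert error carrying the existing ref."""
    def __init__(self, ref):
        self.ref = ref

def canonical_ref(record):
    """Deterministic ref: hash of the timestamp-free canonical form."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def insert_entity(store, record):
    """Insert a timestamp-free Entity; identical records hash to the
    same ref, so a re-insert surfaces the existing ref instead of
    writing a second canonical."""
    ref = canonical_ref(record)
    if ref in store:
        raise DuplicateInsertError(ref)
    store[ref] = record
    return ref

store = {}
r1 = insert_entity(store, {"type": "entity", "name": "fooman bars"})
try:
    r2 = insert_entity(store, {"type": "entity", "name": "fooman bars"})
except DuplicateInsertError as e:
    r2 = e.ref  # reuse the ref from the duplicate-insert error
# same record, same ref; nothing extra lands in the chain
```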
You mean putting the timestamp into an update?
I'm not sure tbh... it seems like it would need its own cell type; something that indicates that this canonical was the result of a translation step at a particular point in time. Then if two translation passes result in an identical canonical, you'd end up with two "translatedBy" (or whatever) cells, but that seems better than two canonicals for the same raw metadata + translator.
have we decided on a path forward?
not really. i've been side-stepping the issue in #17 by not adding a timestamp at all, which results in deterministic translation output. Then I've just been catching the duplicate insert errors for identical Entities.
I'm not sure what to do about timestamping, tbh. I kind of feel like having a client-generated timestamp is weird, since we have no way of controlling or verifying that the client's clock is set to a sane value. It seems like timestamps in the journal Insert and Update events make more sense, since at least there's a closed set of transactors.
I agree that timestamping should be either generated or validated by trusted system components (stale or future writes not accepted into quorum)
Let's defer on this for the moment.
We should separate the getty-specific code into a module and make writer generic, so that it can accept JSON directly