This is where I end up wishing we were writing this in Scala with Xors and monad transformers 😞
Either yielding the Exception directly or a (ret, Exception) tuple like in Go could work. I'm kind of inclined towards the second approach so we don't have to rely on the type system.
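Roughly, the tuple version would look like this (just a sketch; write_record and handle are illustrative names, not anything in the client yet):

```python
def write_dataset(records):
    """Sketch: yield Go-style (result, error) tuples instead of raising."""
    for record in records:
        try:
            yield (write_record(record), None)  # write_record is illustrative
        except Exception as err:
            yield (None, err)

# Callers check the error slot rather than relying on the type system:
# for result, err in write_dataset(records):
#     if err is not None:
#         handle(err)
```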
Ah, I see what you mean. There are a few other ways to do it:
1) Block forever (and keep retrying).
2) Just throw the exception.
3) APIs like ES just yield / return back a dict for each item, e.g. [{"_id":234,"success":0},{"_id":456,"success":1,"ref":"Qknjdskfjsndfdsdfsd"}].
4) Get fancy with something like the tricks Tornado uses.
Maybe (3). That'd likely require all inserted items to have some kind of pre-blockchain ID for tracking, but that may be a good idea anyway.
I think (3) is more or less equivalent to the tuple idea. I guess you need the ids to know what to retry/report?
I could include the translator output in the yielded dictionary, maybe.
Yeah, for example JPEG / GIF have some exotic / rarely-used settings that can make our re-sizer not work, even though the images are technically valid. That issue has already come up on sections of a few of our datasets.
Without the IDs of the failed artefacts, it'd be a real pain / slow to fix while doing a large ingestion.
Here was one change I've already had to make, to get valid images to not fail:
https://github.com/mediachain/mediachain-indexer/blob/master/mediachain/indexer/mc_ingest.py#L71
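For context, it's the sort of thing sketched below (a rough illustration assuming a PIL-based resizer, not the literal diff in the link):

```python
from PIL import Image, ImageFile

# Be lenient with slightly-malformed but otherwise valid files.
ImageFile.LOAD_TRUNCATED_IMAGES = True

def load_for_resize(fp):
    """Sketch: normalize exotic-but-valid JPEG/GIF variants before resizing."""
    img = Image.open(fp)
    # Palette / CMYK / other modes can break thumbnailing; force RGB first.
    if img.mode not in ('RGB', 'L'):
        img = img.convert('RGB')
    return img
```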
BTW, I also have resizing / snapshotting code for GIFs in the other repo.
@autoencoder can you look over that last commit? It yields a dictionary for success and failure. success looks like:
{'success': True,
 'refs': { 'canonical': 'QmF00...' },
 'translated': { },   # translator output
 'raw_content': ''    # raw bytes of input, before translation
}
and errors:
{'success': False,
 'translated': { },
 'raw_content': '',
 'error_code': StatusCode.UNAVAILABLE,  # grpc error code
 'error_details': '',                   # error message
 'error_traceback': []                  # list of lines in formatted traceback
}
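For example, a consumer could handle these like so (a sketch; writer, iterator and index_record are illustrative stand-ins):

```python
import logging

log = logging.getLogger(__name__)

failed = []
for result in writer.write_dataset(iterator):
    if result['success']:
        # index_record is a hypothetical stand-in for the indexer side.
        index_record(result['refs']['canonical'], result['translated'])
    else:
        failed.append(result)  # keep full context for retry / reporting

for item in failed:
    log.warning('ingest failed (%s): %s\n%s', item['error_code'],
                item['error_details'], ''.join(item['error_traceback']))
```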
I'm happy to tweak it; I think requiring a "native id" of some kind would make a few things easier, but I think we should mull it over a bit more... it seems like we'll want to ingest some things that don't have external ids and would have to figure out some way to handle that.
I could also include the parsed form of the raw content (the output of json.loads(raw_content), basically) instead of / in addition to the raw_content, if that would be useful.
"translated" and "raw_content" are the exact objects that were originally passed in? If so, sounds good. Likely that's enough to figure out exactly which records failed.
Also json.loads(raw_content) could actually fail, unless we escape the raw_content somehow. Some of our datasets have turned out to contain invalid unicode characters in them - likely someone messed up the encoding schemes somewhere along the way. Not exactly sure what we'll do for these. Maybe not a pressing issue.
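One defensive option (just a sketch) is to replace undecodable bytes before parsing, rather than letting the decode raise:

```python
import json

def parse_raw_content(raw_content):
    """Sketch: parse raw bytes to a dict, surviving invalid unicode."""
    if isinstance(raw_content, bytes):
        # Substitute invalid byte sequences instead of raising UnicodeDecodeError.
        raw_content = raw_content.decode('utf-8', errors='replace')
    return json.loads(raw_content)
```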
translated is the output of the translation module, so it will be a bunch of __mediachain_object__s, etc.
parsed is the dict representation of the original input - I'll add that in.
Good point about the parsing failures; I think that's not being handled well at the moment.
Ah, yes maybe we should discuss this on slack.
Was thinking that perhaps it would all be much simpler if the translator was called in a separate step prior to attempting to write to the blockchain via this API.
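i.e. something like this rough two-step sketch (translator.translate and writer.write_object are illustrative names):

```python
# Phase 1: translate everything up front, partitioning out failures.
translated, failures = [], []
for raw in dataset_iterator:
    try:
        translated.append(translator.translate(raw))
    except Exception as err:
        failures.append((raw, err))  # report before any blockchain writes

# Phase 2: only records that translated cleanly ever reach the writer.
for obj in translated:
    writer.write_object(obj)
```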
@yusefnapora can you glance over this and give me your view for where this is at?
ah yeah, this was the change to the translator output format I was talking about in slack this morning; I guess I forgot it hadn't landed in master yet.
The API change is that the dataset iterator and translator always return a dict, with a success key that's true if the operation was successful. If not, the dict will contain error_code, error_details and error_traceback members. For success and errors, the translator output dict contains the raw and parsed inputs as well, in case the downstream consumer needs them (e.g. to pull the original object id out of the parsed input).
Everything works well within the client's ingest -> blockchain flow, but I need to check with @autoencoder to see if we need to update the indexer. Basically, if the indexer is calling Writer.write_dataset, it should check the success key, and the refs are now nested in the output dictionary under the refs key.
OK this sounds like a breaking API change, gonna bucket this into "follow-up" milestone
cool; this PR got out of hand (my perpetual struggle, lol). I think we should pull the first commit that does the backoff / retry into its own PR and merge that. I'll do that in a second
Sounds good to me
@yusefnapora let's review this tomorrow and decide whether to save + merge or abandon for divergence reasons
@yusefnapora ok and this one... seems like we may have diverged too far to salvage this?
yeah, probably :(
We should probably abandon and revisit later.
Word
This uses the same with_retry helper we're already using for the rpc datastore service in the transactor client methods. grpc AbortionErrors with status codes UNAVAILABLE and UNKNOWN will be retried up to 10x before bailing out.

@autoencoder mentioned it would be good to bubble fatal errors up out of the Writer.write_dataset method, but I'm not sure what the correct way to do that is in a generator. At the moment it will just print the traceback and try the next record in the dataset, but if I throw, that would cancel the whole ingestion. I could just yield the exception instead of the result, and callers would have to check the type... or else yield a tuple of (maybe_error, maybe_result)...
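For reference, the retry logic is roughly this shape (a generic sketch, not the actual helper; the real with_retry lives alongside the transactor client, and its retryable check looks at the grpc AbortionError status codes UNAVAILABLE / UNKNOWN):

```python
import time

MAX_ATTEMPTS = 10

def with_retry(fn, is_retryable, *args, **kwargs):
    """Sketch: call fn, retrying retryable errors up to MAX_ATTEMPTS times."""
    delay = 0.5
    for attempt in range(MAX_ATTEMPTS):
        try:
            return fn(*args, **kwargs)
        except Exception as err:
            if attempt == MAX_ATTEMPTS - 1 or not is_retryable(err):
                raise  # out of attempts, or a non-retryable error: bail out
            time.sleep(delay)
            delay *= 2  # simple exponential backoff
```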