This is where I end up wishing we were writing this in Scala with Xors and monad transformers 😞
Either yielding the Exception directly or a (ret, Exception) tuple like in Go could work. I'm kind of inclined towards the second approach so we don't have to rely on the type system.
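Roughly, the tuple version would look like this (just a sketch; write_record and handle are illustrative names, not anything in the client yet):

```python
def write_dataset(records):
    """Sketch: yield Go-style (result, error) tuples instead of raising."""
    for record in records:
        try:
            yield (write_record(record), None)  # write_record is illustrative
        except Exception as err:
            yield (None, err)

# Callers check the error slot rather than relying on the type system:
# for result, err in write_dataset(records):
#     if err is not None:
#         handle(err)
```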
Ah, I see what you mean. There are a few other ways to do it:
1) Block forever (and keep retrying).
2) Just throw the exception.
3) APIs like ES just yield / return back a dict for each item, e.g. [{"_id":234,"success":0},{"_id":456,"success":1,"ref":"Qknjdskfjsndfdsdfsd"}].
4) Get fancy with something like the tricks Tornado uses.
Maybe (3). That'd likely require all inserted items to have some kind of pre-blockchain ID for tracking, but that may be a good idea anyway.
I think (3) is more or less equivalent to the tuple idea. I guess you need the ids to know what to retry/report?
I could include the translator output in the yielded dictionary, maybe.
Yeah, for example JPEG / GIF have some exotic / rarely-used settings that can make our re-sizer not work, even though the images are technically valid. That issue has already come up on sections of a few of our datasets.
Without the IDs of the failed artefacts, it'd be a real pain / slow to fix while doing a large ingestion.
Here was one change I've already had to make, to get valid images to not fail:
https://github.com/mediachain/mediachain-indexer/blob/master/mediachain/indexer/mc_ingest.py#L71
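For context, it's the sort of thing sketched below (a rough illustration assuming a PIL-based resizer, not the literal diff in the link):

```python
from PIL import Image, ImageFile

# Be lenient with slightly-malformed but otherwise valid files.
ImageFile.LOAD_TRUNCATED_IMAGES = True

def load_for_resize(fp):
    """Sketch: normalize exotic-but-valid JPEG/GIF variants before resizing."""
    img = Image.open(fp)
    # Palette / CMYK / other modes can break thumbnailing; force RGB first.
    if img.mode not in ('RGB', 'L'):
        img = img.convert('RGB')
    return img
```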
BTW, I also have resizing / snapshotting code for GIFs in the other repo.
@autoencoder can you look over that last commit? It yields a dictionary for success and failure. success looks like:
{'success': True,
 'refs': { 'canonical': 'QmF00...' },
 'translated': { },   # translator output
 'raw_content': ''    # raw bytes of input, before translation
}
and errors:
{'success': False,
 'translated': { },
 'raw_content': '',
 'error_code': StatusCode.UNAVAILABLE,  # grpc error code
 'error_details': '',                   # error message
 'error_traceback': []                  # list of lines in formatted traceback
}
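For example, a consumer could handle these like so (a sketch; writer, iterator and index_record are illustrative stand-ins):

```python
import logging

log = logging.getLogger(__name__)

failed = []
for result in writer.write_dataset(iterator):
    if result['success']:
        # index_record is a hypothetical stand-in for the indexer side.
        index_record(result['refs']['canonical'], result['translated'])
    else:
        failed.append(result)  # keep full context for retry / reporting

for item in failed:
    log.warning('ingest failed (%s): %s\n%s', item['error_code'],
                item['error_details'], ''.join(item['error_traceback']))
```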
I'm happy to tweak it; I think requiring a "native id" of some kind would make a few things easier, but I think we should mull it over a bit more... it seems like we'll want to ingest some things that don't have external ids and would have to figure out some way to handle that.
I could also include the parsed form of the raw content (the output of json.loads(raw_content), basically) instead of / in addition to the raw_content, if that would be useful.
"translated" and "raw_content" are the exact objects that were originally passed in? If so, sounds good. Likely that's enough to figure out exactly which records failed.
Also json.loads(raw_content) could actually fail, unless we escape the raw_content somehow. Some of our datasets have turned out to contain invalid unicode characters in them - likely someone messed up the encoding schemes somewhere along the way. Not exactly sure what we'll do for these. Maybe not a pressing issue.
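One defensive option (just a sketch) is to replace undecodable bytes before parsing, rather than letting the decode raise:

```python
import json

def parse_raw_content(raw_content):
    """Sketch: parse raw bytes to a dict, surviving invalid unicode."""
    if isinstance(raw_content, bytes):
        # Substitute invalid byte sequences instead of raising UnicodeDecodeError.
        raw_content = raw_content.decode('utf-8', errors='replace')
    return json.loads(raw_content)
```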
translated is the output of the translation module, so it will be a bunch of __mediachain_object__s, etc.
parsed is the dict representation of the original input - I'll add that in.
Good point about the parsing failures; I think that's not being handled well at the moment.
Ah, yes maybe we should discuss this on slack.
Was thinking that perhaps it would all be much simpler if the translator was called in a separate step prior to attempting to write to the blockchain via this API.
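i.e. something like this rough two-step sketch (translator.translate and writer.write_object are illustrative names):

```python
# Phase 1: translate everything up front, partitioning out failures.
translated, failures = [], []
for raw in dataset_iterator:
    try:
        translated.append(translator.translate(raw))
    except Exception as err:
        failures.append((raw, err))  # report before any blockchain writes

# Phase 2: only records that translated cleanly ever reach the writer.
for obj in translated:
    writer.write_object(obj)
```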
@yusefnapora can you glance over this and give me your view for where this is at?
ah yeah, this was the change to the translator output format I was talking about in slack this morning; I guess I forgot it hadn't landed in master yet.
The API change is that the dataset iterator and translator always return a dict, with a success key that's true if the operation was successful. If not, the dict will contain error_code, error_details and error_traceback members. For success and errors, the translator output dict contains the raw and parsed inputs as well, in case the downstream consumer needs them (e.g. to pull the original object id out of the parsed input).
Everything works well within the client's ingest -> blockchain flow, but I need to check with @autoencoder to see if we need to update the indexer. Basically, if the indexer is calling Writer.write_dataset, it should check the success key, and the refs are now nested in the output dictionary under the refs key.
OK this sounds like a breaking API change, gonna bucket this into "follow-up" milestone
cool; this PR got out of hand (my perpetual struggle, lol). I think we should pull the first commit that does the backoff / retry into its own PR and merge that. I'll do that in a second
Sounds good to me
@yusefnapora let's review this tomorrow and decide whether to save + merge or abandon for divergence reasons
@yusefnapora ok and this one... seems like we may have diverged too far to salvage this?
yeah, probably :(
We should probably abandon and revisit later.
Word
This uses the same with_retry helper we're already using for the rpc datastore service in the transactor client methods. grpc AbortionErrors with status codes UNAVAILABLE and UNKNOWN will be retried up to 10x before bailing out.

@autoencoder mentioned it would be good to bubble fatal errors up out of the Writer.write_dataset method, but I'm not sure what the correct way to do that is in a generator. At the moment it will just print the traceback and try the next record in the dataset, but if I throw, that would cancel the whole ingestion. I could just yield the exception instead of the result, and callers would have to check the type... or else yield a tuple of (maybe_error, maybe_result)...
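For reference, the retry logic is roughly this shape (a generic sketch, not the actual helper; the real with_retry lives alongside the transactor client, and its retryable check looks at the grpc AbortionError status codes UNAVAILABLE / UNKNOWN):

```python
import time

MAX_ATTEMPTS = 10

def with_retry(fn, is_retryable, *args, **kwargs):
    """Sketch: call fn, retrying retryable errors up to MAX_ATTEMPTS times."""
    delay = 0.5
    for attempt in range(MAX_ATTEMPTS):
        try:
            return fn(*args, **kwargs)
        except Exception as err:
            if attempt == MAX_ATTEMPTS - 1 or not is_retryable(err):
                raise  # out of attempts, or a non-retryable error: bail out
            time.sleep(delay)
            delay *= 2  # simple exponential backoff
```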