mediachain / mediachain-indexer

search, dedupe, and media ingestion for mediachain

wip - catchup to last known block #31

Closed yusefnapora closed 7 years ago

yusefnapora commented 7 years ago

This updates the SimpleClient and receive_blockchain_into_indexer functions to use the new canonical_stream output format from https://github.com/mediachain/mediachain-client/pull/88

I added a cli flag so I could test catching up to a known block (--last-known-block=QmF00...), but plan to remove that soon. It seems like we should be writing out the last known block ref somewhere so we can read it in after a restart.

@autoencoder, where do you think that value should get stored? Could just spit it out to a file somewhere...

parkan commented 7 years ago

Does the indexer have a rocksdb so far?

parkan commented 7 years ago

This needs corresponding requirements.txt change

yusefnapora commented 7 years ago

the testnet indexer has rocksdb installed, yes.

parkan commented 7 years ago

should probably put it in there, then?

yusefnapora commented 7 years ago

yeah, I haven't so far since we need to coordinate with @autoencoder - it's only needed for the blockchain stream, and I didn't want to break any other deployments he's got :) But it would be cool to have it as a hard requirement, since without rocksdb installed the blockchain catchup will be memory-hungry and transient.

parkan commented 7 years ago

Ok deferring to @autoencoder then, would also be ok to drop it into a ~/.mediachain or w/e

yusefnapora commented 7 years ago

yeah, the rocksdb blockchain cache gets stored in ~/.mediachain anyway, so we could just write the block ref to a file there.
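For what it's worth, a minimal sketch of the "just write the block ref to a file in `~/.mediachain`" idea could look like this. The file name and location are assumptions, not anything the code currently does; the write-then-rename keeps a crash mid-write from leaving a truncated ref behind:

```python
import os

# Hypothetical location, alongside the rocksdb block cache in ~/.mediachain
STATE_DIR = os.path.expanduser('~/.mediachain')
REF_FILE = os.path.join(STATE_DIR, 'last_known_block')

def save_last_known_block(ref):
    """Persist the block ref so catchup can resume after a restart."""
    os.makedirs(STATE_DIR, exist_ok=True)
    tmp = REF_FILE + '.tmp'
    with open(tmp, 'w') as f:
        f.write(ref)
    os.rename(tmp, REF_FILE)  # atomic on POSIX: never a half-written ref

def load_last_known_block():
    """Return the saved ref, or None on first run."""
    try:
        with open(REF_FILE) as f:
            return f.read().strip()
    except IOError:
        return None
```

On startup the indexer would call `load_last_known_block()` and, if it returns `None`, fall back to a full catchup.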

parkan commented 7 years ago

I'm down to just drop a file then. Do we need any locking semantics? I guess not, given that only one thread would write. Though now that I think about it, we probably need a pidfile/lock for the whole thing.

yusefnapora commented 7 years ago

yeah, at the moment we shouldn't need locking, since we'll just read once before we start tailing the blockchain, and only write from one thread. but it might get more complicated as time goes on if we end up doing some kind of parallelization, etc

parkan commented 7 years ago

Yeah, we should probably drop a pidfile and refuse to run if another process is present for now
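A sketch of that "refuse to run if another process is present" behavior, assuming a hypothetical pidfile path next to the block cache. Signal 0 only checks that the pid exists, so a stale pidfile from a crashed process gets taken over rather than blocking forever:

```python
import atexit
import os
import sys

# Hypothetical location, alongside the rocksdb cache in ~/.mediachain
DEFAULT_PIDFILE = os.path.expanduser('~/.mediachain/indexer.pid')

def _alive(pid):
    """True if a process with this pid exists (signal 0 checks only)."""
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False

def acquire_pidfile(path=DEFAULT_PIDFILE):
    """Refuse to run if another indexer holds the pidfile; otherwise
    write our pid and clean the file up on normal exit."""
    if os.path.exists(path):
        with open(path) as f:
            old_pid = int(f.read().strip() or '0')
        if old_pid and _alive(old_pid):
            sys.exit('indexer already running with pid %d' % old_pid)
        # otherwise: stale pidfile from a dead process, take it over
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w') as f:
        f.write(str(os.getpid()))
    atexit.register(os.remove, path)
```

Note the existence check and the write aren't atomic, which matches the "only one thread writes" assumption above; a real lock would need `fcntl.flock` or `O_EXCL`.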

autoencoder commented 7 years ago

For the current block count, we may end up sticking that counter (edit: probably more of a transaction ID) in ES, since the contents of ES are the reason the Indexer would be tracking this number.

Sounding good. Will take a closer look / merge in the AM.

yusefnapora commented 7 years ago

@autoencoder the thing is, we can't easily do a simple block height counter without changing the transactor RPC API to include it. At the moment we just have a ref to the block, which isn't ordered at all.

We could keep our own count on the client, but that would get tricky with the partial catchup, etc. Adding the block height or some kind of sequence number to the API probably makes sense

yusefnapora commented 7 years ago

actually, I guess we could pull the index of the first entry in the block, and use that as a sequence number... will think about that some more
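The idea would be something like the following sketch. The block shape here is entirely hypothetical (a dict with an `entries` list whose entries carry the global index assigned by the transactor) and doesn't reflect any current API:

```python
def block_sequence_number(block):
    """Derive an ordered 'block height' stand-in from the chain index
    of the block's first entry.

    Assumes a hypothetical deserialized block shape:
        {'entries': [{'index': 512, ...}, {'index': 513, ...}]}
    """
    entries = block.get('entries', [])
    if not entries:
        return None  # empty block: nothing to read an index from
    return entries[0]['index']
```

Since entry indexes are monotonically increasing across the chain, the first entry's index orders blocks even though block refs themselves don't.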

autoencoder commented 7 years ago

Ideally it'd be an ID which could later be used to identify exactly what position of which fork the chain was on... and then if the API caller later tries to resume from a position that was on an abandoned fork, the client API would then replay the necessary inserts / updates / deletes into the Indexer to get it back in sync with the proper fork. Something like that.

Maybe not needed yet.

vyzo commented 7 years ago

I echo @parkan for having a lock/pid file for the local block cache. Btw, is rocksdb safe for concurrent processes?

autoencoder commented 7 years ago

Ok, took a look. Looks good. Noting some of the parts still WIP:

yusefnapora commented 7 years ago

yeah, it probably does make more sense for us to track the current block in rocksdb in the client code. I can set that up and see about exposing the index so we can use it as our "block height".

We could do multi-process catchup by opening the block cache in read-only mode; that's a good idea. It would need a little bit of extension to the current BlockCache api. Right now the BlockCache is just a read-through cache that doesn't keep track of block ordering, etc. So there's no way to "seek" to a particular block; instead we're always walking back through the chain from the current block. But I think if we track the block index numbers we should be able to spawn multiple processes that each take a range of blocks.
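If we do get block index numbers, the "each process takes a range of blocks" part is straightforward to sketch. This is just the range-splitting arithmetic, assuming each worker then opens the block cache read-only and walks its own slice:

```python
def partition_block_ranges(start_index, end_index, n_workers):
    """Split the half-open interval [start_index, end_index) into
    contiguous, near-equal ranges, one per catchup worker process."""
    total = end_index - start_index
    base, extra = divmod(total, n_workers)
    ranges = []
    lo = start_index
    for i in range(n_workers):
        # the first `extra` workers get one additional block each
        hi = lo + base + (1 if i < extra else 0)
        if hi > lo:
            ranges.append((lo, hi))
        lo = hi
    return ranges
```

The ranges cover every index exactly once, so no two read-only workers would process the same block.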

We'll also need to track the particular blockchain that the blocks are part of; it would be nice if each chain had a unique id or genesis block ref or something that we could identify the chain with. Right now the block cache will store blocks from any chain, since it's just a K/V store. But if we want to keep track of the structure / sequence of blocks, we also need to differentiate between different chains and either store them in separate rocskdb instances or prefix the keys, etc.