only download remote thumbnails if not already cached

mediachain / mediachain-indexer

search, dedupe, and media ingestion for mediachain

only download remote thumbnails if not already cached #29

Closed: yusefnapora closed this 7 years ago

yusefnapora commented 7 years ago

This uses the open_binary_asset function from https://github.com/mediachain/mediachain-client/pull/83 to get a file-like stream of the thumbnail asset, if one exists in a blockchain-sourced record.

It only pulls the asset if we haven't already cached the requested size of an image with the same uri. This changes the cache_image function a bit, replacing the image_base64 param with an image_open_fn that returns the file-like object, and an image_origin that should hold the uri. The md5 of the uri is saved to disk so that later calls can check whether we've already cached it.
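A minimal sketch of the lazy lookup described above, not the actual cache_image code. The function name `cache_image_sketch`, the `cache_dir` and `size` parameters, and the file naming scheme are all assumptions made for illustration; only `image_open_fn` and `image_origin` come from the description. Resizing is left out so the example stays stdlib-only.

```python
import hashlib
import io
import os
import tempfile


def cache_image_sketch(cache_dir, image_origin, image_open_fn, size="256x256"):
    """Return (path, downloaded) for one cached size of an image.

    image_origin:  the uri of the image (used only to build the cache key).
    image_open_fn: zero-argument callable returning a file-like object;
                   it is called only on a cache miss.
    The "<md5 of uri>_<size>" file name is a hypothetical layout.
    """
    key = hashlib.md5(image_origin.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, f"{key}_{size}")
    if os.path.exists(path):
        return path, False  # cache hit: the asset is never opened
    with image_open_fn() as stream:
        data = stream.read()
    # A real implementation would resize `data` to `size` here.
    with open(path, "wb") as out:
        out.write(data)
    return path, True


if __name__ == "__main__":
    cache_dir = tempfile.mkdtemp()
    opener = lambda: io.BytesIO(b"fake thumbnail bytes")
    print(cache_image_sketch(cache_dir, "http://example.com/a.jpg", opener))
    print(cache_image_sketch(cache_dir, "http://example.com/a.jpg", opener))
```

The second call returns without invoking the opener, which is the point of keying on the uri: the caller does not need the image bytes to know whether a download is necessary.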

Also fixes up a bug that was causing only the first size in do_sizes to be written out to disk.

Seems to work well on my VM, tailing from the records in the testnet.

autoencoder commented 7 years ago

Edit: yusef, now I see why you used the hash of the URLs: it allows lazy loading that only calls the function for uncached images. I'll push an alternate solution in a few minutes, but that one requires that the caller already have the expected hash of the image.

Merging and going to add a few more features on top:

- Switching back to content-based hashing.
- Add support for streaming downloads.
- (maybe) Intermediate solution for sha256 vs md5: A weak hash used for keying in order to limit RAM usage, maybe an even weaker hash than md5, and then a second stronger hash (maybe 2 hashes) recorded in the artifact metadata which gets checked when a cached item is requested.
- Switching back the ordering of writing / checking the fn_h and fn_cache files. Seems the new ordering isn't as close to being atomic - because fn_h is what's checked at lookup time, but now fn_h is written before fn_cache at write time?
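The items above could fit together roughly as follows. This is a sketch under assumptions, not the indexer's code: `store_stream` and `lookup` are hypothetical names, CRC32 from zlib stands in for whatever weak hash gets chosen, and fn_h is assumed to be a small marker file holding the strong hash. The data file is moved into place before the marker, so a lookup that sees fn_h can expect fn_cache to exist.

```python
import hashlib
import json
import os
import zlib


def store_stream(cache_dir, stream, chunk_size=64 * 1024):
    """Stream a file-like object into the cache in chunks.

    Returns (weak_key, sha256_hex). Both hashes are computed incrementally,
    so the whole image never has to sit in memory.
    """
    weak = 0
    strong = hashlib.sha256()
    tmp_path = os.path.join(cache_dir, f"incoming-{os.getpid()}.tmp")
    with open(tmp_path, "wb") as out:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            weak = zlib.crc32(chunk, weak)  # weak hash, used only as the key
            strong.update(chunk)            # strong hash, stored as metadata
            out.write(chunk)
    key = f"{weak:08x}"
    fn_cache = os.path.join(cache_dir, key + ".cache")
    fn_h = os.path.join(cache_dir, key + ".h")
    # Data first, marker last: os.replace swaps each file in atomically.
    os.replace(tmp_path, fn_cache)
    tmp_h = fn_h + ".tmp"
    with open(tmp_h, "w") as f:
        json.dump({"sha256": strong.hexdigest()}, f)
    os.replace(tmp_h, fn_h)
    return key, strong.hexdigest()


def lookup(cache_dir, key, expected_sha256):
    """Return the cached path if the marker exists and the strong hash matches."""
    try:
        with open(os.path.join(cache_dir, key + ".h")) as f:
            meta = json.load(f)
    except FileNotFoundError:
        return None
    if meta.get("sha256") != expected_sha256:
        return None  # weak-key collision or different content
    return os.path.join(cache_dir, key + ".cache")
```

Two entries whose weak keys collide would overwrite each other here; the strong-hash check only makes that a cache miss rather than a wrong result. As noted in the comment above, this kind of lookup needs the caller to already know the expected hashes.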

Wishlist:

Later we're adding some kind of batched, 2-step request API to the Client reader, so that these requests aren't heavily latency-bound, right?

Note: still an unfinished draft function, sketched in 10 min on the beach at 3am. Not yet to be considered entirely prod-ready. Will keep track of it to be sure it gets fully polished up when I return.

yusefnapora commented 7 years ago

Yeah, that's why I was using the urls :) I think having the writer download and hash the image is reasonable, and we can include the hashes in the asset info metadata. Our performance troubles with the bulk ingestion were more due to ipfs than the image downloads. So let's try having the writer include a hash or two with the url. The ipfs performance seems much better in the latest release candidate, although I had some stability issues. Hopefully soon we'll be able to reliably put images into ipfs at write time, and we can skip the http downloads in the client / indexer entirely.

The batched downloads would be great :) I think it's doable now if you hang on to the contents of meta.data.thumbnail, which is the dictionary that has the uri (and ipfs link if it exists). Then you could spin up some threads that call mediachain.reader.api.open_binary_asset to do the downloads.
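Roughly what the threaded version might look like. This is only a sketch: `fetch_thumbnails` is a hypothetical helper, and `open_asset` is a stand-in parameter for mediachain.reader.api.open_binary_asset, whose exact signature is not shown in this thread. It is assumed to take the thumbnail dictionary and return a file-like object, or None when there is nothing to fetch.

```python
import io
from concurrent.futures import ThreadPoolExecutor


def fetch_thumbnails(thumbnails, open_asset, max_workers=8):
    """Download many thumbnails concurrently.

    thumbnails: dict mapping a record id to its saved meta.data.thumbnail dict.
    open_asset: callable taking one thumbnail dict and returning a file-like
                object or None (assumed interface, see the note above).
    Returns a dict mapping record id to bytes, or to None on failure.
    """
    def fetch_one(thumb):
        try:
            stream = open_asset(thumb)
            if stream is None:
                return None
            with stream:
                return stream.read()
        except Exception:
            # One failed download should not abort the whole batch.
            return None

    ids = list(thumbnails)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with ids.
        results = pool.map(fetch_one, [thumbnails[i] for i in ids])
        return dict(zip(ids, results))


if __name__ == "__main__":
    fake = lambda thumb: io.BytesIO(thumb["uri"].encode("utf-8"))
    print(fetch_thumbnails({"a": {"uri": "http://example.com/a.jpg"}}, fake))
```

Threads suit this case because the work is latency-bound I/O; the worker count caps how many requests are in flight at once.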

parkan commented 7 years ago

I'm down with a weak (murmur or similar) hash.

Gonna think about this pipeline a bit, I feel like we're overengineering it