mediachain / mediachain-indexer

search, dedupe, and media ingestion for mediachain
33 stars 14 forks source link

only download remote thumbnails if not already cached #29

Closed yusefnapora closed 7 years ago

yusefnapora commented 7 years ago

This uses the open_binary_asset fn from https://github.com/mediachain/mediachain-client/pull/83 to get a file-like stream of the thumbnail asset, if it exists in a blockchain-sourced record.

It will only pull the asset if we haven't already cached the requested size of an image with the same uri. Changes the cache_image function a bit, replacing the image_base64 param with an image_open_fn that will return the file-like object, and an image_origin that should have the uri. Saves the md5 of the uri to disk to see if we've already cached it.

Also fixes up a bug that was causing only the first size in do_sizes to be written out to disk.

Seems to work well on my VM, tailing from the records in the testnet.

autoencoder commented 7 years ago

Edit: yusef now I see why you used the hash of the URLs - to allow lazy loading that only calls the function for uncached images. I'll push an alternate solution in a few minutes, but this one requires that the caller already have the expected hash of the image.

Merging and going to add a few more features on top:

Wishlist:

Note: still an unfinished draft function, sketched in 10 min the beach at 3am. Not yet to be considered entirely prod-ready. Will keep track of it to be sure it gets fully polished up when I return.

yusefnapora commented 7 years ago

Yeah, that's why I was using the urls :) I think having the writer download and hash the image is reasonable, and we can include those in the asset info metadata. Our performance troubles with the bulk ingestion was more due to ipfs than the image downloads. So let's try having the writer include a hash or two with the url. The ipfs performance seems much better in the latest release candidate, although I had some stability issues. Hopefully soon we'll be able to reliably put images into ipfs at write time, and we can skip the http downloads in the client / indexer entirely.

The batched downloads would be great :) I think it's doable now if you hang on to the contents of meta.data.thumbnail, which is the dictionary that has the uri (and ipfs link if it exists). Then you could spin up some threads that call mediachain.reader.api.open_binary_asset to do the downloads.

parkan commented 7 years ago

I'm down with a weak (murmur or similar) hash.

Gonna think about this pipeline a bit, I feel like we're overengineering it