Closed yusefnapora closed 7 years ago
Edit: yusef now I see why you used the hash of the URLs - to allow lazy loading that only calls the function for uncached images. I'll push an alternate solution in a few minutes, but this one requires that the caller already have the expected hash of the image.
Merging and going to add a few more features on top:
Wishlist:
Note: still an unfinished draft function, sketched in 10 min the beach at 3am. Not yet to be considered entirely prod-ready. Will keep track of it to be sure it gets fully polished up when I return.
Yeah, that's why I was using the urls :) I think having the writer download and hash the image is reasonable, and we can include those in the asset info metadata. Our performance troubles with the bulk ingestion was more due to ipfs than the image downloads. So let's try having the writer include a hash or two with the url. The ipfs performance seems much better in the latest release candidate, although I had some stability issues. Hopefully soon we'll be able to reliably put images into ipfs at write time, and we can skip the http downloads in the client / indexer entirely.
The batched downloads would be great :) I think it's doable now if you hang on to the contents of meta.data.thumbnail
, which is the dictionary that has the uri (and ipfs link if it exists). Then you could spin up some threads that call mediachain.reader.api.open_binary_asset
to do the downloads.
I'm down with a weak (murmur or similar) hash.
Gonna think about this pipeline a bit, I feel like we're overengineering it
This uses the
open_binary_asset
fn from https://github.com/mediachain/mediachain-client/pull/83 to get a file-like stream of thethumbnail
asset, if it exists in a blockchain-sourced record.It will only pull the asset if we haven't already cached the requested size of an image with the same
uri
. Changes thecache_image
function a bit, replacing theimage_base64
param with animage_open_fn
that will return the file-like object, and animage_origin
that should have the uri. Saves the md5 of the uri to disk to see if we've already cached it.Also fixes up a bug that was causing only the first size in
do_sizes
to be written out to disk.Seems to work well on my VM, tailing from the records in the testnet.