mediachain / oldchain-client

[DEPRECATED] old mediachain client experiments

IPFS ingestion / seeding / image caching #80

Open yusefnapora opened 8 years ago

yusefnapora commented 8 years ago

So, we've discovered that it's too expensive to put the image download + ipfs add step into the main ingestion pipeline when doing large bulk ingestions. But, we still want to have images in the indexer, and we'd like for all images to eventually make their way to ipfs and have links from mediachain records to ipfs-hosted images.

partial solution:

At the moment, https://github.com/mediachain/mediachain-client/pull/68 addresses the inefficiency by writing out the original image uri. If you run the ingestion with image downloads disabled, we can still get fast ingestions, and still have the http uri for future reference.

On the read side, the stream that the indexer reads will include a thumbnail_base64 field containing the thumbnail data, which we get by resolving the ipfs link (if possible) or the http uri.

This has several problems:

I'd like to remove the image downloading from the reader api and canonical_stream entirely; that seems like the wrong place for it. Instead, the indexer should examine its own cache to see if it already has the image. If not, it can pull from either ipfs or the http uri. Once the indexer has the image, it gets added to the cache.
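Concretely, the lookup order could look something like this (a minimal sketch: resolve_image, cache_path, and CACHE_DIR are all hypothetical names, and the ipfs fetch just shells out to the CLI):

import hashlib
import os
import subprocess

import requests

CACHE_DIR = '/var/cache/indexer/images'  # hypothetical cache location

def cache_path(native_id):
    # key the cache on a digest of the native id so any string is a safe filename
    return os.path.join(CACHE_DIR, hashlib.sha1(native_id.encode()).hexdigest())

def resolve_image(native_id, ipfs_hash=None, http_uri=None):
    path = cache_path(native_id)
    if os.path.exists(path):              # 1. cache hit
        with open(path, 'rb') as f:
            return f.read()
    if ipfs_hash is not None:             # 2. prefer the ipfs link
        data = subprocess.check_output(['ipfs', 'cat', ipfs_hash])
    else:                                 # 3. fall back to the http uri
        resp = requests.get(http_uri, timeout=30)
        resp.raise_for_status()
        data = resp.content
    with open(path, 'wb') as f:           # populate the cache on the way out
        f.write(data)
    return data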

If there's no ipfs link present in the artefact, the indexer can add the image to a queue, the consumer of which will write the image to ipfs and issue an update cell to add the link to the artefact's chain.

Or, instead of the indexer being in charge of writing out the ipfs links, we can have a separate process that tails the blockchain and tries to fetch images without ipfs links, upload to ipfs, and issue update cells. That process could maybe have read access to the indexer's image cache directory, so it could skip downloading the images if we already have them.
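Both variants end in the same add-image-then-update-cell step; here's a hedged sketch of the tailer version, where stream_artefacts, write_update_cell, and cache_lookup are hypothetical stand-ins for the chain-tailing and transactor interfaces:

import subprocess
import tempfile

import requests

def download_to_tempfile(uri):
    # fetch over http and stash in a temp file so we can hand a path to `ipfs add`
    resp = requests.get(uri, timeout=30)
    resp.raise_for_status()
    f = tempfile.NamedTemporaryFile(delete=False)
    f.write(resp.content)
    f.close()
    return f.name

def seed_missing_images(stream_artefacts, write_update_cell, cache_lookup):
    for artefact in stream_artefacts():
        meta = artefact['meta']['data']
        if 'ipfs_hash' in meta:           # already linked, nothing to do
            continue
        # reuse the indexer's cache if we have read access to it
        path = cache_lookup(artefact) or download_to_tempfile(meta['image_uri'])
        out = subprocess.check_output(['ipfs', 'add', '-q', path])
        write_update_cell(artefact['canonical_id'],
                          {'ipfs_hash': out.decode().strip()})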

Doing the caching as described above will probably lead to some wasted disk space, if the images are kept in a filesystem cache and also added to a local ipfs repo. It might be better to only store the cached images in ipfs, with a local key/value store mapping image_uri (or native_id) to ipfs hash. Then we could do an ipfs get when it's time to serve up the image, which should be fast since it's pulling from a local repo.
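As a sketch of that variant, with sqlite standing in for the key/value store and ipfs cat doing the serving (cat rather than get, so the bytes stream straight back to the caller; the table layout is made up):

import sqlite3
import subprocess

db = sqlite3.connect('image_index.db')
db.execute('CREATE TABLE IF NOT EXISTS images (uri TEXT PRIMARY KEY, ipfs_hash TEXT)')

def record_image(uri, ipfs_hash):
    # remember where the bytes live; the ipfs repo holds the only copy on disk
    db.execute('INSERT OR REPLACE INTO images VALUES (?, ?)', (uri, ipfs_hash))
    db.commit()

def serve_image(uri):
    row = db.execute('SELECT ipfs_hash FROM images WHERE uri = ?', (uri,)).fetchone()
    if row is None:
        return None
    # pulls from the local repo, so this should stay fast for content we hold
    return subprocess.check_output(['ipfs', 'cat', row[0]])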

@parkan, @autoencoder, @denisnazarov, what do you guys think?

autoencoder commented 8 years ago

To understand the exact IPFS bottleneck -- is the problem latency-bound? Network-bound? Are we trying to get more IPs uploading the images to IPFS, to increase overall throughput? Is this related to the lazy, just-in-time image ingestion ideas we discussed before? BTW, I still think supporting something along those lines would be great.

+1 on figuring out whether we should have some kind of unified image cache, which multiple parts of the pipeline can access. Definitely several parts of the Indexer and Frontend pipelines need repeated access to the image content and are very sensitive to latency; proposing an answer to this is at the top of my priorities when I get back.

Agreed that delaying the downloading of the images until much later in the pipeline, during the client reader step, could end up being quite problematic.

My vote for when to ingest would be anything that happens as close as possible to ingestion time.

Keeping the source URL in the metadata for future reference -- Definitely. Generalizing from that idea: Perhaps we should allow multiple "raw input" source URLs? The typical scenario I'm seeing is that what we're doing at ingestion time is a join of 3+ records on a "photo_id" key. These records often each come from different URLs or API endpoints - e.g. photo metadata record, photo image content record, author metadata record, photo gallery metadata record, etc. So, why not separately record the raw content for each of those records, along with the URL that each of those records came from, and maybe also the "photo_id" key they're all being joined on?
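e.g. a record shape along these lines (field names and URLs are illustrative only, not a schema proposal):

raw_record = {
    'photo_id': '12345',
    'sources': [
        {'kind': 'photo_metadata',   'url': 'https://api.example.com/photos/12345', 'raw': '...'},
        {'kind': 'image_content',    'url': 'https://img.example.com/12345.jpg',    'raw': '...'},
        {'kind': 'author_metadata',  'url': 'https://api.example.com/users/999',    'raw': '...'},
        {'kind': 'gallery_metadata', 'url': 'https://api.example.com/galleries/42', 'raw': '...'},
    ],
}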

vyzo commented 8 years ago

I like the delayed write + update cell approach.

parkan commented 8 years ago

One thing that may help avoid data duplication is IPFS's FUSE mount mode:

(dev) [Desktop]% ipfs add bill-gates-jpg.jpg                                              (arkadiy@molybdenum:~/Desktop)
added QmQEQCoGtn2L4DwJr9zXK1MrsaHpKRb3TzYvn4krfddV6d bill-gates-jpg.jpg
(dev) [Desktop]% file /ipfs/QmQEQCoGtn2L4DwJr9zXK1MrsaHpKRb3TzYvn4krfddV6d                (arkadiy@molybdenum:~/Desktop)
/ipfs/QmQEQCoGtn2L4DwJr9zXK1MrsaHpKRb3TzYvn4krfddV6d: JPEG image data, JFIF standard 1.01

There's no write support for /ipfs, only /ipns/local, but this lets us read directly out of the IPFS datastore without copying out the files. A rocksdb (or whatever) instance can serve as the mapping into it. It could even hypothetically serve as a CDN origin, though there doesn't appear to be a way to list pinned files directly at the fs level (vs ipfs pin ls), and correspondingly no way to prevent the FS from transparently going out to the network to fetch a hash we don't have pinned. Hmm.
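The read path through the mount could be as simple as this (sqlite again standing in for rocksdb or whatever; the table keyed on native_id is hypothetical):

import sqlite3

db = sqlite3.connect('image_index.db')

def open_image(native_id):
    row = db.execute('SELECT ipfs_hash FROM images WHERE native_id = ?',
                     (native_id,)).fetchone()
    if row is None:
        raise KeyError(native_id)
    # per the caveat above: if the hash isn't held locally, this read may
    # transparently go out to the network
    return open('/ipfs/%s' % row[0], 'rb')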

parkan commented 8 years ago

FWIW, reading overhead from IPFS Fuse FS seems reasonably low, avg of ~6ms to read headers on a JPEG file (based on hash) vs ~4.5ms for reading directly from disk (based on path)
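Roughly how that measurement could be reproduced (the hash is the one from the ipfs add above; note the OS page cache will flatter both numbers on repeat runs):

import time

def time_header_read(path, nbytes=4096, runs=100):
    # read just enough bytes for JPEG headers, averaged over several runs
    start = time.time()
    for _ in range(runs):
        with open(path, 'rb') as f:
            f.read(nbytes)
    return (time.time() - start) / runs * 1000  # avg ms per read

fuse_ms = time_header_read('/ipfs/QmQEQCoGtn2L4DwJr9zXK1MrsaHpKRb3TzYvn4krfddV6d')
disk_ms = time_header_read('bill-gates-jpg.jpg')
print('fuse: %.2fms, disk: %.2fms' % (fuse_ms, disk_ms))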

parkan commented 8 years ago

@autoencoder I see your concerns about delaying; here are my thoughts:

parkan commented 8 years ago

Trying to think of it kind of like this, for the colocated (i.e. internal), optimistic case:

XXXXXXXXXXXXXXXXX Writer XXXXXXXXXXXXXXXXXXXXXXXXX
X                                                X
  +----------+    +-----------+    +-----------+                           +--------+   +-------+   +--------+
  |Preprocess+---->Normalize  +---->Postprocess+------+---events-----------> Reader +--->Indexer+--->Frontend|
  |(stateful)|    |(stateless)|    |(stateful) |      |                    +---^--^-+   +---^--^+   +-----^--+
  +----------+    +-----------+    ++---------++      |                        |  |         |  |          |
                                    |         |       | +-----------+          |  |         |  |          |
                                    |         |       +->Transactors|          |  |         |  |          |
                                    |         |         +-----------+          |  |         |  |          |
                              +-----v---+ +---v--------+                       |  |         |  |          |
                              |Datastore| | Side cache |                       |  |         |  |          |
                              +-----+---+ +--+---------+                       |  |         |  |          |
                                    |        |                                 |  |         |  |          |
                                    |        +---------------------------------+------------+-------------+
                                    |                                             |            |
                                    +---------------------------------------------+------------+

and the general non-colocated case

XXXXXXXXXXXXXXXXX Writer XXXXXXXXXXXXXXXXXXXXXXXXX
X                                                X
  +----------+    +-----------+    +-----------+           +-----------+   +--------+   +--------+  +--------+
  |Preprocess+---->Normalize  +---->Postprocess+--events--->Transactors+---+ Reader +---+Indexer|---+Frontend|
  |(stateful)|    |(stateless)|    |(stateful) |           +-----------+   +-^-+--^-+   +---^-^--+  +-----^--+
  +----------+    +-----------+    ++----------+                             | |  |         | |           |
                                    |                                        | |  |         | |           |
                                    |                         +--------------+ |  |         | |           |
                                    |                         |                |  |         | |           |
                              +-----v---+                     |         +------v--+--+      | |           |
                              |Datastore+---------------------+         | Side cache +------+-------------+
                              +-----+---+                               +------------+        |
                                    |                                                         |
                                    |                                                         |
                                    |                                                         |
                                    +---------------------------------------------------------+

vyzo commented 8 years ago

@parkan +1 for the pipeline issues.

In addition, given the direction we are going in for the next phase implementation, I think it is incorrect to ingest images before a transaction has been committed to a block. The streaming model will also change to emit only committed blocks (instead of individual journal entries), so there is no benefit to having the image available immediately anyway.
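On the consumer side that might look something like this (stream_blocks and seed_image are hypothetical placeholders for the new streaming API):

def ingest_from_committed_blocks(stream_blocks, seed_image):
    for block in stream_blocks():        # emits only committed blocks
        for entry in block['entries']:
            meta = entry.get('meta', {}).get('data', {})
            uri = meta.get('image_uri')
            if uri and 'ipfs_hash' not in meta:
                # safe to fetch/seed now: the transaction is already in a block
                seed_image(entry['canonical_id'], uri)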

parkan commented 8 years ago

I think in the "wild" case with rapid block generation (ETH levels) this definitely makes sense.

In the "colocated" case we can cheat a bit and optimistically ingest/postprocess things that we're reasonably confident will land in the chain (though technically there's an unbounded number of times you can land on the wrong fork, right? This is clearly not a huge practical problem, but I wonder how this works).