mediachain / oldchain-client

[DEPRECATED] old mediachain client experiments

IPFS ingestion / seeding / image caching #80

Open yusefnapora opened 8 years ago

yusefnapora commented 8 years ago

So, we've discovered that it's too expensive to put the image download + ipfs add step into the main ingestion pipeline when doing large bulk ingestions. But, we still want to have images in the indexer, and we'd like for all images to eventually make their way to ipfs and have links from mediachain records to ipfs-hosted images.

partial solution:

At the moment, https://github.com/mediachain/mediachain-client/pull/68 addresses the inefficiency by writing out the original image uri. If you run the ingestion with image downloads disabled, we can still get fast ingestions, and still have the http uri for future reference.

On the read side, the stream that the indexer reads will include a thumbnail_base64 field containing the thumbnail data, which we get by resolving the ipfs link (if possible) or the http uri.

This has several problems:

I'd like to remove the image downloading from the reader api and canonical_stream entirely; that seems like the wrong place for it. Instead, the indexer should examine its own cache to see if it already has the image. If not, it can pull from either ipfs or the http uri. Once the indexer has the image, it gets added to the cache.
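Concretely, the lookup order could look something like this (a minimal sketch: resolve_image, cache_path, and CACHE_DIR are all hypothetical names, and the ipfs fetch just shells out to the CLI):

import hashlib
import os
import subprocess

import requests

CACHE_DIR = '/var/cache/indexer/images'  # hypothetical cache location

def cache_path(native_id):
    # key the cache on a digest of the native id so any string is a safe filename
    return os.path.join(CACHE_DIR, hashlib.sha1(native_id.encode()).hexdigest())

def resolve_image(native_id, ipfs_hash=None, http_uri=None):
    path = cache_path(native_id)
    if os.path.exists(path):              # 1. cache hit
        with open(path, 'rb') as f:
            return f.read()
    if ipfs_hash is not None:             # 2. prefer the ipfs link
        data = subprocess.check_output(['ipfs', 'cat', ipfs_hash])
    else:                                 # 3. fall back to the http uri
        resp = requests.get(http_uri, timeout=30)
        resp.raise_for_status()
        data = resp.content
    with open(path, 'wb') as f:           # populate the cache on the way out
        f.write(data)
    return data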

If there's no ipfs link present in the artefact, the indexer can add the image to a queue, the consumer of which will write the image to ipfs and issue an update cell to add the link to the artefact's chain.

Or, instead of the indexer being in charge of writing out the ipfs links, we can have a separate process that tails the blockchain and tries to fetch images without ipfs links, upload to ipfs, and issue update cells. That process could maybe have read access to the indexer's image cache directory, so it could skip downloading the images if we already have them.
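Both variants end in the same add-image-then-update-cell step; here's a hedged sketch of the tailer version, where stream_artefacts, write_update_cell, and cache_lookup are hypothetical stand-ins for the chain-tailing and transactor interfaces:

import subprocess
import tempfile

import requests

def download_to_tempfile(uri):
    # fetch over http and stash in a temp file so we can hand a path to `ipfs add`
    resp = requests.get(uri, timeout=30)
    resp.raise_for_status()
    f = tempfile.NamedTemporaryFile(delete=False)
    f.write(resp.content)
    f.close()
    return f.name

def seed_missing_images(stream_artefacts, write_update_cell, cache_lookup):
    for artefact in stream_artefacts():
        meta = artefact['meta']['data']
        if 'ipfs_hash' in meta:           # already linked, nothing to do
            continue
        # reuse the indexer's cache if we have read access to it
        path = cache_lookup(artefact) or download_to_tempfile(meta['image_uri'])
        out = subprocess.check_output(['ipfs', 'add', '-q', path])
        write_update_cell(artefact['canonical_id'],
                          {'ipfs_hash': out.decode().strip()})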

Doing the caching as described above will probably lead to some wasted disk space, if the images are kept in a filesystem cache and also added to a local ipfs repo. It might be better to only store the cached images in ipfs, with a local key/value store mapping image_uri (or native_id) to ipfs hash. Then we could do an ipfs get when it's time to serve up the image, which should be fast since it's pulling from a local repo.
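As a sketch of that variant, with sqlite standing in for the key/value store and ipfs cat doing the serving (cat rather than get, so the bytes stream straight back to the caller; the table layout is made up):

import sqlite3
import subprocess

db = sqlite3.connect('image_index.db')
db.execute('CREATE TABLE IF NOT EXISTS images (uri TEXT PRIMARY KEY, ipfs_hash TEXT)')

def record_image(uri, ipfs_hash):
    # remember where the bytes live; the ipfs repo holds the only copy on disk
    db.execute('INSERT OR REPLACE INTO images VALUES (?, ?)', (uri, ipfs_hash))
    db.commit()

def serve_image(uri):
    row = db.execute('SELECT ipfs_hash FROM images WHERE uri = ?', (uri,)).fetchone()
    if row is None:
        return None
    # pulls from the local repo, so this should stay fast for content we hold
    return subprocess.check_output(['ipfs', 'cat', row[0]])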

@parkan, @autoencoder, @denisnazarov, what do you guys think?

autoencoder commented 8 years ago

To understand the exact IPFS bottleneck -- is the problem latency-bound? Network-bound? Are we trying to get more IPs uploading the images to IPFS, to increase overall throughput? Is this related to the lazy, just-in-time image ingestion ideas we discussed before? BTW, I still think supporting something along those lines would be great.

+1 on figuring out whether we should have some kind of unified image cache, which multiple parts of the pipeline can access. Definitely several parts of the Indexer and Frontend pipelines need repeated access to the image content and are very sensitive to latency; proposing an answer to this is at the top of my priorities when I get back.

Agreed that delaying the downloading of the images until much later in the pipeline, during the client reader step, could end up being quite problematic.

My vote for when to ingest would be anything that happens as close as possible to ingestion time.

Keeping the source URL in the metadata for future reference -- Definitely. Generalizing from that idea: Perhaps we should allow multiple "raw input" source URLs? The typical scenario I'm seeing is that what we're doing at ingestion time is a join of 3+ records on a "photo_id" key. These records often each come from different URLs or API endpoints - e.g. photo metadata record, photo image content record, author metadata record, photo gallery metadata record, etc. So, why not separately record the raw content for each of those records, along with the URL that each of those records came from, and maybe also the "photo_id" key they're all being joined on?
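e.g. a record shape along these lines (field names and URLs are illustrative only, not a schema proposal):

raw_record = {
    'photo_id': '12345',
    'sources': [
        {'kind': 'photo_metadata',   'url': 'https://api.example.com/photos/12345', 'raw': '...'},
        {'kind': 'image_content',    'url': 'https://img.example.com/12345.jpg',    'raw': '...'},
        {'kind': 'author_metadata',  'url': 'https://api.example.com/users/999',    'raw': '...'},
        {'kind': 'gallery_metadata', 'url': 'https://api.example.com/galleries/42', 'raw': '...'},
    ],
}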

vyzo commented 8 years ago

I like the delayed write + update cell approach.

parkan commented 8 years ago

One thing that may help avoid data duplication is IPFS's FUSE mount mode:

(dev) [Desktop]% ipfs add bill-gates-jpg.jpg                                              (arkadiy@molybdenum:~/Desktop)
added QmQEQCoGtn2L4DwJr9zXK1MrsaHpKRb3TzYvn4krfddV6d bill-gates-jpg.jpg
(dev) [Desktop]% file /ipfs/QmQEQCoGtn2L4DwJr9zXK1MrsaHpKRb3TzYvn4krfddV6d                (arkadiy@molybdenum:~/Desktop)
/ipfs/QmQEQCoGtn2L4DwJr9zXK1MrsaHpKRb3TzYvn4krfddV6d: JPEG image data, JFIF standard 1.01

There's no write support for /ipfs, only /ipns/local, but this lets us read directly out of the IPFS datastore without copying out the files. A rocksdb (or whatever) instance can serve as the mapping into it. It could even hypothetically serve as a CDN origin, though there doesn't appear to be a way to list pinned files directly at the fs level (vs ipfs pin ls), and correspondingly no way to prevent the FS from transparently going out to the network to fetch a hash we don't have pinned. Hmm.
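The read path through the mount could be as simple as this (sqlite again standing in for rocksdb or whatever; the table keyed on native_id is hypothetical):

import sqlite3

db = sqlite3.connect('image_index.db')

def open_image(native_id):
    row = db.execute('SELECT ipfs_hash FROM images WHERE native_id = ?',
                     (native_id,)).fetchone()
    if row is None:
        raise KeyError(native_id)
    # per the caveat above: if the hash isn't held locally, this read may
    # transparently go out to the network
    return open('/ipfs/%s' % row[0], 'rb')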

parkan commented 8 years ago

FWIW, reading overhead from IPFS Fuse FS seems reasonably low, avg of ~6ms to read headers on a JPEG file (based on hash) vs ~4.5ms for reading directly from disk (based on path)
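Roughly how that measurement could be reproduced (the hash is the one from the ipfs add above; note the OS page cache will flatter both numbers on repeat runs):

import time

def time_header_read(path, nbytes=4096, runs=100):
    # read just enough bytes for JPEG headers, averaged over several runs
    start = time.time()
    for _ in range(runs):
        with open(path, 'rb') as f:
            f.read(nbytes)
    return (time.time() - start) / runs * 1000  # avg ms per read

fuse_ms = time_header_read('/ipfs/QmQEQCoGtn2L4DwJr9zXK1MrsaHpKRb3TzYvn4krfddV6d')
disk_ms = time_header_read('bill-gates-jpg.jpg')
print('fuse: %.2fms, disk: %.2fms' % (fuse_ms, disk_ms))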

parkan commented 8 years ago

@autoencoder I see your concerns about delaying; here are my thoughts:

parkan commented 8 years ago

Trying to think of it kind of like this, for the colocated (i.e. internal), optimistic case:

XXXXXXXXXXXXXXXXX Writer XXXXXXXXXXXXXXXXXXXXXXXXX
X                                                X
  +----------+    +-----------+    +-----------+                           +--------+   +-------+   +--------+
  |Preprocess+---->Normalize  +---->Postprocess+------+---events-----------> Reader +--->Indexer+--->Frontend|
  |(stateful)|    |(stateless)|    |(stateful) |      |                    +---^--^-+   +---^--^+   +-----^--+
  +----------+    +-----------+    ++---------++      |                        |  |         |  |          |
                                    |         |       | +-----------+          |  |         |  |          |
                                    |         |       +->Transactors|          |  |         |  |          |
                                    |         |         +-----------+          |  |         |  |          |
                              +-----v---+ +---v--------+                       |  |         |  |          |
                              |Datastore| | Side cache |                       |  |         |  |          |
                              +-----+---+ +--+---------+                       |  |         |  |          |
                                    |        |                                 |  |         |  |          |
                                    |        +---------------------------------+------------+-------------+
                                    |                                             |            |
                                    +---------------------------------------------+------------+

and the general non-colocated case

XXXXXXXXXXXXXXXXX Writer XXXXXXXXXXXXXXXXXXXXXXXXX
X                                                X
  +----------+    +-----------+    +-----------+           +-----------+   +--------+   +--------+  +--------+
  |Preprocess+---->Normalize  +---->Postprocess+--events--->Transactors+---+ Reader +---+Indexer|---+Frontend|
  |(stateful)|    |(stateless)|    |(stateful) |           +-----------+   +-^-+--^-+   +---^-^--+  +-----^--+
  +----------+    +-----------+    ++----------+                             | |  |         | |           |
                                    |                                        | |  |         | |           |
                                    |                         +--------------+ |  |         | |           |
                                    |                         |                |  |         | |           |
                              +-----v---+                     |         +------v--+--+      | |           |
                              |Datastore+---------------------+         | Side cache +------+-------------+
                              +-----+---+                               +------------+        |
                                    |                                                         |
                                    |                                                         |
                                    |                                                         |
                                    +---------------------------------------------------------+

vyzo commented 8 years ago

@parkan +1 for the pipeline issues.

In addition, given the direction we are going in for the next phase implementation, I think it is incorrect to ingest images before a transaction has been committed to a block. The streaming model will also change to emit only committed blocks (instead of individual journal entries), so there is no benefit to having the image available immediately anyway.
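On the consumer side that might look something like this (stream_blocks and seed_image are hypothetical placeholders for the new streaming API):

def ingest_from_committed_blocks(stream_blocks, seed_image):
    for block in stream_blocks():        # emits only committed blocks
        for entry in block['entries']:
            meta = entry.get('meta', {}).get('data', {})
            uri = meta.get('image_uri')
            if uri and 'ipfs_hash' not in meta:
                # safe to fetch/seed now: the transaction is already in a block
                seed_image(entry['canonical_id'], uri)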

parkan commented 8 years ago

I think in the "wild" case with rapid block generation (ETH levels) this definitely makes sense.

In the "colocated" case we can cheat a bit and optimistically ingest/postprocess things that we're reasonably confident will land in the chain (though technically there's an unbounded number of times you can land on the wrong fork, right? This is clearly not a huge practical problem, but I wonder how this works).