store binary assets (images) with ipld-style link properties

yusefnapora commented 8 years ago

This changes the way we write out image links. Instead of just writing the link directly, it's embedded in a wrapper object that has binary_asset: true set, and also has the origin url and mime type (if possible).

I changed the getty translator and iterator to output all images it can find (so thumb, comp, and preview), all stored in the meta.data.images map. so e.g:

"images": {
        "comp": {
          "origin": "http://cache1.asset-cache.net/gc/JD6484-001-animal-supervision-gettyimages.jpg?v=1&c=IWSAsset&k=2&d=OA3owSIxYcZu6CTE0v2Nq7xkTFD5%2fH8AEc%2bV4t9URzTj3xVJ2%2bN47LqFBOaJ9O2H&b=NDk1",
          "binary_asset": true,
          "link": {
            "@link": "QmR52eufw5m5DqzSDfRPSGR8pUC4KHyxSDiBia265je2jX"
          },
          "mime_type": "image/jpeg"
        },
        "preview": {
          "origin": "http://cache1.asset-cache.net/gp/JD6484-001.jpg?v=1&c=IWSAsset&k=3&d=OA3owSIxYcZu6CTE0v2Nqz%2btE2GTkH5ibg6NPo%2f0ON0d8HvCv9j88vapnBheEuHz&b=ODY=",
          "binary_asset": true,
          "link": {
            "@link": "QmZtMAvxNrVwM7ZQPR2y3NTNN3ci7DH3qvVQMJCgGTLKVA"
          },
          "mime_type": "image/jpeg"
        },
        "thumb": {
          "origin": "http://cache1.asset-cache.net/xt/JD6484-001.jpg?v=1&g=fs1|0|HG|64|841&s=1&b=RjI4",
          "binary_asset": true,
          "link": {
            "@link": "QmSqVFPygCsb9DRJuFQbYkeKfmCM2Pdv9jEqRASk2197K9"
          },
          "mime_type": "image/jpeg"
        }
      },

It might make more sense to use a list instead of a map, to allow unnamed items, etc. That will need some changes to the writer code tho.

This also adds an .open method to the IpfsDatastore and RpcDatastore classes, which returns a file-like object with the contents of the returned object.

I haven't updated the reader api yet, except to remove the old thumbnail_base64 code. This means that field is no longer present, so if the indexer is counting on it existing, this may break things.

The indexer could potentially use this new format as-is by walking the meta.data dict and looking for binary_assets. But it may also be worth updating the reader api to do that and collect the binary links in one place for easier iteration.

parkan commented 8 years ago

I like it! List may be a good idea too, with names (when present) moving into the objects themselves

parkan commented 8 years ago

Noting this as a protocol-breaking change we should get out the door soon

yusefnapora commented 8 years ago

This has been updated, with the getty stuff removed, and we now write out the asset dictionary even if we don't have the image data and can't create an ipfs link. In other words, if you run with --skip-image-downloads, it will just output a dict like

{
    "uri": "http://cache1.asset-cache.net/gc/JD6484-001-animal-supervision-gettyimages.jpg?v=1&c=IWSAsset&k=2&d=OA3owSIxYcZu6CTE0v2Nq7xkTFD5%2fH8AEc%2bV4t9URzTj3xVJ2%2bN47LqFBOaJ9O2H&b=NDk1",
    "binary_asset": true,
    "mime_type": "image/jpeg"
}

I changed the origin field to uri, since that seems like it makes more sense when there's no ipfs link present.

I'm planning to add back the special-casing of the thumbnail key, which was removed in this branch a few commits ago. In other words, try to pull the image at read-time, either from ipfs or from the http uri if there's no ipfs link. If successful, hand off the base64 image data to the indexer. This will be compatible with the current indexer code, which expects a thumbnail_base64 key containing the image data.

At that point, I feel like we can go ahead and merge this to prepare for a bulk re-ingestion of the getty data (with --skip-image-downloads).

However, I'm also going to write up how we best handle image caching and adding ipfs links after the fact, which seems like it needs its own issue.

parkan commented 8 years ago

I'm planning to add back the special-casing of the thumbnail key, which was removed in this branch a few commits ago

I like the generic images with optional "thumbnail: true" property, is there some reason we don't like that?

yusefnapora commented 8 years ago

I like the generic images too; I added the special casing back because we seemed to be in a rush to get images into the indexer, and the quickest way was to give it the output it was already expecting. This branch still supports generic images (like the snippet in the first comment), but it looks for an object at the thumbnail key and outputs thumbnail_base64, since that's where the indexer is currently looking.

I'm planning to replace this with "two step" image fetching, so consumers of the canonical_stream don't need to pay the download cost. But that needs indexer support, so need to coordinate it with @autoencoder

yusefnapora commented 8 years ago

Going to merge this so we can re-ingest on the testnet

mediachain / oldchain-client

store binary assets (images) with ipld-style link properties #68