livepeer / catalyst

Livepeer's Decentralized Media Server

IPFS support #154

Open cyberj0g opened 1 year ago

cyberj0g commented 1 year ago

General considerations

victorges commented 1 year ago

Agree with almost everything! Some comments:

To ensure the cloud provider didn't tamper with the contents of the file, we probably want to add an extra step of calculating the content hash locally and comparing it with the address returned after upload.

I don't think this is a super strict requirement. We upload things to Google Cloud Storage and S3 and don't re-download the file to check that it is actually the file we uploaded. Pinata is a much smaller provider, but if we are building our service on top of them we should probably trust them as a service provider.

We could still do the pre-calculation of the CID for other reasons though, like giving a CID to every asset even if it's not saved on IPFS, which would allow for a more homogeneous use of CIDs as identifiers. I still wouldn't put this as a requirement for this first integration though.
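
As a rough illustration of what that pre-calculation could look like, here is a minimal Go sketch using go-cid and go-multihash. Note this is only a sketch: it reproduces the provider's CID only for content pinned as a single raw block, since real IPFS adds chunk larger files into a UnixFS DAG whose CID depends on the chunker settings.

```go
// Minimal sketch of computing a CIDv1 locally before upload, so it can later be
// compared against whatever the pinning provider returns. This only matches the
// provider's CID if the file was added as a single raw block; larger files are
// chunked into a UnixFS DAG, so the chunker settings would have to match.
package main

import (
	"fmt"
	"os"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

func rawCIDv1(data []byte) (cid.Cid, error) {
	// sha2-256 multihash over the raw bytes.
	h, err := mh.Sum(data, mh.SHA2_256, -1)
	if err != nil {
		return cid.Undef, err
	}
	// CIDv1 with the "raw" codec.
	return cid.NewCidV1(cid.Raw, h), nil
}

func main() {
	data, err := os.ReadFile("asset.mp4")
	if err != nil {
		panic(err)
	}
	c, err := rawCIDv1(data)
	if err != nil {
		panic(err)
	}
	fmt.Println("expected CID:", c.String()) // compare with the CID returned after upload
}
```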

VOD Input

VOD Output

ipfs:// URL support should be added to https://github.com/livepeer/go-tools/issues/3 and catalyst-uploader.

I don't think catalyst-uploader needs to support ipfs:// URLs. go-tools might in theory, but in practice it won't really be a requirement since it will only be uploading files, not downloading them (especially if we only download them with gateway URLs).

Here I just want to make a clear distinction: a "livepeer-defined URL to represent an IPFS pinning service as an Object Store" should not use the ipfs:// scheme, but something else like pinata://. That's because ipfs:// is part of the official IPFS protocol, used to reference and read files through their content hash, so we should not mix up the two.

So IMO go-tools and catalyst-uploader will need support for an "IPFS-based Object Store", but they won't really need support for ipfs:// URLs.

It will use the Pinata API to pin the file and return a Pinata IPFS gateway URL.

It would be more useful to get the IPFS CID or ipfs:// URL back, so we don't need to parse gateway URLs to get the CID. For the playlists it does make sense to use a gateway URL though, so they are supported in regular browsers. If we need to return the gateway URLs to Studio it's not a huge issue either; it's OK to parse the gateway URLs. It's just a soft preference to get the raw IPFS URL or CID instead.
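
Since both directions come up here, a minimal sketch of converting between a CID and a path-style gateway URL, assuming the conventional https://<gateway-host>/ipfs/<cid>[/<name>] layout; the helper names are hypothetical, not part of catalyst-api or go-tools.

```go
// Sketch of going between a raw CID and a path-style gateway URL. Names here
// are illustrative only.
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// gatewayURL builds a browser-playable URL for a CID on a given gateway host.
func gatewayURL(host, cidStr, filename string) string {
	u := fmt.Sprintf("https://%s/ipfs/%s", host, cidStr)
	if filename != "" {
		u += "/" + filename
	}
	return u
}

// cidFromGatewayURL extracts the CID from a path-style gateway URL.
func cidFromGatewayURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	if len(parts) < 2 || parts[0] != "ipfs" {
		return "", fmt.Errorf("not a path-style IPFS gateway URL: %s", raw)
	}
	return parts[1], nil
}

func main() {
	fmt.Println(gatewayURL("ipfs.livepeer.com", "bafy...", "video.mp4"))
	cid, _ := cidFromGatewayURL("https://cloudflare-ipfs.com/ipfs/bafy.../video.mp4")
	fmt.Println(cid)
}
```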

Btw we also have our own branded IPFS gateway through Pinata, under ipfs.livepeer.com (and maybe .studio as well, not sure). But that will only work through our "built-in" object store, so we shouldn't necessarily use it for every file saved on IPFS, and I'm also not sure how we would pass that to catalyst. Maybe we could use the Object Store hostname for the host that should be used for the gateway? Feels a little weird, but it would look like pinata://key:pwd@ipfs.livepeer.com, or pinata://key:pwd@cloudflare-ipfs.com if we just want to use Cloudflare's gateway.
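
For illustration only, a sketch of how such a hypothetical pinata:// Object Store URL could carry both the API credentials and the preferred gateway host. This scheme is not an existing go-tools driver; it is just meant to show the shape of the idea.

```go
// Sketch of parsing a hypothetical pinata:// Object Store URL of the form
// pinata://key:secret@gateway-host, keeping ipfs:// reserved for content addresses.
package main

import (
	"fmt"
	"net/url"
)

type pinataStore struct {
	APIKey      string
	APISecret   string
	GatewayHost string // host to use when building playback gateway URLs
}

func parsePinataURL(raw string) (*pinataStore, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	if u.Scheme != "pinata" {
		return nil, fmt.Errorf("unsupported scheme %q (ipfs:// is reserved for content addresses)", u.Scheme)
	}
	secret, _ := u.User.Password()
	return &pinataStore{
		APIKey:      u.User.Username(),
		APISecret:   secret,
		GatewayHost: u.Host,
	}, nil
}

func main() {
	s, err := parsePinataURL("pinata://key:pwd@ipfs.livepeer.com")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", s)
}
```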

TBD: what should the playlist behavior be? It seems that IPNS is not available on Pinata. It does seem to support directory wrapping, with a pretty obscure API, which may allow addressing the file by name.

For VOD I think it's fine if all the files aren't in the same IPFS directory. We can just have a playlist file pointing to other independent files on IPFS, which is possible because we can store the playlist file after everything else.
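
A minimal sketch of that "playlist last" idea, assuming path-style gateway URLs; the segment CIDs and durations are placeholders, not real Catalyst output.

```go
// Each segment is pinned independently; the playlist simply references them by
// gateway URL, so nothing needs to live in one IPFS directory, and the playlist
// itself is written (and pinned) only after all segments exist.
package main

import (
	"fmt"
	"strings"
)

type segment struct {
	CID      string
	Duration float64
}

func mediaPlaylist(gatewayHost string, segs []segment) string {
	var b strings.Builder
	b.WriteString("#EXTM3U\n#EXT-X-VERSION:3\n#EXT-X-TARGETDURATION:10\n")
	for _, s := range segs {
		fmt.Fprintf(&b, "#EXTINF:%.3f,\nhttps://%s/ipfs/%s\n", s.Duration, gatewayHost, s.CID)
	}
	b.WriteString("#EXT-X-ENDLIST\n")
	return b.String()
}

func main() {
	segs := []segment{
		{CID: "bafy...seg0", Duration: 10.0},
		{CID: "bafy...seg1", Duration: 8.5},
	}
	fmt.Print(mediaPlaylist("gateway.pinata.cloud", segs))
}
```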

Much trickier for livestreams indeed, but I'd argue it doesn't make sense for livestreams anyway, since we'd have a "content address" that is not permanent and changes all the time. Might as well have dynamic playlists in that case and store only the segments on IPFS (if we ever do want IPFS-based playback).

I'd also say not to spend a lot of time on this. IPFS playback is not practical right now, and even though it is getting better, we should focus on what works today. So IMO starting with only the original "MP4" files on IPFS is enough (and that's all we have on Studio today as well, apart from the NFT metadata, which won't be handled by Catalyst anyway).

victorges commented 1 year ago

More concrete examples for input/output:

Also a side note, we might need to rethink this outputs schema, since there's no reference there to which output_location from the request each entry refers to. If we have multiple object_store outputs then they're indistinguishable there. It could be as simple as including the original output_location URL in there, though that disallows multiple exports to the same OS, which we might need. Perhaps having the contract of always listing them in the same order? Not sure.
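
For example, a hypothetical shape for the outputs list that echoes the originating output_location back so multiple object_store outputs stay distinguishable; the field names are illustrative, not the current catalyst-api schema.

```go
// Hypothetical callback outputs schema, carrying the original output_location
// alongside any IPFS-specific fields.
package main

import (
	"encoding/json"
	"fmt"
)

type outputResult struct {
	Type           string `json:"type"`                      // e.g. "object_store" or "ipfs"
	OutputLocation string `json:"output_location,omitempty"` // URL from the original request
	CID            string `json:"cid,omitempty"`             // set for IPFS outputs
	GatewayURL     string `json:"gateway_url,omitempty"`
}

func main() {
	outs := []outputResult{
		{Type: "object_store", OutputLocation: "s3+https://key:secret@host/bucket/path"},
		{
			Type:           "ipfs",
			OutputLocation: "pinata://key:pwd@ipfs.livepeer.com",
			CID:            "bafy...",
			GatewayURL:     "https://ipfs.livepeer.com/ipfs/bafy...",
		},
	}
	b, _ := json.MarshalIndent(outs, "", "  ")
	fmt.Println(string(b))
}
```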

cyberj0g commented 1 year ago

Thanks for the very useful input @victorges. I finally have some idea of how it should work on the catalyst-api side. It makes sense to focus on single-file VOD first; if we later need to implement HLS and live streaming, we already have the initial research documented here.

Let's return both the CID and the full gateway URL from catalyst-api, and maybe only the CID from go-tools, so as not to suggest a specific gateway.

On naming, folder wrapping should work fine for immutable content, so that the gateway URL ends with the file name. I'll implement that.
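
A small sketch of the result that would give us, assuming directory wrapping and a path-style gateway; the types and names are illustrative only, not an existing catalyst-api type.

```go
// Result shape for an IPFS upload: the file is wrapped in a directory, so the
// gateway URL ends with the original file name, and both the CID and the full
// gateway URL are available.
package main

import "fmt"

type ipfsUploadResult struct {
	CID        string // what go-tools alone might return, to stay gateway-agnostic
	GatewayURL string // what catalyst-api can return for direct playback
}

func wrappedResult(gatewayHost, dirCID, filename string) ipfsUploadResult {
	return ipfsUploadResult{
		CID:        dirCID,
		GatewayURL: fmt.Sprintf("https://%s/ipfs/%s/%s", gatewayHost, dirCID, filename),
	}
}

func main() {
	r := wrappedResult("gateway.pinata.cloud", "bafy...dir", "video.mp4")
	fmt.Printf("%+v\n", r)
}
```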

if we are building our service on top of them we should probably trust them as a service provider

You are probably right, we can trust the provider at this stage. However, I believe the ultimate goal is to provide a fully verifiable, trustless flow for users who need it. Also, when low-latency streaming is implemented on B-O-T, we'll open the path to per-video-packet verification. It will likely require streaming verification on the storage side as well. Maybe @yondonfu could chime in on that.

yondonfu commented 1 year ago

Chiming in here.

I'd also say not to spend a lot of time on this. IPFS playback is not practical right now, and even though it is getting better, we should focus on what works today. So IMO starting with only the original "MP4" files on IPFS is enough (and that's all we have on Studio today as well, apart from the NFT metadata, which won't be handled by Catalyst anyway).

Focusing first on using IPFS to persist source mp4 assets, to match the status quo functionality in Studio, makes sense to me. As long as we have access to the source assets, we can always generate derived assets as needed (e.g. a source HLS playlist, transcoded renditions, etc.).

You are probably right, we can trust the provider at this stage. However, I believe the ultimate goal is to provide a fully verifiable, trustless flow for users who need it. Also, when low-latency streaming is implemented on B-O-T, we'll open the path to per-video-packet verification. It will likely require streaming verification on the storage side as well.

In this case, I see two trust relationships:

  1. The trust relationship b/w Catalyst and the IPFS gateway provider
  2. The trust relationship b/w Catalyst and its user

For 1, in the short term, we should be able to trust reputable gateway providers. Later on, we may want more flexibility to use gateway providers that are not all trusted, in which case we could look into verifiable retrieval from gateways in the Catalyst integration.
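
As a very rough sketch of what verifiable retrieval could look like: fetch the block from the gateway and recompute its hash locally against the expected CID. This only works as written for single raw blocks and for gateways that honor the trustless ?format=raw response; DAG-structured files would need incremental verification (e.g. via CAR responses), which is out of scope here.

```go
// Illustrative only: fetch a raw block from a gateway and verify it against the
// expected CID before trusting the bytes.
package main

import (
	"fmt"
	"io"
	"net/http"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

func fetchAndVerifyRawBlock(gatewayHost, expectedCID string) error {
	resp, err := http.Get(fmt.Sprintf("https://%s/ipfs/%s?format=raw", gatewayHost, expectedCID))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}

	// Recompute the CID of the returned bytes and compare with what we asked for.
	h, err := mh.Sum(data, mh.SHA2_256, -1)
	if err != nil {
		return err
	}
	got := cid.NewCidV1(cid.Raw, h)

	want, err := cid.Decode(expectedCID)
	if err != nil {
		return err
	}
	if !got.Equals(want) {
		return fmt.Errorf("gateway returned content that does not match %s", expectedCID)
	}
	return nil
}

func main() {
	if err := fetchAndVerifyRawBlock("cloudflare-ipfs.com", "bafy..."); err != nil {
		fmt.Println("verification failed:", err)
	}
}
```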

For 2, I think we can address this with the verifiable video/RMID work that we've been investigating. The basic idea is that the user calculates a unique hash ID for the raw media of an asset (i.e. video, audio, and metadata tracks plus relative timestamps), agnostic to the container; checks that this ID matches the one calculated by Studio/Catalyst; and uses the ID to check the content returned for a request, with the ability to verify that the raw media is correct even if the response is a transmuxed version. For the case where a transcoded rendition is returned, there would be a signed attestation. This is being fleshed out for Q4!