URIs in API Requests - Githubissues

j0sh commented 5 years ago

Currently, the verification API specifies the data for the source and transcoded renditions as a URI, which the API then downloads prior to verification. This is problematic for several reasons:

The broadcaster does not (currently) know the IP addresses that it is accessible by, so it cannot construct the host portion of its URL (and may not even know its own externally visible port if NAT'd/proxied).
This assumes bidirectional connectivity [1] is available between the broadcaster and verifier. In practice, this may not always be easily done, either due to user preference (B runs on localhost by default), security policy, or logistical complexities introduced by the network environment, eg running the broadcaster under Docker. Preferably, request flows would remain unidirectional.

[1] Bidirectional connectivity where both B / V make requests to one another:

B -- verification request --> V
B <----- video request ----- V

I was able to work around this issue for my local testing by doing a few things:

Assuming the verifier API container runs on the same local machine
Hard-coding a 127.0.0.1 host to URIs in outgoing requests
Adding a -network="host" argument to the docker run command in order to expose the API container to localhost.

Obviously this is not workable for the general case, so we should settle on ways to address this issue. The three ways I can see are:

Make the broadcaster aware of its own IP address by forcing the user to configure it via CLI flags at startup time.
Use an external object store to store segments, which is guaranteed to be publicly reachable.
Directly push the binary data to the verifier rather than sending a URL to the data.

Each option is described below, but the third one (pushing binary data) gets my vote.

Make the broadcaster aware of its own IP address by forcing the user to configure it via CLI flags at startup time.

This does not require changes to the verifier, only the go client. However, there are a number of drawbacks to this approach:

Fixates on the requirement of bidirectional connectivity between B/V, when a unidirectional flow is preferable
Offers a subjectively worse UX for users due to increased configuration requirements
May substantially increase the surface area of the broadcaster that's exposed to the world (currently, it only only listens to localhost by default).
Running the broadcaster under environments such as Docker may make this extra-challenging, where additional considerations may be required to make the broadcaster aware of its own IP in order to announce it to the verifier.

This feels disproportionate simply to satisfy verification requirements, at least compared to the alternative solutions.

Note that our transcoders do have a public IP attached in order for broadcasters to discover and connect to them, but users are guided through this process, and it is mostly a set-and-forget operation for them. This has also shown itself to be somewhat tricky to handle in our containerized testing environment.

Use an external object store to store segments, which is guaranteed to be publicly reachable.

This is the easiest method for now, since the broadcaster already has support for Amazon S3 and Google Cloud Storage. No changes would be required to the go client or the verifier to support this. However, in the long term, there are a few issues with depending on cloud storage to transport segments from broadcaster to verifier:

Not all users will have cloud storage accounts. This would be an additional barrier to entry for those users.
Mandatory centralized storage is incongruous with the decentralized ethos that verification is striving for.
May greatly increase costs for users, especially if video needs to cross an egress boundary.
The security and ops story around cloud storage and Livepeer is not completely developed yet. There are still tradeoffs to be analyzed and communicated around the topic of granting some bucket permissions to potentially untrusted actors.

Directly push the binary data to the verifier rather than sending a URL to the data.

This is the method that makes the least assumptions about the environment and requires the least amount of setup by the user. However, it does require the most changes to the verifier since it changes how things are serialized. There are a couple of ways I can see for us to do this:

Base64 the video files and send them inline in the JSON. Incredibly inefficient and ugly, but relatively "easy".
Post the request as a multipart/form-data or multipart/mixed, with the JSON parameters as one form field, and referring to the uploaded form-field names where we currently have URIs. Something like this:

Content-Type: multipart/form-data; boundary=asdf
Content-Length: ...

--asdf
Content-Disposition: form-data; name="source"
Content-Type: video/mp2t

... binary data for source ...

--asdf
Content-Disposition: form-data; name="rendition1"
Content-Type: video/mp2t

... binary data for first rendition ...

--asdf
Content-Disposition: form-data; name="rendition2"
Content-Type: video/mp2t

... binary data for second rendition ...

--asdf
Content-Disposition: form-data; name="parameters"
Content-Type: application/json

{
   "source":"source",
   "renditions":[
      {
         "uri":"rendition1"
      },{
        "uri":"rendition2"
      }
   ],
   "orchestratorID":"foo",
}
--asdf--

yondonfu commented 5 years ago

I was able to work around this issue for my local testing by doing a few things

Another workaround when the verifier is running on the same machine as the broadcaster could be for the broadcaster to write the source + rendition data to files in the mounted volume used by the verifier Docker container and then the URIs included in the request would be the location of the files in the mounted volume i.e. /stream/source.ts. Of course this only helps in the situation where the broadcaster and verifier are on the same machine.

For setups where the broadcaster and verifier could be on separate machines, I agree that directly pushing binary data to the verifier seems like the best option out of the ones outlined in the post. I favor using the multipart/form-data request rather than including the base64 encoded in the JSON request (the easier path, but I don't see a scenario where we would actually stick with this as a solution).

May greatly increase costs for users, especially if video needs to cross an egress boundary

Good point. Leveraging an existing external object store is appealing if the broadcaster is already using one, but since the decision to also use the object store for fetching data for the verifier will require reasoning about these egress costs as well as setting up proper security policies so the verifier can fetch the data I think directly pushing binary data in the short term makes sense.

j0sh commented 4 years ago

Leveraging an existing external object store is appealing if the broadcaster is already using one

On that note, it would be nice if we also preserved the current ability to pass in URLs in addition to pushing binary data, because having the verifier pull from object store may still be useful some day.

The current JSON format could be kept, and signaled by simply sending JSON (same way we do now). We'd only send a multipart request if we were pushing binary data. There are many possible approaches for doing this though; I'm OK with whichever is simplest to implement.

ndujar commented 4 years ago

Another option is to avoid downloading the file. OpenCV VideoCapture allows for directly passing a URL as argument, instead of a file in the filesystem. It basically streams it into the VideoCapture object. This has the added advantage of not having to wait for completion of download, as OpenCV . If the verifier is in the same machine as the broadcaster, it could simply provide a localhost port to attach to in order to extract the source.

This requires a bit or modification of the verifier API, but probably is for better as it simplifies the code to some extent.

j0sh commented 4 years ago

OpenCV VideoCapture allows for directly passing a URL as argument, instead of a file in the filesystem

Wouldn't we still encounter the original issues of requiring bidirectional connectivity between the broadcaster and verifier, and that the broadcaster needs to know its own IP address relative to the verifier?

Using a shared volume for verification works well enough, even for remote verifiers. The broadcaster just needs the (remote) volume locally mounted, eg via sshfs, NFS or some such. With that in place, this issue is less urgent, although it would certainly be nice to resolve at some point. See https://github.com/livepeer/go-livepeer/commit/1aca449a6927f34c9090b7fe8011028753a97262

ndujar commented 4 years ago

Wouldn't we still encounter the original issues of requiring bidirectional connectivity between the broadcaster and verifier, and that the broadcaster needs to know its own IP address relative to the verifier?

I am not very savvy on this matter, but how do we solve the problem in the case of the orchestrators? How do they retrieve the job to be transcoded? Couldn't the verifier be considered as another orchestrator from the perspective of the broadcaster, only that instead of transcoded video it returns the .json response?

Using a shared volume for verification works well enough, even for remote verifiers. The broadcaster just needs the (remote) volume locally mounted, eg via sshfs, NFS or some such. With that in place, this issue is less urgent, although it would certainly be nice to resolve at some point. See https://github.com/livepeer/go-livepeer/commit/1aca449a6927f34c9090b7fe8011028753a97262

When a URL is passed as argument, OpenCV (ffmpeg backend) parses it as a stream, hence reducing delivery time because there is no need to complete the download. The only disadvantage I see to sharing volume is that verifier needs to wait until file is transcoded and then downloaded to the shared volume so it can start extracting features. This opens two scenarios:

Remote verification: The retrieval of the source video may happen while waiting for orchestrators to be done, so it is not a problem. Once the pool of transcodifications is ready to be verified, they need to be downloaded to the shared volume. Depending on the size of the files, this can be costly.
Local verification: Here, the retrieval of the source is even less of a problem because it is locally available in the shared volume. For dealing with the transcodifications, situation is the same as above. Am I missing something here?

j0sh commented 4 years ago

how do we solve the problem in the case of the orchestrators? How do they retrieve the job to be transcoded?

The broadcaster pushes the segments to the orchestrator, rather than the orchestrator pulling them - which is also the preferred approach to solve the problem here.

The only disadvantage I see to sharing volume

Yeah, shared volumes are not the best long-term solution, but just wanted to mention they are a satisfactory workaround for the moment. We can revisit this issue when we start hitting the operational limits of shared volumes.

yondonfu commented 4 years ago

Closed by #114

livepeer / verification-classifier

URIs in API Requests #64

Make the broadcaster aware of its own IP address by forcing the user to configure it via CLI flags at startup time.

Use an external object store to store segments, which is guaranteed to be publicly reachable.

Directly push the binary data to the verifier rather than sending a URL to the data.