kyriediculous closed this issue 3 years ago
@kyriediculous You can't 'query' the broadcaster on the `/metrics` endpoint - this endpoint is just a plain-text representation of internal application counters at that moment. To run queries like `avg_over_time` you will need to also spin up a Prometheus instance that scrapes data from the broadcaster instance - then you can run queries against Prometheus.
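Once a Prometheus instance is scraping the broadcaster, queries like `avg_over_time` go through Prometheus's HTTP API (`/api/v1/query`) rather than the raw `/metrics` endpoint. A minimal sketch of building such a query URL in Go; the Prometheus base URL is a placeholder:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildPromQuery builds a Prometheus HTTP API instant-query URL for a
// PromQL expression such as avg_over_time(...). The base URL is
// whatever host/port your Prometheus instance listens on.
func buildPromQuery(base, promql string) string {
	v := url.Values{}
	v.Set("query", promql)
	return base + "/api/v1/query?" + v.Encode()
}

func main() {
	u := buildPromQuery("http://prometheus:9090",
		"avg_over_time(livepeer_upload_time[1m])")
	fmt.Println(u)
}
```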
I guess we can spin up a docker container for both the broadcaster and a Prometheus instance for said broadcaster programmatically?
Given that all this will be running inside k8s, it could be easier to spin up the stream tester, Prometheus and the broadcaster inside one pod - so they will always be running. Then point the broadcaster's `-orchWebhookUrl` to the stream tester and make the stream tester respond with the needed orchestrators. That way there will be no need to spin anything up or down programmatically.
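A webhook endpoint of that shape could be sketched as follows. The exact response schema go-livepeer expects from `-orchWebhookUrl` should be verified against its webhook discovery code; a JSON array of objects with an `address` field is assumed here:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// orchEntry is the response shape this sketch assumes go-livepeer's
// webhook discovery expects; verify against the go-livepeer source.
type orchEntry struct {
	Address string `json:"address"`
}

// orchestratorsJSON renders the orchestrator list the webhook serves.
func orchestratorsJSON(addrs []string) ([]byte, error) {
	entries := make([]orchEntry, 0, len(addrs))
	for _, a := range addrs {
		entries = append(entries, orchEntry{Address: a})
	}
	return json.Marshal(entries)
}

// orchHandler would back the stream tester's /orchestrators endpoint,
// returning only the orchestrators currently under test.
func orchHandler(addrs []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := orchestratorsJSON(addrs)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write(body)
	}
}

func main() {
	body, err := orchestratorsJSON([]string{"https://34.13.56.251:8935"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // [{"address":"https://34.13.56.251:8935"}]
}
```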
Thanks Ivan, I'm not very familiar with the webhooks in go-livepeer and never used them. I'll look into it, because that seems like a better route to take if we don't have to spin pods up every time.
I suppose we can then use tags to filter the metrics we require (using `NodeID` or `ManifestID`?)
Also, one note: the current specification treats the stream-tester as a general-purpose tool and leaves fetching all orchestrators to benchmark as an external operation, if that's how a user or service wishes to use the stream tester. i.e. we fetch the orchestrators in a different service and then provide the orchestrator we want to benchmark in the POST request to the stream tester.
Alternatively, we don't provide such an `orchestrators` field in the request object. Instead we specify something like "leaderboard-benchmark", which would instruct the stream-tester to fetch all orchestrators and run a test stream for each of them using the `orchWebhookUrl`. This feels very application-specific, but would then only require a single cron-job per region.
I think the `orchWebhookUrl` could definitely be useful.
One downside seems to be the 1-minute refresh interval, which is currently hardcoded. We have the option to use a fork, I guess, or the alternative is just to use a 1-minute interval between test runs.
I guess the workflow would look something like this:
Prerequisite: stream-tester exposes an endpoint to get orchestrators, `/orchestrators`, which will be provided to the broadcaster's `-orchWebhookUrl` flag:

```
livepeer -broadcaster ... -orchWebhookUrl http://stream-tester/orchestrators
```

1. `POST /start_streams` with the `orchestrators` field populated
2. The broadcaster discovers orchestrators through the `/orchestrators` endpoint from the prerequisite

IIUC the following modifications for stream-tester are being proposed:
What if the scope of modifications for the stream-tester was reduced to just the configurable orch webhook server, with the rest of the functionality mentioned above implemented in a separate app ("orch-tester"?) in the `cmd` package of this repo? This app would support the following:
[1] Coupled with the 1-minute wait before starting a test, this might sidestep the fact that latency/duration-related metrics are not labeled with a `manifestID` or the orchestrator used right now. Those labels could be added, but there are negative Prometheus performance implications associated with high-cardinality labels.
We should then be able to automate running the app at a regular interval as a k8s CronJob with the stream-tester, broadcaster and monitoring instances already running in the same k8s cluster as well.
I suggest the above because I wonder if there is an opportunity to compose stream-tester together with a broadcaster and a monitoring instance, and push most of the additional functionality which seems specific to this leaderboard metrics collection use case into an additional app in the `cmd` package.
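A rough sketch of what the CronJob mentioned above might look like; the image name, flags, schedule and region are all placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: orch-tester
spec:
  schedule: "0 * * * *"   # hourly; pick whatever cadence fits
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: orch-tester
              image: livepeer/orch-tester:latest   # hypothetical image
              args:
                - -streamTester=http://stream-tester:7934   # hypothetical flags
                - -region=FRA
          restartPolicy: OnFailure
```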
Re: Leaderboard Backend
> If we use a cloud-based database for a production environment then having a server that can be an intermediate query layer is far more scalable and puts less strain on the clients (user's web browser). Alternatively metrics can be queried in a serverless function.
Sounds like a good idea. Would each region make an API call to a serverless function after each test in order to store the results of the test in the cloud DB?
The API interface that would be exposed to a client and the aggregated stats look good. Just to clarify - the `score` and `success_rate` fields in the response for `/aggregated_stats` would be the average for the time period (as indicated by the `since` field)?
> Would each region make an API call to a serverless function after each test in order to store the results of the test in the cloud DB?
Each stream tester could write it directly to storage, but now that you mention it, handing that off to a serverless function might actually be a better idea.
> The API interface that would be exposed to a client and the aggregated stats look good. Just to clarify - the `score` and `success_rate` fields in the response for `/aggregated_stats` would be the average for the time period (as indicated by the `since` field)?
That is correct.
> Query the Livepeer subgraph for a list of all active orchestrators with their ETH addresses and their current service URI. The Livepeer subgraph would need to be updated to index orchestrator service URI updates.
It feels almost easier to just have this "orch-tester" service consume the `go-livepeer/eth` package here. Yes, it requires RPC requests then, but it doesn't require subgraph changes.
> I suggest the above because I wonder if there is an opportunity to compose stream-tester together with a broadcaster and a monitoring instance and push most of the additional functionality which seems specific for this leaderboard metrics collection use case into an additional app in the `cmd` package.
If I understand correctly, you would then also have any Prometheus queries for metrics gathering take place in this additional service rather than the stream-tester? That seems okay to me, although some of them might be useful for other usage within the stream-tester as well.
> It feels almost easier to just have this "orch-tester" service consume the `go-livepeer/eth` package here. Yes it requires RPC requests then but it doesn't require subgraph changes.
Using the `go-livepeer/eth` package will also require event indexing in order to construct the list of active orchestrator ETH addresses + service URIs (which is implemented in go-livepeer, but not in an easily exportable manner at the moment), since contract RPC calls would only give us the pending orchestrator set. I suspect adding service URI indexing to the subgraph and just querying the subgraph will be more straightforward than either re-implementing the event indexing functionality in the orch-tester or refactoring code in go-livepeer to export that functionality. Plus, being able to fetch the list of service URIs for active orchestrators from the subgraph seems useful in its own right, which would be an added bonus.
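Assuming the proposed `serviceURI` field were added to the subgraph's `Transcoder` entity, the query might look roughly like this (the field does not exist yet, per the discussion above):

```graphql
# Hypothetical query against the Livepeer subgraph, assuming the
# proposed serviceURI field has been indexed.
{
  transcoders(where: { active: true }) {
    id          # orchestrator ETH address
    serviceURI  # proposed field; not indexed yet
  }
}
```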
> If I understand correctly you would also have any prometheus queries for metrics gathering take place in this additional service then rather than the stream-tester?
Yep that was my suggestion.
We can use `go-livepeer/eth` and call `Client.TranscoderPool()`, and then, for each address, look up its service URI.
While I realise this will not include orchestrators that will become deactivated in the next round, I'm wondering if benchmarking orchestrators that will soon become deactivated is useful anyway.
Admittedly it would be nice to have the service URI on the subgraph
closed by #62
Performance Leaderboard MVP
Specification
Stream-Tester
Stream-tester will be spun up in server mode in each region we want to run performance benchmarking from. This way we can utilize its REST interface to send benchmarking requests for several orchestrators using cron-jobs.
`POST /start_streams`

Current request object

The request will be expanded to take in the following (optional) fields:

- `orchestrators` (`[]string`): orchestrators to use for the broadcaster's `-orchAddr` flag on startup, e.g. `["34.13.56.251:8935", "12.55.39.145:89356"]`
- `metrics` (`[]string`)*: PromQL queries to run, e.g. `["livepeer_success_rate offset 1m", "avg_over_time(livepeer_upload_time)[1m]"]`
\* The PromQL queries provided in the `metrics` field could be hardcoded as well for an MVP.

When this POST request returns a 2xx code, the metrics collection can take place.
If the `orchestrators` field is provided and the `host` field omitted, stream-tester will spin up a new broadcaster instance using the orchestrators for the `-orchAddr` flag and `-metrics=true`. The process will be torn down after the stream has completed and metrics are gathered.

`GET /stats`

Currently the `/stats` endpoint takes optional parameters. We can add additional optional parameters `orchestrator` and `since` (see the storage section) that allow filtering stats per orchestrator and since a specific timestamp.

Region
Stream-tester will have a new optional flag on startup, `-region`, which takes a string of the airport code (cf. Livepeer.com API deployments) as argument. This way the stream-tester can self-identify as running in a particular region.

Current Regions:
Storage
The stream-tester will be expanded with an interface that allows it to write metrics gathered from a stream to a storage layer.
Currently the stream-tester keeps stats in memory per manifestID, but it doesn't follow a clear interface to do so. We can create a common storage-layer interface and allow the user to specify the storage driver on node startup.
`-db <host>`
If no flag is specified, memory storage will be used.
If the `host` starts with the `mongo://` (local) or `mongo+srv://` (Atlas) prefix we can safely assume we'll need to use MongoDB for the storage driver.

If neither of the above, use Postgres as the DB driver; when using a local Postgres DB a filesystem path will be provided, otherwise a proxy URL.
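The prefix rules above could be implemented roughly as follows; the driver names are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// storageDriver picks a storage backend from the -db flag value,
// following the prefix rules described above: no value means in-memory
// storage, mongo:// or mongo+srv:// means MongoDB, anything else
// (filesystem path or proxy URL) means Postgres.
func storageDriver(host string) string {
	switch {
	case host == "":
		return "memory"
	case strings.HasPrefix(host, "mongo://"), strings.HasPrefix(host, "mongo+srv://"):
		return "mongodb"
	default:
		return "postgres"
	}
}

func main() {
	for _, h := range []string{"", "mongo+srv://cluster0.example.net", "/var/lib/streamtester/db"} {
		fmt.Printf("%q -> %s\n", h, storageDriver(h))
	}
}
```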
Interface
Metrics gathering
While stream-tester gathers per-stream metrics by itself, we can also have it query the broadcaster node it uses for Prometheus metrics.
One caveat is that the metrics endpoint is exposed on the same port as the CLI webserver so there would need to be some CORS restrictions imposed on all CLI write endpoints. Alternatively the stream-tester would have to run on the same machine as the broadcasters that are used.
For each metric specified in the `metrics` field in the request body for `/start_streams` we will then use the `broadcaster:7935/metrics` endpoint to execute the provided PromQL queries.

After collecting the metrics, the data, alongside the region (provided on stream-tester startup), will be sent to the storage API.
calculated as `segments_downloaded / segments_sent`
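That ratio is just a guarded division; a minimal sketch:

```go
package main

import "fmt"

// successRate computes the per-stream success rate as
// segments_downloaded / segments_sent, guarding against a zero
// denominator when no segments were sent.
func successRate(segmentsDownloaded, segmentsSent int) float64 {
	if segmentsSent == 0 {
		return 0
	}
	return float64(segmentsDownloaded) / float64(segmentsSent)
}

func main() {
	fmt.Println(successRate(57, 60)) // 0.95
}
```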
Leaderboard Backend
Having the client fetch all metrics for every region from all the separate stream-tester nodes is not a very scalable solution.
If we use a cloud-based database for a production environment then having a server that can be an intermediate query layer is far more scalable and puts less strain on the clients (user's web browser).
Alternatively metrics can be queried in a serverless function. For example MongoDB Atlas plays really well with AWS Lambda, where a Mongo connection exists for the duration of the Lambda function: https://docs.atlas.mongodb.com/best-practices-connecting-to-aws-lambda/.
The REST interface could be fairly simple. One endpoint to get aggregated stats and one endpoint returning the raw logs.
`GET /aggregated_stats?orchestrator=<orchAddr>&region=<Airport_code>&since=<timestamp>`
If `orchestrator` is not provided, the response will include aggregated scores for all orchestrators.

If `region` is not provided, all regions will be returned in the response. "GLOBAL" would be included as well, which would be the average of the regions.

If `since` is not provided, we will default to something sensible, for example 24h.

example response
`GET /raw_stats?orchestrator=<orchAddr>&region=<Airport_code>&since=<timestamp>`
If no parameter for `orchestrator` is provided, the request will return `400 Bad Request`.

example response
For each region, return an array of the metrics from the 'metrics gathering' section as a "raw dump".