Open pauldg opened 2 weeks ago
Here is where the scripts are collected at the moment: https://github.com/pauldg/bh2024-metric-collection-scripts/
And this will be "push" mode, right? I mean, the connection is initiated from the Pulsar endpoint?
Yes, the Pulsar destinations will push the metrics to the queue (the push interval still needs to be determined).
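A minimal sketch of what such a push-mode producer could look like. The queue name, interval, and metric fields below are assumptions, not decisions from this thread; the `channel` would be a RabbitMQ channel opened by the endpoint (e.g. via pika).

```python
# Hypothetical push-mode producer sketch: each Pulsar endpoint periodically
# publishes its own metrics to a shared RabbitMQ queue. Queue name, interval,
# and payload fields are assumptions for illustration.
import json
import time

QUEUE = "destination-metrics"  # assumed queue name
INTERVAL = 300                 # push interval in seconds (to be determined)

def build_payload(destination_id, free_cpus, free_mem_mb):
    """Assemble one metrics message for this destination."""
    return json.dumps({
        "destination": destination_id,
        "free_cpus": free_cpus,
        "free_mem_mb": free_mem_mb,
        "timestamp": int(time.time()),
    })

def push_forever(channel, destination_id):
    """Publish metrics on a fixed interval (RabbitMQ channel already open)."""
    while True:
        # A real producer would query HTCondor/Slurm/etc. here instead
        # of these placeholder numbers.
        body = build_payload(destination_id, free_cpus=8, free_mem_mb=16384)
        channel.basic_publish(exchange="", routing_key=QUEUE, body=body)
        time.sleep(INTERVAL)
```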
Many thanks for working on this!
My question is whether this tpv-metascheduler-api repository is the right place to collect destination status. I was wondering whether we would need two separate repositories instead:

- tpv-metascheduler-api: runs the meta-scheduling algorithms.
- metrics-collector: collects destination status and sends it back to Galaxy via RabbitMQ (e.g. reusing the scripts proposed in https://github.com/usegalaxy-eu/tpv-metascheduler-api/pull/14/).

Each destination would have a metrics-collector installed alongside Pulsar/ARC and HTCondor/SGE/Slurm (by the way, should we design this as an opt-in service, in case some Pulsar destinations don't want to share this information?). However, there would only be one instance of the tpv-metascheduler-api service running per Galaxy instance.
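On the metrics-collector idea, the consumer side could be as simple as keeping the latest message per destination so the metascheduler API always sees current status. This is only a sketch under that assumption; the message shape and queue wiring are illustrative, not agreed upon.

```python
# Hypothetical consumer-side sketch: cache the most recent metrics message
# per destination. A RabbitMQ consumer callback would call handle_message()
# with each message body; the metascheduler API reads latest_metrics.
import json

latest_metrics = {}  # destination id -> most recent metrics dict

def handle_message(body):
    """Update the in-memory cache from one RabbitMQ message body
    and return the destination id it belonged to."""
    msg = json.loads(body)
    latest_metrics[msg["destination"]] = msg
    return msg["destination"]
```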
What do you think?
Yes, we will move the producer script to the pulsar-deployment repo (and/or create a dedicated Ansible role) and add an optional variable; based on it, the Ansible tasks will decide whether to copy the script and set up a cron job. The consumer will have an Ansible role so admins can easily install it. This consumer role will also include the Telegraf deployment tasks.
Since this is a PoC, we didn't implement SLURM and Kubernetes metrics collection in the producer script; that will follow. We want to create a rank function that could be added to a user's TPV configuration and run some tests on EU to see how this works together.
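To make the rank-function idea concrete, here is a standalone sketch of ranking candidate destinations by the collected metrics. The field names (`free_cpus`, `free_mem_mb`) are assumptions carried over from the producer scripts; in TPV this logic would live inside a destination `rank` expression rather than a free function.

```python
# Hypothetical rank function sketch: order candidate destinations by the
# metrics pushed back from the Pulsar endpoints. Field names are assumed;
# destinations with no metrics sort last.
def rank_destinations(candidates, metrics):
    """Return candidates sorted so the destination with the most free CPUs
    (free memory as tiebreaker) comes first."""
    def score(dest):
        m = metrics.get(dest, {})
        return (m.get("free_cpus", -1), m.get("free_mem_mb", -1))
    return sorted(candidates, key=score, reverse=True)
```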
The Ansible role to deploy the API itself is already available here. I made it for the deployment on the ESG instance, where it is currently in use. I will extract it into an individual role in its own dedicated repo.
In order to collect destination metrics (available CPU/memory) from Pulsar destinations in the network, we can re-purpose the RabbitMQ connection to send back HTCondor status metrics, which can then be used for scheduling.
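For the HTCondor side, one way to derive those metrics is to aggregate free slots from `condor_status` output. The invocation and column layout below are assumptions for illustration (e.g. something like `condor_status -af Cpus Memory State`); the parser only shows the aggregation idea.

```python
# Hypothetical sketch: turn condor_status output into the free-CPU/memory
# numbers sent back over the existing RabbitMQ connection. Assumes one slot
# per line in the form "<cpus> <memory_mb> <state>".
def parse_condor_status(output):
    """Sum CPUs and memory (MB) over Unclaimed (i.e. free) slots."""
    free_cpus = 0
    free_mem_mb = 0
    for line in output.strip().splitlines():
        cpus, mem, state = line.split()
        if state == "Unclaimed":
            free_cpus += int(cpus)
            free_mem_mb += int(mem)
    return {"free_cpus": free_cpus, "free_mem_mb": free_mem_mb}
```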