usegalaxy-eu / tpv-metascheduler-api

Metascheduler for TPV as Service
MIT License

Collect destination status through rabbitmq #17

Open pauldg opened 2 weeks ago

pauldg commented 2 weeks ago

In order to collect destination metrics (available CPU/memory) from Pulsar destinations in the network, we can re-purpose the RabbitMQ connection to send back HTCondor status metrics that can be used for scheduling.

  1. Producer:
    1. We write a Python script that is deployed by the admin on the Pulsar side.
    2. The script collects the metrics from the scheduler and pushes them to its queue in the MQ using kombu.
      1. The script should have access to the Pulsar conf file (to read the MQ credentials).
  2. Consumer:
    1. There will be another Python script on the consumer side, running as a Telegraf task, which acknowledges and fetches the metrics from the queue in the MQ.
    2. It compares the timestamp of the last entry in InfluxDB to its local server time, pushes the new metrics, and sets a field/tag for online/offline. If there is no metric in the queue, the Telegraf task automatically sets the endpoint to offline by pushing empty/null metrics to InfluxDB, and the TPV API can then eliminate that destination from its candidate list.
      1. The consumer will run on the Galaxy side (for example, on the maintenance node in the EU), which has access to the job conf where it can read the MQ credentials of each queue.
      2. The consumer Telegraf task/script will be parallelized (using multiprocessing) to talk to multiple queues in the MQ.
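The producer side described above could be sketched roughly as follows with kombu. The connection URL, queue/exchange names, and the payload fields are illustrative assumptions (in practice the script would read the MQ credentials from the Pulsar conf file and query `condor_status`):

```python
# Hypothetical producer sketch: collect scheduler metrics on the Pulsar side
# and push them to this destination's queue in the MQ via kombu.
# AMQP_URL, QUEUE_NAME, and the metric fields are placeholder assumptions.
import json
import time

AMQP_URL = "amqp://user:pass@mq.example.org:5672//"   # read from the Pulsar conf in practice
QUEUE_NAME = "destination-metrics.pulsar-example"     # assumed: one queue per destination


def collect_metrics():
    """Return available CPU/memory for this destination (stub values here).

    The real script would query the local scheduler (e.g. condor_status)
    instead of returning constants.
    """
    return {
        "destination": "pulsar-example",
        "cpus_available": 64,
        "mem_available_mb": 256000,
        "timestamp": time.time(),
    }


def publish(metrics):
    """Publish one metrics message to the destination's queue."""
    # kombu is imported lazily so the collection logic is usable without a broker.
    from kombu import Connection, Exchange, Queue

    exchange = Exchange("destination-metrics", type="direct")
    queue = Queue(QUEUE_NAME, exchange, routing_key=QUEUE_NAME)
    with Connection(AMQP_URL) as conn:
        producer = conn.Producer(serializer="json")
        producer.publish(
            metrics,
            exchange=exchange,
            routing_key=QUEUE_NAME,
            declare=[queue],  # ensure the queue exists before publishing
        )


if __name__ == "__main__":
    print(json.dumps(collect_metrics()))
    # publish(collect_metrics())  # uncomment with a reachable broker
```

Run on an interval (the cadence is still to be decided in the thread below), this gives each destination its own metrics stream in the MQ.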
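The consumer-side decision logic (compare the newest timestamp against local time, tag online/offline, push an offline marker for an empty queue) could look like this sketch, emitting InfluxDB line protocol suitable for a Telegraf `exec` input. The staleness threshold, measurement name, and field names are assumptions, not the agreed design:

```python
# Hypothetical consumer sketch for the Telegraf side: given the messages
# drained from one destination's queue, classify the destination as
# online/offline and render an InfluxDB line-protocol record.
# STALE_AFTER_S and the measurement/field names are placeholder assumptions.
import time

STALE_AFTER_S = 600  # assumed: offline if the newest metric is older than this


def to_line_protocol(dest, metrics, online):
    """Render one line-protocol record with an online/offline tag."""
    fields = ",".join(
        f"{k}={v}" for k, v in metrics.items()
        if k not in ("destination", "timestamp")
    )
    status = "online" if online else "offline"
    return f"destination_status,destination={dest},status={status} {fields}"


def evaluate(dest, messages, now=None):
    """Pick the newest message for a destination and classify it.

    messages: list of dicts as produced on the Pulsar side, each with a
    "timestamp" field; an empty list means nothing arrived in the queue.
    """
    now = now if now is not None else time.time()
    if not messages:
        # Empty queue: emit an offline marker so the TPV API can drop
        # this destination from its candidate list.
        return f"destination_status,destination={dest},status=offline offline=1i"
    newest = max(messages, key=lambda m: m["timestamp"])
    online = (now - newest["timestamp"]) < STALE_AFTER_S
    return to_line_protocol(dest, newest, online)
```

A real consumer would wrap this in a multiprocessing pool, one worker per queue, as described in point 2 above.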
sanjaysrikakulam commented 2 weeks ago

Here is where the scripts are collected at the moment: https://github.com/pauldg/bh2024-metric-collection-scripts/

abdulrahmanazab commented 2 weeks ago

And this will be "push" mode right? I mean the connection is initiated from the Pulsar endpoint?

sanjaysrikakulam commented 2 weeks ago

> And this will be "push" mode right? I mean the connection is initiated from the Pulsar endpoint?

Yes, the Pulsar destinations will push the metrics (the interval needs to be determined) to the queue.

sebastian-luna-valero commented 2 weeks ago

Many thanks for working on this!

My question is whether this tpv-metascheduler-api repository is the right one to collect destination status.

I was wondering whether we would need to have two separate repositories instead:

Each destination will have a metric collector installed alongside Pulsar/ARC and HTCondor/SGE/Slurm (by the way, should we design this as an opt-in service, in case some Pulsar destinations don't want to share this information?). However, there will only be one instance of the tpv-metascheduler-api service running per Galaxy instance.

What do you think?

sanjaysrikakulam commented 1 week ago

Yes, we will move the producer script to the pulsar-deployment repo (and/or create a dedicated Ansible role) and add an optional variable; based on it, the Ansible tasks for copying the script and setting up a cron job will make their decisions. The consumer will have its own Ansible role, so admins can easily install it. This consumer role will also include the tasks for deploying the Telegraf task.

Since this is a PoC, we didn't implement Slurm and Kubernetes metrics collection in the producer script; this will follow later. We want to create a rank function that can be added to a user's TPV conf and run some tests on EU to see how this all works together.
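As a rough sketch of what such a rank function could do with the collected status, the helper below drops destinations tagged offline and prefers those with the most free CPUs. The function name, the shape of the status dictionary, and the tie-breaking on memory are illustrative assumptions, not the actual TPV interface:

```python
# Hypothetical rank helper for TPV: given candidate destinations and the
# latest status per destination (as gathered via the MQ/InfluxDB pipeline),
# exclude offline destinations and sort the rest best-first.
# The status-dict shape and the sort key are placeholder assumptions.

def rank_by_availability(candidate_destinations, status):
    """Return candidates ordered best-first, excluding offline destinations.

    candidate_destinations: list of destination ids (plain strings here)
    status: dest id -> {"online": bool, "cpus_available": int,
                        "mem_available_mb": int}
    """
    online = [d for d in candidate_destinations if status.get(d, {}).get("online")]
    return sorted(
        online,
        key=lambda d: (status[d]["cpus_available"], status[d]["mem_available_mb"]),
        reverse=True,  # most free CPUs (then memory) first
    )
```

For example, with one offline and two online destinations, the offline one is eliminated and the remaining two are ordered by free CPUs.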

The Ansible role to deploy the API itself is already available here. I made it for the deployment on the ESG instance, where it is currently in use. I will extract it as an individual role into its own dedicated repo.

sanjaysrikakulam commented 1 week ago

xref:

  1. https://github.com/usegalaxy-eu/tpv-metascheduler-api/pull/19
  2. https://github.com/usegalaxy-eu/tpv-metascheduler-api/pull/20