glideinWMS / glideinwms

The glideinWMS Project
http://tinyurl.com/glideinwms
Apache License 2.0

Blackhole Detection: stop accepting jobs if they are consumed at a rate higher than the configured limit and declare the Glidein a blackhole #331

Closed: namrathaurs closed this issue 4 months ago

namrathaurs commented 1 year ago

Sometimes, even if the Glidein tests are successful, the Glidein quickly fails all the jobs running on it and asks for new ones. This behavior is nicknamed "blackhole". By setting a maximum consumption rate and declaring a Glidein that consumes jobs faster than the configured rate unfit (and unable to accept/match new jobs), we can protect the system from undetected failures. This is the mechanism to be implemented by this ticket.

A new HTCSS feature called STARTD_LATCH_EXPRS (https://opensciencegrid.atlassian.net/browse/HTCONDOR-171) should be useful for implementing an irreversible condition that triggers the blackhole status.
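As a rough illustration of how a latched expression could encode the "tripped once, stays tripped" condition, here is a hypothetical configuration fragment. The expression name and threshold are invented for this sketch, RecentJobBusyTimeCount is the statistics attribute mentioned later in this ticket, and the exact attributes the startd publishes per latch name should be checked against the HTCondor >= 10.4 documentation.

```
# Hypothetical sketch only; names and threshold are illustrative.
# Register an expression name that the startd should latch (track and remember).
STARTD_LATCH_EXPRS = $(STARTD_LATCH_EXPRS) GLIDEIN_TOO_MANY_SHORT_JOBS

# Example condition: many jobs churned through the recent statistics window.
GLIDEIN_TOO_MANY_SHORT_JOBS = (RecentJobBusyTimeCount > 10)

# Once the expression has evaluated to True, the startd records that fact in
# the slot ad (per the meeting notes below, two attributes per latch name), so
# a policy expression can keep refusing jobs even after the count drops again.
```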

This feature request is being ported from Redmine ticket R#23253 to 3.10.x

Blackhole Detection Logs: Publishing Expression/ClassAd Attributes to the StartdLogs

Description:

The results of an expression evaluating whether a node is a blackhole (R#19214) are published in the machine ClassAd. We would like to see them in the StartdLog. The logs about blackhole detection are covered in the condor logs and the glidein logs (client directory in the Factory).

We were discussing with the HTCondor team about publishing an expression/ClassAd attribute to the StartdLog and/or an external file. Among the different ideas, it came up that we cannot use a hook triggered each time an attribute changes value, as that would be impractical.

If we want to use a startd_cron that periodically checks and publishes the value (a script accessing the machine ClassAd and writing out the interesting value), it could work, although we would not be able to write a message to the StartdLog very easily.

TODO: Implement the periodic checking and writing to the startd logs with startd_cron (a rough sketch follows below).
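One possible shape for this, assuming a small script that the startd runs periodically, reads the flag from the local startd's ad, echoes it so it is merged back into the machine ClassAd, and appends a line to a side log (since writing directly to the StartdLog is not easily done from a cron job). The job name, script path, attribute names, and log location are all assumptions for illustration:

```
# Hypothetical STARTD_CRON hook (job name, path, and period are illustrative).
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) BLACKHOLE_CHECK
STARTD_CRON_BLACKHOLE_CHECK_EXECUTABLE = $(LOCAL_DIR)/blackhole_check.sh
STARTD_CRON_BLACKHOLE_CHECK_PERIOD = 5m
STARTD_CRON_BLACKHOLE_CHECK_MODE = Periodic
```

```bash
#!/bin/bash
# blackhole_check.sh - illustrative sketch, not part of GlideinWMS.
# Reads the blackhole flag from the local startd (no collector access) and:
#  1. echoes it, so the startd cron machinery adds it to the machine ClassAd;
#  2. appends a timestamped line to a side log in the glidein work area.
LOGFILE="${GLIDEIN_WORK_DIR:-/tmp}/blackhole_check.log"   # assumed location

# The exact query flags may need adjusting for a glidein startd; the
# htcondor-users post referenced later in this ticket shows a working variant.
value=$(condor_status -direct "$(hostname -f)" -af BLACKHOLE_TRIGGERED_1 2>/dev/null | head -n 1)

echo "BlackholeCheckValue = \"${value:-undefined}\""
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) BLACKHOLE_TRIGGERED_1=${value:-undefined}" >> "$LOGFILE"
```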

NOTE to keep in mind from TJ: Check if there is something in the glidein mechanism that would give us a cron-like place to put a hook that would be better than using STARTD_CRON.

This feature request is related to Redmine Feature R#19214 - Add a configurable limit to the rate of jobs running and fail the glidein if the rate is passed. The notes/description of R#19214, carried over from Redmine, is attached: glideinwms-19214.pdf

mambelli commented 8 months ago

From the discussion at the monthly HTCondor-Fermilab meetings:

3 - Allow history in startd attributes (Marco Mambelli) [In Progress]
Description: this was requested to complete a blackhole protection mechanism using the job rate (https://cdcvs.fnal.gov/redmine/issues/19214). There are open HTCondor tickets.
Not going to make it into 9.0.0; new feature for 9.1.X.
No progress Apr-May.
ZKM: This is now JIRA ticket: https://opensciencegrid.atlassian.net/browse/HTCONDOR-171
Action item on Marco/Gwms: Gather a concrete list of attributes to keep, add to ticket/document. ADDED.
Would like the ability to have history of an attribute that we set within the Glidein.
Currently we have an attribute per slot, BLACKHOLE_TRIGGERED_$(i), or just one if a partitionable slot is used. We want history for these because we want the variable to behave like a trip switch: once set because the jobs went over a certain threshold, it should remain set, even if the pilot stops accepting jobs and the rate drops back below the threshold (the pilot keeps running for a bit, to avoid that a new one starts in its place).
This is calculated using RecentJobBusyTimeAvg and RecentJobBusyTimeCount, but that should not be relevant here.
Action Item on HTC (Dec 2021): Review the attributes listed and update the JIRA/document
AI: Follow up from HTC: https://opensciencegrid.atlassian.net/browse/HTCONDOR-171
July: no progress.
August: work in progress.
September: work still in progress.
October: Marco can shift the original Redmine issue to a GitHub issue so it will be visible to Cole. We already have access to the total number of jobs started, but we do not have the rate. Schedd ads have history, startd ads do not.
December: Cole can now see the Redmine issue; up until now there has mostly just been discussion.
The Glidein checks the machine ClassAd with a periodic check and advertises back to the HTCondor ClassAd to keep a history, so it can look at the previous value of the attribute and trigger the START attribute to go false, or something like that.
Marco: the problem with blackholes is that they churn too many jobs; we want to block them from receiving new jobs. But if we disable the node, the rate of consumption goes down and it will look OK again.
Cole: has discussed with TJ; he has some concerns about keeping history in the startd.
New idea to propose: build a couple of knobs onto the startd cron that would allow a condition to add certain attributes to the machine ad until the lease time on the startd cron is up.
Marco: want the next loop to be able to not change the value if it was set in the previous one. Would it ignore updates during the lease? Cole: yes.
Should follow up in an e-mail loop with TJ and Marco to put more details down so we can look in more detail.
January: Cole sent an e-mail to Marco; haven't followed up yet. TJ mentioned something about a lighter-weight startd cron that he was working on. The goal is to implement an attribute that can be set and then not change, so that using the past value you could see that there was a blackhole. Todd: did you want the history to be able to set an attribute back? Marco: want something that is tripped and remains set. If the rate of starts goes above a certain value, we want to trigger a blackhole flag and leave it set, even when the rate of starts later drops back to zero. Why not have the startd cron living all the time? Whatever it prints to stdout will go into the ClassAds. Here is an example of a startd cron that lives all the time (a loop with a sleep) and queries the startd directly without any collector access:
https://www-auth.cs.wisc.edu/lists/htcondor-users/2022-December/msg00049.shtml
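Along the lines of that post, a minimal sketch of such a long-lived cron script is below. The attribute names, the latching behavior, and the exact condor_status invocation are assumptions for illustration; the post linked above shows an actual working variant.

```bash
#!/bin/bash
# Illustrative long-lived startd cron: loops forever, queries the local startd
# directly (no collector), and re-advertises a "sticky" blackhole flag.
tripped="false"
while true; do
    current=$(condor_status -direct "$(hostname -f)" -af BLACKHOLE_TRIGGERED_1 2>/dev/null | head -n 1)
    # Trip switch: once the flag has been seen true, keep it true even if the
    # underlying value goes back to false because the rate dropped.
    case "$current" in
        [Tt]rue) tripped="true" ;;
    esac
    # Whatever goes to stdout is merged into the machine ClassAd by the startd;
    # a line with a single dash ends the ad so this update is published now.
    echo "GlideinBlackholeLatched = ${tripped}"
    echo "-"
    sleep 300
done
```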
February: no update.
March: no update.
April: Cole did make a bash script that can poll the startd as above; will send it to Marco.
Also, there is a new STARTD configuration option called "latch expressions": set a latch name and, when it evaluates to true, it sets two different startd attributes. Explore the 10.4 docs; it is in there. Marco: interested in trying the latch.
STARTD_LATCH_EXPRS
May: Cole to send out the bash script.
June: script was sent. No update yet.
August 2023: nothing has moved on the JIRA ticket in more than a year.
Marco: Namratha is working on a glideinwms ticket using the startd LATCH expressions now.
Sep 2023: still in progress.
Oct 2023: still in progress.
Nov 2023: still in progress; current plan is to use the LATCH_EXPRS.
Dec 2023: anything move? Still waiting.
mambelli commented 8 months ago

Material about the topic:

Plan of work:

  1. Check the content of v35/19214 (git checkout v35/19214; git difftool 33b28f7e2afccac394b7c1ad3717d00104deb32d)
  2. Start from master and re-apply the changes (only the ones in the meaningful files; the rest can be a separate PR if desired and not already in the code). NOTE: apply the diff above, not the diff from the current master:
    • creation/web_base/condor_startup.sh
    • creation/web_base/condor_vars.lst
    • doc/factory/custom_vars.html
  3. A commit with these initial changes reapplied could be credited to Lorena
  4. Test the partial feature (only elements from the previous branch v35/19214 ported to master)
  5. Add the latch expression (STARTD_LATCH_EXPRS) so that GLIDEIN_BLACKHOLE does not change back once jobs are paused (see the sketch after this list)
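As a very rough sketch of what the last step could look like in the generated HTCondor configuration: the thresholds and helper knob names are invented, the rate condition only mirrors the Recent* statistics mentioned in R#19214, and the exact attributes published by the latch need to be confirmed against the HTCondor >= 10.4 documentation.

```
# Hypothetical sketch; knob names and thresholds are illustrative only.
# Rate condition in the spirit of R#19214, using the Recent* startd statistics:
# many very short jobs in the recent window suggests a blackhole.

# Assumed knobs (not existing GlideinWMS/HTCondor names):
BLACKHOLE_MIN_AVG_JOB_TIME = 60
BLACKHOLE_MAX_RECENT_JOBS = 10

GLIDEIN_BLACKHOLE = (RecentJobBusyTimeCount > $(BLACKHOLE_MAX_RECENT_JOBS)) && \
                    (RecentJobBusyTimeAvg < $(BLACKHOLE_MIN_AVG_JOB_TIME))

# Latch the expression so the startd remembers it became true, even after the
# Glidein stops accepting jobs and the recent rate drops back below threshold.
STARTD_LATCH_EXPRS = $(STARTD_LATCH_EXPRS) GLIDEIN_BLACKHOLE

# The latch publishes companion attributes recording when the expression last
# became true (exact names per the HTCondor >= 10.4 docs); START can then keep
# refusing new jobs once the latch has ever fired, e.g. (name to be confirmed):
# START = $(START) && (GLIDEIN_BLACKHOLETime =?= undefined)
```

In GlideinWMS these lines would presumably be emitted from creation/web_base/condor_startup.sh, with the thresholds flowing through creation/web_base/condor_vars.lst, per steps 1-2 above.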