simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Clarify how wall-clock time is constrained? #35

Open ickc opened 7 months ago

ickc commented 7 months ago

The first part of #14 is partially addressed in commit 7b88aaf, but I still don't understand how it is counted. Copied from #14 below:

> About wall-clock time, when is the 2nd constraint not the same as the first constraint? How is it going to track CPU hours?

The wall-clock time needs to be documented, but I don't think we have all the information to add such a page yet. I envision it being something like https://docs.nersc.gov/jobs/policy/, where wall-clock time is part of the constraints on the computer systems that users should be aware of. Cf. #6.

From your wording, it seems it is somehow measuring how much CPU time the job is using? Shouldn't wall-clock time be exactly 72 hours in the real world, regardless of how the job is configured (or what resources it uses)? I'm guessing this is because "Cgroup-Based Process Tracking" is not done via HTCondor?

rwf14f commented 7 months ago

Yes, the maximum wall-clock time is 72 hours in the real world, regardless of job configuration or used resources, i.e. a job gets killed if RemoteWallClockTime > 72 hours.
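
For concreteness, this kind of limit is typically expressed as a periodic policy expression on the schedd. A minimal sketch, assuming a `SYSTEM_PERIODIC_REMOVE` policy (`RemoteWallClockTime` is a real HTCondor job attribute, in seconds; the exact Blackett configuration is not shown here):

```
# Sketch only: remove any job whose accumulated real-world runtime
# exceeds 72 hours. RemoteWallClockTime is measured in seconds.
SYSTEM_PERIODIC_REMOVE = RemoteWallClockTime > (72 * 3600)
```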

There's also a limit on total used CPU resources (RemoteSysCpu + RemoteUserCpu) to protect the system from improperly configured jobs. HTCondor uses cgroups, but by default only applies soft limits. If a job requests 1 CPU but attempts to use 2, cgroups will ensure that it only uses 1, but only on a fully loaded worker node where all resources are claimed and used by jobs. If there are free CPU resources on the worker node, the job can use 2 CPUs. A job is allowed to use the CPU resources it requests, i.e. wall-clock time × total requested CPUs. If a job uses twice the CPU resources it requests, it gets killed at half the wall-clock time.
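
In other words, the budget is CPU-seconds rather than wall-clock seconds alone: a 1-CPU job gets 72 × 3600 = 259200 CPU-seconds, a 4-CPU job gets four times that, and a job burning CPU at twice its requested rate exhausts the budget in half the wall time. A sketch of how such a limit could be written (real attribute names, hypothetical expression):

```
# Sketch only: also kill a job once its total CPU time (user + system)
# exceeds its requested CPU count times the 72-hour allowance.
SYSTEM_PERIODIC_REMOVE = $(SYSTEM_PERIODIC_REMOVE) || \
    (RemoteUserCpu + RemoteSysCpu) > (RequestCpus * 72 * 3600)
```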

I don't know how HTCondor measures the remote CPU usage of a job, nor whether they have documented it. You might have to check the source code or ask that question on the HTCondor mailing list.
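
For what it's worth, the raw counters can be inspected while a job runs with standard condor_q autoformat options (the job id here is hypothetical):

```
# Print the accumulated CPU and wall-clock attributes of job 1234.0.
condor_q 1234.0 -af RemoteUserCpu RemoteSysCpu RemoteWallClockTime RequestCpus
```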

ickc commented 7 months ago

Thanks. I reopened #10 in light of this.

To continue on this issue: would you consider changing it to a hard limit instead? I understand a job can opportunistically consume more if the node is idle anyway, but it seems quite error-prone to configure a job to do this correctly (i.e. opportunistically consume as much as possible without over-subscribing).

Also, adjusting the wall-clock time according to how much a job consumes seems like a bad choice. From the point of view of the economics of fair-share (i.e. treating fair-share as value and applying economic concepts of value to it), this defeats the purpose of letting a job consume more than it requests. The incentive is to opportunistically take up idle resources that would otherwise be wasted. But if taking advantage of this properly costs me the same fair-share (even ignoring the risk that improper usage leads to over-subscription, which increases the fair-share consumed), then there is no incentive to do it.

Coupling this lack of economic incentive (to opportunistically consume more resources than requested) with a complicated rule that confuses users (i.e. the effective limit is not predictable before job launch, and is not deterministic), I'd say it does more harm than good.

rwf14f commented 7 months ago

The CPU limits have nothing to do with opportunistic use of resources; they are a protection against rogue jobs. Users are not allowed to use more CPU resources than they request. Since jobs are controlled by users, it has happened in the past that users ran software that detected the number of cores automatically and ran on all of them, because the user forgot, or didn't know how, to set the option restricting it to the requested CPU count. The CPU limits are there to terminate such jobs prematurely and prevent them from occupying the resources for too long. Users are told off if they use more CPUs than they request, and banned if they keep doing it. This rarely happens; I don't think we've had any of those jobs in the last few years.

We can look into setting hard limits for CPU resources with cgroups, but I don't know if this is (easily) possible with HTCondor.
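
One candidate, as a sketch rather than a tested configuration: HTCondor's CPU-affinity knob on the execute nodes, which pins a job's processes to its allocated cores and so acts as a hard limit independent of cgroup contention. Whether it plays well with the partitionable slots at Blackett would need to be verified:

```
# Sketch only (execute-node configuration): pin each job's processes
# to the cores it was allocated, so it cannot spill onto idle cores.
ASSIGN_CPU_AFFINITY = True
```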

ickc commented 7 months ago

I understand what you describe. Let's dissect it into 3 different issues:

  1. How to prevent users from accidentally using more resources than they request. See #10. I think setting better defaults for *_NUM_THREADS, equal to the number of physical cores, would help in most cases (see the submit-file sketch after this list). Of course users (and the libraries they use) can still shoot themselves in the foot, and ours alongside it, but a better default would help in the cases where the user did nothing.
  2. How to discover and discourage users from accidentally consuming more than requested (i.e. protect against rogue jobs). This is currently done by reducing the wall-clock time by the ratio of CPU overuse. Since you mentioned this rarely happens, and monitoring is needed anyway (in order to ban these users), why not just stop doing this scaling of wall-clock time? As I said, it makes the limit very hard to communicate to end users and has no obvious advantage. (E.g. how likely is it that such a job is killed early enough to prevent much harm? Even at an overuse ratio of 72, it would already have run for an hour.)
  3. How else to protect against using more CPU than requested with HTCondor? This is not very clear to me either. Just to copy the relevant part of the doc:

    In addition to memory, the condor_starter can also control the total amount of CPU used by all processes within a job. To do this, it writes a value to the cpu.shares attribute of the cgroup cpu controller. The value it writes is copied from the Cpus attribute of the machine slot ClassAd multiplied by 100. Again, like the Memory attribute, this value is fixed for static slots, but dynamic under partitionable slots. This tells the operating system to assign cpu usage proportionally to the number of cpus in the slot. Unlike memory, there is no concept of soft or hard, so this limit only applies when there is contention for the cpu. **That is, on an eight core machine, with only a single, one-core slot running, and otherwise idle, the job running in the one slot could consume all eight cpus concurrently with this limit in play, if it is the only thing running.** If, however, all eight slots were running jobs, with each configured for one cpu, the cpu usage would be assigned equally to each job, regardless of the number of processes or threads in each job. — From Setting Up for Special Environments — HTCondor Manual 23.2.0 documentation

    But the documented behavior in the bolded sentence is not what I observed in #10. I checked that the 9.0.17 manual (the version at Blackett) also has similar text, so it couldn't be due to a newer version of HTCondor. It is not the first time I've seen something documented in HTCondor behave differently than implemented, so perhaps it is a bug.
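
Regarding (1), here is a minimal submit-file sketch of the kind of default I have in mind; the thread-count variables and the value 4 are illustrative only:

```
# Sketch only: request 4 CPUs and pin common threading libraries to the
# same count, instead of letting them autodetect all cores on the node.
request_cpus = 4
environment = "OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=4 MKL_NUM_THREADS=4"
```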

P.S. Only (2) is about this issue (#35), while (1) and (3) are about issue #10 instead.