maxgalli opened this issue 3 years ago
The version difference should not matter. I won't have time to look into it this week, but hopefully I can next week. In the meantime, could you post the output of condor_q -long on one of the jobs that went on hold?
Here it is:
$ condor_q -long 2576278.0
AccountingGroup = "group_u_CMS.u_zh.users.gallim"
Args = "0"
AutoClusterAttrs = "_condor_RequestCpus,_condor_RequestDisk,_condor_RequestMemory,AccountingGroup,BeerJob,BigMemJob,condor,JobUniverse,LastCheckpointPlatform,MachineLastMatchTime,MaxRuntime,MISSING,NumCkpts,PartitionableSlot,Rank,RemoteOwner,RequestCpus,RequestDisk,RequestGPUs,RequestMemory,SendCredential,SlotID,TotalJobRuntime,WantExternalCloud,WantPreempt,x509UserProxyVOName,ConcurrencyLimits,NiceUser,Requirements,ChargeGroup,hdd,ssd,HammerCloudJob,NODE_IS_HEALTHY,InStagedDrain,ShutdownTime,NA61Test,WantIPv6Only,Memory,CvmfsCustomExpression,PreAfsRequirements,PreCvmfsRequirements,PreVerRequirements,VanillaRequirements"
AutoClusterId = 39612
CernFlagDeprecatedSLC6 = false
ChargeGroup = "cms"
ChargeGroupType = "experiment"
ChargeRole = "CMS, local users, users"
ClusterId = 2576278
Cmd = "/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/_htmap_run.py"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CompletionDate = 0
component = 0
CondorPlatform = "$CondorPlatform: X86_64-CentOS_5.11 $"
CondorVersion = "$CondorVersion: 8.9.11 Jan 27 2021 BuildID: UW_Python_Wheel_Build $"
CoreSize = 0
Cpus = "$$(Cpus:1)"
CumulativeRemoteSysCpu = 0.0
CumulativeRemoteUserCpu = 0.0
CumulativeSlotTime = 0
CumulativeSuspensionTime = 0
CurrentHosts = 0
CvmfsCustomExpression = false
DataCentre = "$$(DATACENTRE:meyrin)"
DiskUsage = 12
DiskUsage_RAW = 11
DockerImage = "htcondor/htmap-exec:v0.6.1"
EncryptExecuteDirectory = false
EnteredCurrentStatus = 1614285762
Environment = ""
Err = "/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/logs/0.stderr"
ExecutableSize = 10
ExecutableSize_RAW = 9
ExitBySignal = false
ExitStatus = 0
GlobalJobId = "bigbird19.cern.ch#2576278.0#1614285762"
HEPSPEC = "$$(HEPSPEC:80)"
Hostgroup = "$$(HostGroup:bi/condor/gridworker/share)"
HostLogicalCores = "$$(TotalCpus:8)"
ImageSize = 10
ImageSize_RAW = 9
In = "/dev/null"
IsHTMapJob = true
IsLocalJob = true
Iwd = "/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/output_files"
JOB_HEPSPEC = "$$([isUndefined(TARGET.HEPSPEC_PER_CORE) ? real(10 * TARGET.Cpus) : real(TARGET.HEPSPEC_PER_CORE * TARGET.Cpus)])"
JobAge = (time() - QDate)
JobBatchName = "red-tiny-fact"
JobLeaseDuration = 5400
JobNotification = 0
JobPrio = 0
JobStatus = 1
JobUniverse = 5
KeepClaimIdle = 30
LastSuspensionTime = 0
LeaveJobInQueue = false
MaxHosts = 1
MaxRuntime = 1200
MinHosts = 1
MyType = "Job"
NumCkpts = 0
NumCkpts_RAW = 0
NumJobCompletions = 0
NumJobStarts = 0
NumRestarts = 0
NumSystemHolds = 0
OnExitHold = ExitCode =!= 0
OnExitRemove = true
Out = "/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/logs/0.stdout"
Owner = "gallim"
PeriodicHold = false
PeriodicRelease = false
PeriodicRemove = false
PreAfsRequirements = (OpSysAndVer =?= "CentOS7") && PreVerRequirements
PreCvmfsRequirements = TARGET.HasDocker && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)
PreVerRequirements = (NODE_MOUNTS_CVMFS || CvmfsCustomExpression) && PreCvmfsRequirements
ProcId = 0
QDate = 1614285762
Rank = 0.0
RemoteSysCpu = 0.0
RemoteUserCpu = 0.0
RemoteWallClockTime = 0.0
RequestCpus = 1
RequestDisk = 1048576
RequestMemory = 2000
Requirements = (TARGET.Hostgroup =?= "bi/condor/gridworker/share/mixed" || TARGET.Hostgroup =?= "bi/condor/gridworker/shareshort" || TARGET.Hostgroup =?= "bi/condor/gridworker/share/singularity" || TARGET.Hostgroup =?= "bi/condor/gridworker/sharelong") && VanillaRequirements
RootDir = "/"
ServerTime = 1614285777
ShouldTransferFiles = "YES"
StreamErr = false
StreamOut = false
SubmitEventNotes = "0"
TargetType = "Machine"
TotalSubmitProcs = 10
TotalSuspensions = 0
TransferIn = false
TransferInput = "/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/func,/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/inputs/0.in"
TransferInputSizeMB = 0
TransferOutput = "_htmap_transfer/,_htmap_user_transfer/0"
TransferOutputRemaps = " 0.out=/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/outputs/0.out "
User = "gallim@cern.ch"
UserLog = "/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/events"
VanillaRequirements = (TARGET.NODE_MOUNTS_AFS =?= true) && PreAfsRequirements
WantCheckpoint = false
WantDocker = true
WantRemoteIO = true
WantRemoteSyscalls = false
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
Hello,
Still working on trying to replicate your crash. As for the jobs remaining idle, can you try running condor_q -better-analyze on one of your HTMap jobs? What does that say?
Sure, here it is:
$ condor_q -better-analyze 2592991.0
-- Schedd: bigbird19.cern.ch : <188.185.114.121:9618?...
The Requirements expression for job 2592991.000 is
(TARGET.Hostgroup is "bi/condor/gridworker/share/mixed" || TARGET.Hostgroup is "bi/condor/gridworker/shareshort" || TARGET.Hostgroup is "bi/condor/gridworker/share/singularity" || TARGET.Hostgroup is "bi/condor/gridworker/sharelong") && VanillaRequirements
Job 2592991.000 defines the following attributes:
CvmfsCustomExpression = false
PreAfsRequirements = (OpSysAndVer =?= "CentOS7") && PreVerRequirements
PreCvmfsRequirements = TARGET.HasDocker && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)
PreVerRequirements = (NODE_MOUNTS_CVMFS || CvmfsCustomExpression) && PreCvmfsRequirements
RequestDisk = 1048576
RequestMemory = 2000
VanillaRequirements = (TARGET.NODE_MOUNTS_AFS =?= true) && PreAfsRequirements
The Requirements expression for job 2592991.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 153611 TARGET.Hostgroup is "bi/condor/gridworker/share/mixed"
[1] 3662 TARGET.Hostgroup is "bi/condor/gridworker/shareshort"
[2] 157273 [0] || [1]
[5] 1099 TARGET.Hostgroup is "bi/condor/gridworker/sharelong"
[6] 158372 [2] || [5]
[7] 21708 VanillaRequirements
[8] 19028 [6] && [7]
2592991.000: Job is held.
Hold reason: Failed to initialize user log to /afs/cern.ch/user/g/gallim/.htmap/maps/41892ce4-f5bb-4f56-8990-1415fcae5d1e/events or /dev/null
Last successful match: Wed Mar 3 09:15:09 2021
2592991.000: Run analysis summary ignoring user priority. Of 16523 machines,
3364 are rejected by your job's requirements
676 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
12483 are able to run your job
It looks like your job is going on hold because HTCondor couldn't create the user (job) log, and JobEventLog is failing to initialize because it couldn't read that log. I notice that your .htmap directory is in AFS; can you try setting HTMAP_DIR to somewhere on local disk instead? You can either set HTMAP_DIR in the environment before starting your program, or put

HTMAP_DIR = "/path/to/htmap/dir"

at the beginning of ~/.htmaprc.
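For example, a minimal sketch in Python (the /tmp path is illustrative, and it assumes HTMap reads HTMAP_DIR from the environment at import time and exposes it under the same settings key):

import os

# Must be set before htmap is imported, since the setting is read at startup.
# The path is illustrative; any local, non-AFS directory should work.
os.environ["HTMAP_DIR"] = "/tmp/htmap"

import htmap

print(htmap.settings["HTMAP_DIR"])  # confirm the setting took effect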
I set HTMAP_DIR to /tmp/gallim/htmap, but I get the same result: the same error is raised, and the jobs are submitted but stay idle.
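(As a sanity check, one could confirm where the new jobs' event log actually points; a rough sketch with the htcondor bindings, where the Owner constraint and the UserLog attribute are taken from the condor_q -long dump above:)

import htcondor

schedd = htcondor.Schedd()
# UserLog is the attribute that held the AFS path in the earlier dump;
# the keyword is `projection` in recent bindings (older ones call it `attr_list`).
for ad in schedd.query(constraint='Owner == "gallim"', projection=["ClusterId", "UserLog"]):
    print(ad.get("ClusterId"), ad.get("UserLog"))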
Interesting. I'm out next week, but I'll take another look once I get back. In the meantime, though: are your HTMap jobs idle, or going on hold? The -better-analyze output from above (condor_q -better-analyze 2592991.0) shows a job that is on hold. If you have an HTMap job that is idle, can you run condor_q -better-analyze on that job? If some of your jobs go on hold and others just remain idle, then we might be dealing with two different errors.
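(To separate the two cases, something like this rough sketch with the htcondor bindings may help; JobStatus codes 1 and 5 are HTCondor's standard Idle and Held:)

import htcondor

schedd = htcondor.Schedd()
ads = schedd.query(
    constraint='Owner == "gallim"',
    projection=["ClusterId", "ProcId", "JobStatus", "HoldReason"],
)
idle = [ad for ad in ads if ad["JobStatus"] == 1]  # 1 = Idle
held = [ad for ad in ads if ad["JobStatus"] == 5]  # 5 = Held
print(f"{len(idle)} idle, {len(held)} held")
for ad in held:
    print(ad["ClusterId"], ad["ProcId"], ad.get("HoldReason"))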
Hi,
I'm trying to run the following very basic example:
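Something along these lines, following the HTMap quickstart pattern (the double function and its inputs here are illustrative, not necessarily the exact original code):

import htmap

def double(x):
    return 2 * x

doubled = htmap.map(double, range(10))
print(list(doubled))  # iterating the map waits for the jobs and collects their outputs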
but I always get the following error:
The jobs seem to be submitted and kept on hold.
Do you know what the problem could be? If you think it is something related to the configuration of HTCondor itself on our batch system, do you have any suggestions for what I could ask the administrators? Looking at the installed packages, there seems to be an inconsistency between the version of HTCondor on the cluster (8.8.7) and the Python bindings I'm using (8.9.11). Do you think this might cause the problem?
Thank you,
Massimiliano
Software Versions:
HTCondor (batch system): 8.8.7
HTCondor Python bindings: 8.9.11
htmap-exec Docker image: v0.6.1
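(For reference, a short sketch to pin these down from Python; htcondor.version() reports the bindings' own HTCondor version, and the Collector lookup reads the schedd's CondorVersion attribute, assuming the standard pool configuration:)

from importlib.metadata import version

import htcondor

print(htcondor.version())  # version of the local Python bindings
print(version("htmap"))    # installed HTMap package version

schedd_ad = htcondor.Collector().locate(htcondor.DaemonTypes.Schedd)
print(schedd_ad.get("CondorVersion"))  # version the schedd itself is running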