htcondor / htmap

High-Throughput Computing in Python, powered by HTCondor
https://htmap.readthedocs.io
Apache License 2.0

JobEventLog not initialized when running simple map #234

Open maxgalli opened 3 years ago

maxgalli commented 3 years ago

Hi,

I'm trying to run the following very basic example:

import htmap

def double(x):
    return 2 * x

def main():
    doubled = list(htmap.map(double, range(10)))
    print(doubled)

if __name__ == "__main__":
    main()

but I always get the following error:

Traceback (most recent call last):
  File "base.py", line 11, in <module>
    main()
  File "base.py", line 7, in main
    doubled = list(htmap.map(double, range(10)))
  File "/afs/cern.ch/work/g/gallim/anaconda3/envs/my_messy_root/lib/python3.7/site-packages/htmap/maps.py", line 413, in __iter__
    yield from self.iter()
  File "/afs/cern.ch/work/g/gallim/anaconda3/envs/my_messy_root/lib/python3.7/site-packages/htmap/maps.py", line 427, in iter
    yield self._load_output(component, timeout=timeout)
  File "/afs/cern.ch/work/g/gallim/anaconda3/envs/my_messy_root/lib/python3.7/site-packages/htmap/maps.py", line 342, in _load_output
    self._wait_for_component(component, timeout)
  File "/afs/cern.ch/work/g/gallim/anaconda3/envs/my_messy_root/lib/python3.7/site-packages/htmap/maps.py", line 298, in _wait_for_component
    component_status = self.component_statuses[component]
  File "/afs/cern.ch/work/g/gallim/anaconda3/envs/my_messy_root/lib/python3.7/site-packages/htmap/maps.py", line 561, in component_statuses
    return self._state.component_statuses
  File "/afs/cern.ch/work/g/gallim/anaconda3/envs/my_messy_root/lib/python3.7/site-packages/htmap/state.py", line 89, in component_statuses
    self._read_events()
  File "/afs/cern.ch/work/g/gallim/anaconda3/envs/my_messy_root/lib/python3.7/site-packages/htmap/state.py", line 115, in _read_events
    self._event_reader = htcondor.JobEventLog(self._event_log_path.as_posix()).events(0)
htcondor.HTCondorIOError: JobEventLog not initialized.  Check the debug log, looking for ReadUserLog or FileModifiedTrigger.  (Or call htcondor.enable_debug() and try again.)

The jobs seem to be submitted but are kept on hold.

Do you know what the problem could be? If you think it is related to the configuration of HTCondor itself on our batch system, do you have any suggestions for what I could ask the administrators? Looking at the package versions, there seems to be an inconsistency between the version of Condor (8.8.7) and the Python bindings that I'm using (8.9.11). Do you think this might cause the problem?

Thank you,

Massimiliano

Software Versions:

$ python -c "import htcondor, htmap; print(htcondor.version()); print(htmap.version())"
$CondorVersion: 8.9.11 Jan 27 2021 BuildID: UW_Python_Wheel_Build $
HTMap version 0.6.1
$ condor_version
$CondorVersion: 8.8.7 Dec 24 2019 BuildID: 493225 PackageID: 8.8.7-1 $
$CondorPlatform: x86_64_RedHat7 $
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="http://cern.ch/linux/"
BUG_REPORT_URL="http://cern.ch/linux/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
$ cat /proc/version
Linux version 3.10.0-1160.11.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Fri Dec 18 16:34:56 UTC 2020
matyasselmeci commented 3 years ago

The version difference should not matter. I won't have time to look into it this week but hopefully I can next week. In the meantime, could you add the output of condor_q -long on one of the jobs that went on hold?

maxgalli commented 3 years ago

Here it is:

$ condor_q -long 2576278.0
AccountingGroup = "group_u_CMS.u_zh.users.gallim"
Args = "0"
AutoClusterAttrs = "_condor_RequestCpus,_condor_RequestDisk,_condor_RequestMemory,AccountingGroup,BeerJob,BigMemJob,condor,JobUniverse,LastCheckpointPlatform,MachineLastMatchTime,MaxRuntime,MISSING,NumCkpts,PartitionableSlot,Rank,RemoteOwner,RequestCpus,RequestDisk,RequestGPUs,RequestMemory,SendCredential,SlotID,TotalJobRuntime,WantExternalCloud,WantPreempt,x509UserProxyVOName,ConcurrencyLimits,NiceUser,Requirements,ChargeGroup,hdd,ssd,HammerCloudJob,NODE_IS_HEALTHY,InStagedDrain,ShutdownTime,NA61Test,WantIPv6Only,Memory,CvmfsCustomExpression,PreAfsRequirements,PreCvmfsRequirements,PreVerRequirements,VanillaRequirements"
AutoClusterId = 39612
CernFlagDeprecatedSLC6 = false
ChargeGroup = "cms"
ChargeGroupType = "experiment"
ChargeRole = "CMS, local users, users"
ClusterId = 2576278
Cmd = "/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/_htmap_run.py"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CompletionDate = 0
component = 0
CondorPlatform = "$CondorPlatform: X86_64-CentOS_5.11 $"
CondorVersion = "$CondorVersion: 8.9.11 Jan 27 2021 BuildID: UW_Python_Wheel_Build $"
CoreSize = 0
Cpus = "$$(Cpus:1)"
CumulativeRemoteSysCpu = 0.0
CumulativeRemoteUserCpu = 0.0
CumulativeSlotTime = 0
CumulativeSuspensionTime = 0
CurrentHosts = 0
CvmfsCustomExpression = false
DataCentre = "$$(DATACENTRE:meyrin)"
DiskUsage = 12
DiskUsage_RAW = 11
DockerImage = "htcondor/htmap-exec:v0.6.1"
EncryptExecuteDirectory = false
EnteredCurrentStatus = 1614285762
Environment = ""
Err = "/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/logs/0.stderr"
ExecutableSize = 10
ExecutableSize_RAW = 9
ExitBySignal = false
ExitStatus = 0
GlobalJobId = "bigbird19.cern.ch#2576278.0#1614285762"
HEPSPEC = "$$(HEPSPEC:80)"
Hostgroup = "$$(HostGroup:bi/condor/gridworker/share)"
HostLogicalCores = "$$(TotalCpus:8)"
ImageSize = 10
ImageSize_RAW = 9
In = "/dev/null"
IsHTMapJob = true
IsLocalJob = true
Iwd = "/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/output_files"
JOB_HEPSPEC = "$$([isUndefined(TARGET.HEPSPEC_PER_CORE) ? real(10 * TARGET.Cpus) : real(TARGET.HEPSPEC_PER_CORE * TARGET.Cpus)])"
JobAge = (time() - QDate)
JobBatchName = "red-tiny-fact"
JobLeaseDuration = 5400
JobNotification = 0
JobPrio = 0
JobStatus = 1
JobUniverse = 5
KeepClaimIdle = 30
LastSuspensionTime = 0
LeaveJobInQueue = false
MaxHosts = 1
MaxRuntime = 1200
MinHosts = 1
MyType = "Job"
NumCkpts = 0
NumCkpts_RAW = 0
NumJobCompletions = 0
NumJobStarts = 0
NumRestarts = 0
NumSystemHolds = 0
OnExitHold = ExitCode =!= 0
OnExitRemove = true
Out = "/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/logs/0.stdout"
Owner = "gallim"
PeriodicHold = false
PeriodicRelease = false
PeriodicRemove = false
PreAfsRequirements = (OpSysAndVer =?= "CentOS7") && PreVerRequirements
PreCvmfsRequirements = TARGET.HasDocker && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)
PreVerRequirements = (NODE_MOUNTS_CVMFS || CvmfsCustomExpression) && PreCvmfsRequirements
ProcId = 0
QDate = 1614285762
Rank = 0.0
RemoteSysCpu = 0.0
RemoteUserCpu = 0.0
RemoteWallClockTime = 0.0
RequestCpus = 1
RequestDisk = 1048576
RequestMemory = 2000
Requirements = (TARGET.Hostgroup =?= "bi/condor/gridworker/share/mixed" || TARGET.Hostgroup =?= "bi/condor/gridworker/shareshort" || TARGET.Hostgroup =?= "bi/condor/gridworker/share/singularity" || TARGET.Hostgroup =?= "bi/condor/gridworker/sharelong") && VanillaRequirements
RootDir = "/"
ServerTime = 1614285777
ShouldTransferFiles = "YES"
StreamErr = false
StreamOut = false
SubmitEventNotes = "0"
TargetType = "Machine"
TotalSubmitProcs = 10
TotalSuspensions = 0
TransferIn = false
TransferInput = "/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/func,/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/inputs/0.in"
TransferInputSizeMB = 0
TransferOutput = "_htmap_transfer/,_htmap_user_transfer/0"
TransferOutputRemaps = " 0.out=/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/outputs/0.out "
User = "gallim@cern.ch"
UserLog = "/afs/cern.ch/user/g/gallim/.htmap/maps/5f0cd7ed-92b4-4487-b4ee-38d7b8dddb24/events"
VanillaRequirements = (TARGET.NODE_MOUNTS_AFS =?= true) && PreAfsRequirements
WantCheckpoint = false
WantDocker = true
WantRemoteIO = true
WantRemoteSyscalls = false
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
matyasselmeci commented 3 years ago

Hello, still working on trying to replicate your crash. As for the jobs remaining idle, can you try running condor_q -better-analyze on one of your HTMap jobs? What does that say?

maxgalli commented 3 years ago

Sure, here it is:

$ condor_q -better-analyze 2592991.0

-- Schedd: bigbird19.cern.ch : <188.185.114.121:9618?...
The Requirements expression for job 2592991.000 is

    (TARGET.Hostgroup is "bi/condor/gridworker/share/mixed" || TARGET.Hostgroup is "bi/condor/gridworker/shareshort" || TARGET.Hostgroup is "bi/condor/gridworker/share/singularity" ||
      TARGET.Hostgroup is "bi/condor/gridworker/sharelong") && VanillaRequirements

Job 2592991.000 defines the following attributes:

    CvmfsCustomExpression = false
    PreAfsRequirements = (OpSysAndVer =?= "CentOS7") && PreVerRequirements
    PreCvmfsRequirements = TARGET.HasDocker && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)
    PreVerRequirements = (NODE_MOUNTS_CVMFS || CvmfsCustomExpression) && PreCvmfsRequirements
    RequestDisk = 1048576
    RequestMemory = 2000
    VanillaRequirements = (TARGET.NODE_MOUNTS_AFS =?= true) && PreAfsRequirements

The Requirements expression for job 2592991.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]      153611  TARGET.Hostgroup is "bi/condor/gridworker/share/mixed"
[1]        3662  TARGET.Hostgroup is "bi/condor/gridworker/shareshort"
[2]      157273  [0] || [1]
[5]        1099  TARGET.Hostgroup is "bi/condor/gridworker/sharelong"
[6]      158372  [2] || [5]
[7]       21708  VanillaRequirements
[8]       19028  [6] && [7]

2592991.000:  Job is held.

Hold reason: Failed to initialize user log to /afs/cern.ch/user/g/gallim/.htmap/maps/41892ce4-f5bb-4f56-8990-1415fcae5d1e/events or /dev/null

Last successful match: Wed Mar  3 09:15:09 2021

2592991.000:  Run analysis summary ignoring user priority.  Of 16523 machines,
   3364 are rejected by your job's requirements
    676 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
  12483 are able to run your job
matyasselmeci commented 3 years ago

It looks like your job is going on hold because HTCondor couldn't create the user (job) log, and JobEventLog is failing to initialize because it couldn't read that log. I notice that your .htmap directory is in AFS; can you try setting HTMAP_DIR to somewhere on local disk instead? You can either set HTMAP_DIR in the environment before starting your program, or set

HTMAP_DIR = "/path/to/htmap/dir"

at the beginning of ~/.htmaprc.
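
For example, a minimal sketch of the environment-variable route, assuming htmap picks up HTMAP_DIR when its settings are first loaded (the path below is only a placeholder):

import os

# Point HTMAP_DIR at local disk *before* htmap is imported, so the map
# directory (and the job event log inside it) is not created on AFS.
# Exporting HTMAP_DIR in the shell before launching Python has the same effect.
os.environ["HTMAP_DIR"] = "/path/to/htmap/dir"

import htmap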

maxgalli commented 3 years ago

I set HTMAP_DIR to /tmp/gallim/htmap but I get the same result (the same error is raised and the jobs are submitted but remain idle).

matyasselmeci commented 3 years ago

Interesting. I'm out next week but I'll take another look once I get back. In the meantime though, are your HTMap jobs idle or going on hold? The -better-analyze output from above (condor_q -better-analyze 2592991.0) shows a job that is on hold. If you have an HTMap job that is idle, can you run condor_q -better-analyze on that job? If some of your jobs go on hold and others just remain idle, then we might be dealing with two different errors.
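
If it helps while checking that, here is a rough sketch (using the htcondor Python bindings, not anything HTMap itself provides) that lists every HTMap job in the queue together with its status and hold reason; IsHTMapJob is the attribute visible in the condor_q -long output above, and JobStatus 1 means idle while 5 means held:

import htcondor

# Query the local schedd for HTMap jobs only and print their status.
schedd = htcondor.Schedd()
ads = schedd.query(
    constraint="IsHTMapJob == true",
    projection=["ClusterId", "ProcId", "JobStatus", "HoldReason"],
)
for ad in ads:
    print(ad.get("ClusterId"), ad.get("ProcId"), ad.get("JobStatus"), ad.get("HoldReason"))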