i-VRESSE / bartender

Middleware web service to schedule jobs on various infrastructures
https://i-vresse-bartender.readthedocs.io
Apache License 2.0

Dirac test fail 71 #72

Closed sverhoeven closed 1 year ago

sverhoeven commented 1 year ago

Fixes #71

When a job submitted to the DIRAC scheduler fails, its logs can be requested from the /job//{jobid}/stdout and /job//{jobid}/stderr endpoints.

The input.tar and output.tar files are now staged via the job description language (JDL).
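To illustrate what staging the archives through JDL could look like, here is a minimal sketch that renders a JDL document from Python. It is not the code in this PR: the function name `build_jdl`, the script name `job.sh`, the input.tar location, and the choice of attributes are assumptions for illustration. The attribute names themselves (InputData, OutputData, OutputSE, OutputSandbox, and so on) are standard DIRAC JDL attributes.

```python
from textwrap import dedent


def build_jdl(
    job_name: str,
    script: str,
    input_lfn: str,
    output_lfn: str,
    storage_element: str,
) -> str:
    """Render a JDL document that stages input.tar in and output.tar out.

    Hypothetical helper: the real scheduler may use other attributes.
    """
    return dedent(
        f"""\
        JobName = "{job_name}";
        Executable = "{script}";
        InputSandbox = {{ "{script}" }};
        InputData = {{ "LFN:{input_lfn}" }};
        OutputData = {{ "LFN:{output_lfn}" }};
        OutputSE = "{storage_element}";
        StdOutput = "jobstdout.txt";
        StdError = "jobstderr.txt";
        OutputSandbox = {{ "jobstdout.txt", "jobstderr.txt" }};
        """
    )


# Example values; the input.tar path is assumed to sit next to output.tar.
example_jdl = build_jdl(
    job_name="job1",
    script="job.sh",
    input_lfn="/tutoVO/user/c/ciuser/bartenderjobs/job1/input.tar",
    output_lfn="/tutoVO/user/c/ciuser/bartenderjobs/job1/output.tar",
    storage_element="StorageElementOne",
)
```

With this layout DIRAC downloads input.tar before the payload starts and tries to upload output.tar after it finishes, so the job script itself only has to unpack and pack the archives.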

sverhoeven commented 1 year ago

Job failed uploading outputs

```shell
dirac-wms-job-logging-info 1
Source                   Status     MinorStatus                       ApplicationStatus DateTime
=================================================================================================================
JobManager               Received   Job accepted                      Unknown           2023-06-06 15:43:12
JobPath                  Checking   JobSanity                         Unknown           2023-06-06 15:43:12
JobSanity                Checking   InputData                         Unknown           2023-06-06 15:43:12
InputData                Checking   JobScheduling                     Unknown           2023-06-06 15:43:12
JobScheduling            Waiting    Pilot Agent Submission            Unknown           2023-06-06 15:43:12
Matcher                  Matched    Assigned                          Unknown           2023-06-06 15:45:47
JobAgent@MyGrid.Site1.uk Matched    Job Received by Agent             Unknown           2023-06-06 15:45:47
JobAgent@MyGrid.Site1.uk Matched    Submitting To CE                  Unknown           2023-06-06 15:45:47
JobWrapper               Running    Job Initialization                Unknown           2023-06-06 15:45:48
JobWrapper               Running    Downloading InputSandbox          Unknown           2023-06-06 15:45:48
JobWrapper               Running    Input Data Resolution             Unknown           2023-06-06 15:45:48
JobWrapper               Running    Application                       Unknown           2023-06-06 15:45:49
JobWrapper               Completing Application Finished Successfully Unknown           2023-06-06 15:45:54
JobWrapper               Completing Uploading Output Sandbox          Unknown           2023-06-06 15:45:54
JobWrapper               Completing Uploading Output Data             Unknown           2023-06-06 15:45:54
JobWrapper               Completing Output Data Uploaded              Unknown           2023-06-06 15:45:55
JobWrapper               Failed     Uploading Outputs                 Unknown           2023-06-06 15:45:55
```

```shell
cat ~diracpilot/localsite/output/764C2538.out
...
Attempting dm.putAndRegister ('/tutoVO/user/c/ciuser/bartenderjobs/job1/output.tar','/home/diracpilot/shared/work/764C2538/DIRAC_HVplljpilot/1/output.tar','StorageElementOne',guid='13FA99AA-71FB-67E3-2797-1BDE838B3613',catalog='[]', checksum = 'cd489abb')
Error sending accounting record URL for service Accounting/DataStore not found
Error sending accounting record URL for service Accounting/DataStore not found
dm.putAndRegister successfully uploaded and registered output.tar to StorageElementOne
2023-06-06 15:45:55 UTC None/[1]JobWrapper INFO: "output.tar" successfully uploaded to "StorageElementOne" as "LFN:/tutoVO/user/c/ciuser/bartenderjobs/job1/output.tar"
JobWrapper raised exception while processing output files
Traceback (most recent call last):
  File "/home/diracpilot/shared/work/764C2538/DIRAC_HVplljpilot/job/Wrapper/Wrapper_1", line 209, in execute
    result = job.processJobOutputs()
  File "/home/diracpilot/shared/work/764C2538/DIRAC_HVplljpilot/diracos/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/JobWrapper/JobWrapper.py", line 873, in processJobOutputs
    if not result_sbUpload["OK"]:
UnboundLocalError: local variable 'result_sbUpload' referenced before assignment
2023-06-06 15:45:55 UTC None/[1]JobWrapper INFO: EXECUTION_RESULT[CPU] in sendJobAccounting 0.00 0.01 0.00 0.00 0.02
Error sending accounting record URL for service Accounting/DataStore not found
2023-06-06 15:45:55 UTC WorkloadManagement/JobAgent/InProcess WARN: Fail in payload execution
2023-06-06 15:45:55 UTC WorkloadManagement/JobAgent/InProcess INFO: Exit status: 2
2023-06-06 15:45:55 UTC WorkloadManagement/JobAgent/WorkloadManagement/JobAgent INFO: Job submitted (DIRAC JobID: 1; Batch ID: None
2023-06-06 15:46:05 UTC WorkloadManagement/JobAgent/WorkloadManagement/JobAgent INFO: JobAgent will stop with message "Payload execution failed with error code 2", execution complete.
...
```

sverhoeven commented 1 year ago

If the job command exits non-zero, output.tar is not uploaded. However, we want the stdout.txt and stderr.txt inside output.tar to debug the error. It would also be nice to get the other output files inside output.tar. See https://github.com/DIRACGrid/DIRAC/blob/7df4ff26dd936eca8b30bdebcc78cd1509122de4/src/DIRAC/WorkloadManagementSystem/JobWrapper/JobWrapper.py#L820-L822

Possible fixes:

1. Capture command log

When the return code is non-zero, cat stderr.txt and stdout.txt to stderr so they end up in jobstderr.txt. Then, when self.state(jobid) is failed:

  1. download the output sandbox
  2. get the return code from DIRAC
  3. pack jobstderr.txt + return code into output.tar
  4. upload output.tar to where fs.upload expects it

Cons:
    • job output is lost
    • mixes filesystem methods in scheduler

Pros:
    • log is retained
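Step 3 of option 1 could be sketched roughly as follows, using only the Python standard library. The function name `pack_failure_archive` and the archive member name `returncode` are hypothetical; downloading the sandbox and asking DIRAC for the return code (steps 1 and 2) are assumed to have happened already.

```python
import io
import tarfile
from pathlib import Path


def pack_failure_archive(sandbox_dir: Path, returncode: int, archive: Path) -> None:
    """Pack jobstderr.txt and the return code into a tar archive.

    sandbox_dir is assumed to hold the already downloaded output sandbox.
    """
    with tarfile.open(archive, "w") as tar:
        # Copy the captured log from the sandbox into the archive.
        tar.add(sandbox_dir / "jobstderr.txt", arcname="jobstderr.txt")
        # Store the return code as a small text member ("returncode" is a made-up name).
        payload = f"{returncode}\n".encode()
        info = tarfile.TarInfo(name="returncode")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
```

The resulting archive would then be placed where fs.upload expects output.tar, which is the part that mixes filesystem concerns into the scheduler.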

2. Swallow return code of command

Cons:

3. Fail job without log or output

Pros:

4. Use output sandbox for all output

5. Change dirac code

Make it upload output.tar even when the job has failed.

6. Upload output.tar in job script

Instead of using OutputData in jdl.

Con:

sverhoeven commented 1 year ago

7. AbstractScheduler.logs(job_id, job_dir) to get stdout and stderr

Echo command log to jobstdout.txt

Pros:

Cons:
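As a rough illustration of option 7, the sketch below adds a `logs(job_id, job_dir)` method to an abstract scheduler and implements it for a scheduler whose output sandbox has already been downloaded. Only the method name and its two arguments come from this thread. The return type, the synchronous signature, the `SandboxScheduler` class, and its `sandbox_root` attribute are assumptions for illustration.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Tuple


class AbstractScheduler(ABC):
    @abstractmethod
    def logs(self, job_id: str, job_dir: Path) -> Tuple[str, str]:
        """Return the (stdout, stderr) text of a job."""


class SandboxScheduler(AbstractScheduler):
    """Hypothetical scheduler that reads logs from a downloaded output sandbox."""

    def __init__(self, sandbox_root: Path) -> None:
        # Directory holding one sub-directory per job id (assumed layout).
        self.sandbox_root = sandbox_root

    def logs(self, job_id: str, job_dir: Path) -> Tuple[str, str]:
        # job_dir is unused here; a scheduler that writes logs straight into
        # the job directory would read them from job_dir instead.
        sandbox = self.sandbox_root / job_id
        stdout = (sandbox / "jobstdout.txt").read_text()
        stderr = (sandbox / "jobstderr.txt").read_text()
        return stdout, stderr
```

Keeping log retrieval behind the scheduler interface means the web service can serve stdout and stderr the same way for every scheduler, without depending on output.tar having been uploaded.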

sverhoeven commented 1 year ago

Went for option 7