isi-vista / vista-pegasus-wrapper

A higher-level API for ISI Pegasus, adapted to the quirks of the ISI Vista group
MIT License

Write job logs to the experiment directory #26

Closed gabbard closed 4 years ago

gabbard commented 4 years ago

We currently try to do this here:

https://github.com/isi-vista/vista-pegasus-wrapper/blob/master/pegasus_wrapper/workflow.py#L205

but the output is always empty.

hassanzadeh commented 4 years ago

One observation: if I create a Slurm job using the parameters from the DAX file and submit it, the log file is created, so my guess is that the issue is not in the pegasus-wrapper but in the pegasus-run executable. Perhaps pegasus-run cannot understand the --output parameter; another possibility is that the file remains open while the job tries to open it, and the open fails. I wanted to test each of these; however, Pegasus is still not working for me: the Pegasus job remains in the queue forever.

hassanzadeh commented 4 years ago

This issue turns out not to come from the wrapper itself. Here is the evidence:

  1. The parameters in the DAX file appear to be correct.
  2. When I check the submitted Slurm jobs and create independent Slurm jobs matching the original completed ones (i.e. the ones submitted by pegasus-run), then submit them, the logs are generated.

Therefore, the issue is not the wrapper; it is more likely the Pegasus engine. Ideally, someone with direct SSH access to the compute nodes should check the job script (whose path can be found using scontrol show job #id) to make sure that there is nothing wrong with it.

Overall, I think the reason the logs are not redirected to the output file has to do with pegasus-run, which manages these files. Perhaps under the hood Pegasus opens the log files in write mode, which leads to loss of the data. It might also be that Pegasus redirects stdout somewhere else, so only an empty log file is created.
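
The write-mode hypothesis is easy to illustrate in isolation. The sketch below is not Pegasus code; the file name just mirrors the wrapper's `___stdout.log` convention:

```python
import os
import tempfile

# Hypothetical log path; stands in for a wrapper log like ___stdout.log.
path = os.path.join(tempfile.mkdtemp(), "___stdout.log")

# The job writes its output...
with open(path, "a") as f:
    f.write("job output line\n")

# ...but if another process later re-opens the file in write ("w") mode,
# the file is truncated and the earlier output is lost.
with open(path, "w"):
    pass

print(os.path.getsize(path))  # -> 0: the log ends up empty
```

If pegasus-run (or a monitoring daemon) does something like this after the job finishes, an empty log would be exactly the symptom we see.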

gabbard commented 4 years ago

> Therefore, the issue is not the wrapper; it is more likely the Pegasus engine. Ideally, someone with direct SSH access to the compute nodes should check the job script (whose path can be found using scontrol show job #id) to make sure that there is nothing wrong with it.

Please write up, step by step, exactly the commands you would like someone with SSH access to run, and I will see about getting it done.

> Overall, I think the reason the logs are not redirected to the output file has to do with pegasus-run, which manages these files. Perhaps under the hood Pegasus opens the log files in write mode, which leads to loss of the data. It might also be that Pegasus redirects stdout somewhere else, so only an empty log file is created.

The source code of Pegasus is here. You should be able to examine it and determine exactly what it is doing when it runs a job.

hassanzadeh commented 4 years ago

Steps to run a simple pipeline:

  1. Create a workflow. For example, for the simple pipeline in the scripts (note: I edited it a bit in this pull), I run: `python -m pegasus_wrapper.scripts.example_workflow_builder parameters/root.params`

where the parameters can be something like:

```
dir: "%project_root%/experiments"
conda_base_path: /nas/home/hhasan/miniconda3/
conda_environment: "event-gpu-py36"
slurm:
    partition: "gaia"
    account: "gaia"
    num_cpus: 1
    num_gpus: 0
    memory: "1G"
```

2. The above command will create an experiment directory. I then create a pipeline and run it with the following commands (on saga03):

```
ssh saga03
pegasus-plan --conf pegasus.conf --dax Test.dax --dir $HOME/run
pegasus-run /nas/home/hhasan/run/hhasan/pegasus/Test/runxxx
```

3. This will generate two job directories, one for each job (i.e. /working/jobs/multiply and /working/jobs/sort). The log file inside each directory (e.g. /working/jobs/multiply/___stdout.log) remains empty even after the job finishes.

Note that if I take the scripts created by the wrapper (e.g. /working/jobs/multiply/___run.sh) and create a Slurm job from them myself, the output is written to the ___stdout.log file, but with pegasus-run it is not.
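
A quick way to check that symptom across all job directories at once. This is a sketch, not wrapper code; the `empty_logs` helper and the `working/jobs` layout are assumptions based on the paths above:

```python
from pathlib import Path

def empty_logs(jobs_dir: str) -> list[str]:
    """Return the paths of ___stdout.log files that are empty.

    jobs_dir is the working/jobs directory that contains one
    subdirectory per job (e.g. multiply/, sort/).
    """
    return [
        str(p)
        for p in Path(jobs_dir).glob("*/___stdout.log")
        if p.stat().st_size == 0
    ]
```

Running `empty_logs("/working/jobs")` after a pegasus-run submission should list both job logs if the bug reproduces, and nothing after a manual sbatch of the same ___run.sh scripts.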

I will try to see if I can spot the issue in the Pegasus source code.

gabbard commented 4 years ago

@hassanzadeh : Your directions tell me how to run a pipeline, but they don't tell me what you want me to do with my admin-ssh powers. What are you trying to find out?

hassanzadeh commented 4 years ago

Then, once you submit the job:

  1. Run `watch -n 1 squeue -a -u $USER` to get the IDs of the submitted jobs.

  2. Once you have the job ID, you can see its status with: `scontrol show job #jobid`

You should see something like the following, where the Command field shows where the script is.


```
   UserId=mdehaven(7631) GroupId=div31(1026) MCS_label=N/A
   Priority=1 Nice=0 Account=saral QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=02:46:01 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2020-07-08T07:44:54 EligibleTime=2020-07-08T07:44:54
   AccrueTime=Unknown
   StartTime=2020-07-08T07:44:54 EndTime=2020-07-09T07:44:54 Deadline=N/A
   PreemptEligibleTime=2020-07-08T07:54:54 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-08T07:44:54
   Partition=saral AllocNode:Sid=saral-sub-01.isi.edu:27465
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=saga22
   BatchHost=saga22
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=8G,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=8G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/address/to/job
   WorkDir=/nas/material/users/mdehaven/expts/n001/pashto_e2e
   Power=
```

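If you want the script path programmatically rather than by eye, the Command field can be pulled out of the scontrol output with a short snippet. This is a sketch; the sample text is just the placeholder lines from the output above, and in practice it would come from running `scontrol show job <jobid>`:

```python
import re

# Abridged scontrol output; in practice capture this from
# `scontrol show job <jobid>`.
scontrol_output = """\
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/address/to/job
   WorkDir=/nas/material/users/mdehaven/expts/n001/pashto_e2e
"""

# Extract the batch-script path from the Command= field.
match = re.search(r"^\s*Command=(\S+)", scontrol_output, re.MULTILINE)
print(match.group(1))  # -> /address/to/job
```
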
hassanzadeh commented 4 years ago

Follow-up: it looks like the file referred to in my previous comment is actually removed immediately after job submission, so I guess there is no point going in that direction.

joecummings commented 4 years ago

Closing this, as it appears to have been solved by @hassanzadeh.