One observation: if I create a Slurm job using the parameters from the DAX file and submit it, the log file is created, so my guess is that the issue is not in the pegasus-wrapper but in the pegasus-run executable. Perhaps pegasus-run cannot understand the --output parameter; another possibility is that the file remains open while the job tries to open it, and it fails. I wanted to test each of these, but Pegasus is still not working for me: the Pegasus job remains in the queue forever.
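One way to test these guesses independently of the wrapper is to submit a trivial Slurm job with an explicit --output path and watch whether the file ever gets content. Below is a rough sketch of that check, not part of the wrapper; the partition name and log path are placeholders:

```python
# Rough sketch: submit a trivial Slurm job with an explicit --output path to
# check whether Slurm itself writes the log. The partition name and log path
# are placeholders, not values taken from the wrapper.
import subprocess
import time
from pathlib import Path

# Placeholder log location; it should live on a shared filesystem (e.g. the
# NAS home directory) so the submit host can see what the compute node writes.
log_path = Path.home() / "slurm_output_test.log"

# --wrap lets sbatch run a one-line command without a separate job script.
submitted = subprocess.run(
    ["sbatch", "--partition=gaia", f"--output={log_path}",
     "--wrap", "echo hello from slurm"],
    capture_output=True, text=True, check=True,
)
print(submitted.stdout.strip())  # e.g. "Submitted batch job 12345"

# Crude polling loop, for illustration only: wait for the log to be written.
for _ in range(60):
    if log_path.exists() and log_path.stat().st_size > 0:
        print("log written:", log_path.read_text().strip())
        break
    time.sleep(5)
else:
    print("log is still empty or missing after waiting")
```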
This issue turns out not to be in the wrapper itself. Here is the evidence.
Therefore, the issue is not the wrapper; the issue is more likely in the Pegasus engine. Ideally, it makes sense for someone who has direct ssh access to the compute nodes to check the job script (whose path can be found using `scontrol show job <jobid>`) to make sure that there is nothing wrong with it.
Overall, I think the reason the logs are not redirected to the output file has to do with pegasus-run, which takes care of these files. Perhaps under the hood Pegasus opens the log files in write mode, which leads to loss of the data. It might also be that Pegasus redirects stdout somewhere else, and hence only an empty log file is created.
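The write-mode concern at least is easy to reproduce outside of Pegasus: reopening an existing log file in "w" mode truncates whatever was already written, while "a" preserves it. A small illustration (plain Python, not Pegasus code):

```python
# Illustration of the write-mode concern (not Pegasus code): reopening a log
# file in "w" truncates it, while "a" appends.
from pathlib import Path

log = Path("/tmp/truncation_demo.log")
log.write_text("output from the job\n")

# Reopening in write mode wipes the earlier contents.
with open(log, "w"):
    pass
print(repr(log.read_text()))  # '' -- the earlier output is gone

# Reopening in append mode keeps it.
log.write_text("output from the job\n")
with open(log, "a") as f:
    f.write("more output\n")
print(repr(log.read_text()))  # 'output from the job\nmore output\n'
```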
> Therefore, the issue is not the wrapper; the issue is more likely in the Pegasus engine. Ideally, it makes sense for someone who has direct ssh access to the compute nodes to check the job script (whose path can be found using `scontrol show job <jobid>`) to make sure that there is nothing wrong with it.
Please write up step-by-step exactly the commands you would like someone with ssh
access to run and I will see about getting it done.
> Overall, I think the reason the logs are not redirected to the output file has to do with pegasus-run, which takes care of these files. Perhaps under the hood Pegasus opens the log files in write mode, which leads to loss of the data. It might also be that Pegasus redirects stdout somewhere else, and hence only an empty log file is created.
The source code of Pegasus is here. You should be able to examine it and determine exactly what it is doing when it runs a job.
Steps to run a simple pipeline:
where the parameters can be something like:
dir: "%project_root%/experiments"
conda_base_path: /nas/home/hhasan/miniconda3/
conda_environment: "event-gpu-py36"
slurm:
partition: "gaia"
account: "gaia"
num_cpus: 1
num_gpus: 0
memory: "1G ```
2. The above command will create an experiment directory. I then create a pipeline and run it with the following commands (on saga03):
a. `ssh saga03`
b. `pegasus-plan --conf pegasus.conf --dax Test.dax --dir $HOME/run`
c. `pegasus-run /nas/home/hhasan/run/hhasan/pegasus/Test/runxxx`
3. This will generate two job directories, one for each job (i.e. /working/jobs/multiply and /working/jobs/sort). The log file inside each directory (e.g. /working/jobs/multiply/___stdout.log) remains empty even after the job finishes.
Note that if I use the script created by the wrapper (e.g. /working/jobs/multiply/___run.sh) to create a Slurm job myself, the output is written to the ___stdout.log file, but with pegasus-run it is not.
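As a quick way to confirm the symptom after each run, something like the following can walk the generated job directories and report which logs are still empty (a throwaway sketch; the /working/jobs layout and the ___stdout.log name are taken from the example paths above):

```python
# Throwaway sketch: report which per-job log files are still empty after a run.
# The /working/jobs layout and the ___stdout.log file name are taken from the
# example paths above; adjust them to the actual experiment directory.
from pathlib import Path

jobs_root = Path("/working/jobs")

for log in sorted(jobs_root.glob("*/___stdout.log")):
    size = log.stat().st_size
    status = "EMPTY" if size == 0 else f"{size} bytes"
    print(f"{log.parent.name}: {status}")
```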
I will try to see if I can spot the issue in the pegasus source code.
@hassanzadeh: Your directions tell me how to run a pipeline, but they don't tell me what you want me to do with my admin-ssh powers. What are you trying to find out?
Then, once you submit the job, run `watch -n 1 squeue -a -u $USER` to get the IDs of the jobs that were submitted.
Once you have a job ID, you can see its status with `scontrol show job <jobid>`.
You should see something like the following, where the Command field shows where the job script is:
```
UserId=mdehaven(7631) GroupId=div31(1026) MCS_label=N/A
Priority=1 Nice=0 Account=saral QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=02:46:01 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2020-07-08T07:44:54 EligibleTime=2020-07-08T07:44:54
AccrueTime=Unknown
StartTime=2020-07-08T07:44:54 EndTime=2020-07-09T07:44:54 Deadline=N/A
PreemptEligibleTime=2020-07-08T07:54:54 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-08T07:44:54
Partition=saral AllocNode:Sid=saral-sub-01.isi.edu:27465
ReqNodeList=(null) ExcNodeList=(null)
NodeList=saga22
BatchHost=saga22
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=8G,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryNode=8G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/address/to/job
WorkDir=/nas/material/users/mdehaven/expts/n001/pashto_e2e
Power=
```
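If it is useful, that check can be scripted so the Command (job script) path is grabbed as soon as the job appears; here is a rough sketch using squeue and scontrol, with deliberately simplistic output parsing:

```python
# Rough sketch of scripting the check above: list this user's jobs with squeue,
# then pull the Command (job script path) and WorkDir out of "scontrol show job".
# The field parsing is simplistic and for illustration only.
import getpass
import subprocess

def my_job_ids():
    # "-h" suppresses the header; "-o %i" prints only the job id.
    out = subprocess.run(
        ["squeue", "-h", "-u", getpass.getuser(), "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()

def job_fields(job_id):
    out = subprocess.run(
        ["scontrol", "show", "job", job_id],
        capture_output=True, text=True, check=True,
    ).stdout
    # scontrol prints whitespace-separated Key=Value pairs; split them naively
    # (this breaks on values containing spaces, which is fine for a sketch).
    return dict(tok.split("=", 1) for tok in out.split() if "=" in tok)

for job_id in my_job_ids():
    fields = job_fields(job_id)
    print(job_id, fields.get("Command"), fields.get("WorkDir"))
```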
Follow-up: it looks like the file referred to in my previous comment is actually removed immediately after job submission, so I guess there is no point in going in that direction.
Closing this as it appears to have been solved by @hassanzadeh.
We currently try to do this here:
https://github.com/isi-vista/vista-pegasus-wrapper/blob/master/pegasus_wrapper/workflow.py#L205
but the output is always empty.
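For comparison, the path that does work (submitting the wrapper's ___run.sh directly) is roughly equivalent to generating a batch script whose #SBATCH --output directive points at the log file, along the lines of the sketch below. The paths, job name, and the extra submit script are placeholders for illustration, not the wrapper's actual code generation:

```python
# For comparison: a minimal, hand-rolled version of the redirection that works
# when a job is submitted with sbatch directly. The paths, job name, and the
# extra submit script are placeholders, not the wrapper's actual code.
import subprocess
from pathlib import Path

job_dir = Path("/working/jobs/multiply")   # placeholder job directory
log_file = job_dir / "___stdout.log"
run_script = job_dir / "___run.sh"         # script produced by the wrapper

batch_script = job_dir / "submit_test.sh"
batch_script.write_text(
    "#!/bin/bash\n"
    "#SBATCH --job-name=multiply_test\n"
    f"#SBATCH --output={log_file}\n"
    f"#SBATCH --error={log_file}\n"
    f"bash {run_script}\n"
)

# Submitting this directly with sbatch writes stdout/stderr into ___stdout.log,
# which is exactly the behavior we do not see when the same job goes through
# pegasus-run.
subprocess.run(["sbatch", str(batch_script)], check=True)
```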