When a workflow is deleted, individual jobs have status "failed"

alahiff commented 5 years ago

Of course, their status should be "deleted".

alahiff commented 5 years ago

Example: before deletion:

$ prominence list
ID      NAME                                                    CREATED               STATUS    ELAPSED      IMAGE      CMD                                              
25754   lammps-stfc-many-with-db12-and-coremark/lammps-stfc/6   2019-11-28 19:20:33   running   0+11:57:11   python:2   python DIRACbenchmark.py --iterations=4 wholenode
25755   lammps-stfc-many-with-db12-and-coremark/lammps-stfc/7   2019-11-28 19:20:33   running   0+11:57:11   python:2   python DIRACbenchmark.py --iterations=4 wholenode
25756   lammps-stfc-many-with-db12-and-coremark/lammps-stfc/8   2019-11-28 19:20:36   running   0+11:58:12   python:2   python DIRACbenchmark.py --iterations=4 wholenode

After:

$ prominence list
ID      NAME                                                    CREATED               STATUS   ELAPSED      IMAGE      CMD                                              
25754   lammps-stfc-many-with-db12-and-coremark/lammps-stfc/6   2019-11-28 19:20:33   failed   0+00:01:03   python:2   python DIRACbenchmark.py --iterations=4 wholenode
25755   lammps-stfc-many-with-db12-and-coremark/lammps-stfc/7   2019-11-28 19:20:33   failed   0+00:01:03   python:2   python DIRACbenchmark.py --iterations=4 wholenode
25756   lammps-stfc-many-with-db12-and-coremark/lammps-stfc/8   2019-11-28 19:20:36   failed   0+00:01:00   python:2   python DIRACbenchmark.py --iterations=4 wholenode

Notice that the elapsed time has also changed.

alahiff commented 5 years ago

Status fixed in https://github.com/prominence-eosc/prominence/commit/1d95f171d18d360b3cdf7b7aaf7334656714161d, but the elapsed time is still not correct. The events are also incorrect, e.g. for a job which ran for almost 12 hours:

  "events": {
    "createTime": "2019-11-28 19:20:33",
    "startTime": "2019-11-28 19:27:03",
    "endTime": "2019-11-28 19:28:06"
  },
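To make the discrepancy concrete, here is a minimal sketch (not PROMINENCE code; the helper name `format_elapsed` is hypothetical) that computes the elapsed time from the events above, in the same `D+HH:MM:SS` form that `prominence list` prints:

```python
from datetime import datetime

# Timestamp layout used in the "events" block above
FMT = "%Y-%m-%d %H:%M:%S"

def format_elapsed(start, end):
    """Return end - start formatted as D+HH:MM:SS (hypothetical helper)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total = int(delta.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}+{hours:02d}:{minutes:02d}:{seconds:02d}"

events = {
    "createTime": "2019-11-28 19:20:33",
    "startTime": "2019-11-28 19:27:03",
    "endTime": "2019-11-28 19:28:06",
}

# Prints 0+00:01:03, matching the wrong ELAPSED value shown after deletion
print(format_elapsed(events["startTime"], events["endTime"]))
```

So the short elapsed time in the listing follows directly from the wrong endTime, rather than being a separate bug.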

alahiff commented 5 years ago

There doesn't appear to be any end epoch listed in a job created by DAGMan where the DAG job was deleted. The routed job has LastVacateTime, so I added LastVacateTime to PROMINENCE_ATTRS_TO_COPY.

Need to update list_jobs to check for LastVacateTime and use this if necessary.

alahiff commented 4 years ago

Original job:

# condor_history -m 1 25754 -af EnteredCurrentStatus
1574969286

which corresponds to Thursday, 28 November 2019 19:28:06. For the routed job:

# condor_history -m 1 25760 -af EnteredCurrentStatus
1575012594

which corresponds to Friday, 29 November 2019 07:29:54, which is what we want.
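For reference, a small sketch of the conversion, assuming the timestamps quoted here are in UTC (the helper name `epoch_to_utc` is just for illustration):

```python
from datetime import datetime, timezone

def epoch_to_utc(epoch):
    """Format a Unix epoch as a readable UTC timestamp (illustrative helper)."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime(
        "%A, %d %B %Y %H:%M:%S"
    )

original_job = 1574969286  # EnteredCurrentStatus of job 25754
routed_job = 1575012594    # EnteredCurrentStatus of job 25760

print(epoch_to_utc(original_job))  # Thursday, 28 November 2019 19:28:06
print(epoch_to_utc(routed_job))    # Friday, 29 November 2019 07:29:54

# Gap between the two values, in seconds (43308, i.e. about 12 hours)
print(routed_job - original_job)
```

The two values differ by roughly 12 hours, which is consistent with the original job's EnteredCurrentStatus being stuck near the start of the run.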

Why is EnteredCurrentStatus updated on the routed job but not the original? Maybe adding EnteredCurrentStatus to PROMINENCE_ATTRS_TO_COPY and removing LastVacateTime will help?

alahiff commented 4 years ago

Note that even with EnteredCurrentStatus in PROMINENCE_ATTRS_TO_COPY, at least sometimes it wasn't copied to the original job ClassAd, i.e. a completed job would have EnteredCurrentStatus as the time the job started running.

Added LastVacateTime back into PROMINENCE_ATTRS_TO_COPY, and implemented changes in https://github.com/prominence-eosc/prominence/commit/a5fce7ef26c32ad8bb9de4f7a8e6aaf5968ce914 to check if LastVacateTime gives a sensible time to use as a job's endTime.
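A rough sketch of what such a check could look like. This is not the code from the commit: the function name `pick_end_time` and the definition of "sensible" (LastVacateTime later than both the start time and EnteredCurrentStatus) are assumptions for illustration, and the job ClassAd is modelled as a plain dict.

```python
def pick_end_time(job_ad):
    """Choose an end epoch for a job (hypothetical helper, not the real code).

    Assumption: LastVacateTime is "sensible" when it is later than both
    JobStartDate and EnteredCurrentStatus; otherwise fall back to
    EnteredCurrentStatus.
    """
    start = int(job_ad.get("JobStartDate", 0))
    end = int(job_ad.get("EnteredCurrentStatus", 0))
    vacate = int(job_ad.get("LastVacateTime", 0))
    if vacate > start and vacate > end:
        return vacate
    return end

# Illustrative values only: start and EnteredCurrentStatus follow the events
# shown earlier in this issue; the LastVacateTime value is assumed.
deleted_job = {
    "JobStartDate": 1574969223,
    "EnteredCurrentStatus": 1574969286,
    "LastVacateTime": 1575012594,
}
print(pick_end_time(deleted_job))  # 1575012594

# A job without LastVacateTime keeps EnteredCurrentStatus as its end time
print(pick_end_time({"JobStartDate": 100, "EnteredCurrentStatus": 200}))  # 200
```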

prominence-eosc / prominence

When a workflow is deleted, individual jobs have status "failed" #98