alahiff opened this issue 5 years ago
Example, before the DAG was deleted:
$ prominence list
ID     NAME                                                   CREATED              STATUS   ELAPSED     IMAGE     CMD
25754  lammps-stfc-many-with-db12-and-coremark/lammps-stfc/6  2019-11-28 19:20:33  running  0+11:57:11  python:2  python DIRACbenchmark.py --iterations=4 wholenode
25755  lammps-stfc-many-with-db12-and-coremark/lammps-stfc/7  2019-11-28 19:20:33  running  0+11:57:11  python:2  python DIRACbenchmark.py --iterations=4 wholenode
25756  lammps-stfc-many-with-db12-and-coremark/lammps-stfc/8  2019-11-28 19:20:36  running  0+11:58:12  python:2  python DIRACbenchmark.py --iterations=4 wholenode
After the DAG was deleted:
$ prominence list
ID     NAME                                                   CREATED              STATUS   ELAPSED     IMAGE     CMD
25754  lammps-stfc-many-with-db12-and-coremark/lammps-stfc/6  2019-11-28 19:20:33  failed   0+00:01:03  python:2  python DIRACbenchmark.py --iterations=4 wholenode
25755  lammps-stfc-many-with-db12-and-coremark/lammps-stfc/7  2019-11-28 19:20:33  failed   0+00:01:03  python:2  python DIRACbenchmark.py --iterations=4 wholenode
25756  lammps-stfc-many-with-db12-and-coremark/lammps-stfc/8  2019-11-28 19:20:36  failed   0+00:01:00  python:2  python DIRACbenchmark.py --iterations=4 wholenode
Notice that the elapsed time has also changed: it dropped from almost 12 hours to about 1 minute.
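To put numbers on the change, here is a minimal sketch that converts an ELAPSED value to seconds. It assumes the column always uses the days+HH:MM:SS format seen above; `elapsed_to_seconds` is a hypothetical helper, not part of the prominence CLI.

```python
def elapsed_to_seconds(elapsed: str) -> int:
    """Convert an ELAPSED value such as '0+11:57:11' to seconds.

    Assumes the format is days+HH:MM:SS, as in the listings above.
    """
    days, clock = elapsed.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds


before = elapsed_to_seconds("0+11:57:11")  # job 25754 while still running
after = elapsed_to_seconds("0+00:01:03")   # job 25754 after the DAG was deleted
print(before, after)  # 43031 63
```

So for job 25754 the reported elapsed time went from 43031 s to 63 s.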
The status was fixed in https://github.com/prominence-eosc/prominence/commit/1d95f171d18d360b3cdf7b7aaf7334656714161d, but the elapsed time is still not correct. The events are also incorrect, e.g. for a job which ran for almost 12 hours:
"events": {
"createTime": "2019-11-28 19:20:33",
"startTime": "2019-11-28 19:27:03",
"endTime": "2019-11-28 19:28:06"
},
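The elapsed time shown after deletion appears to be simply endTime minus startTime from these events. A minimal sketch to check this, assuming the timestamps use the `%Y-%m-%d %H:%M:%S` format above; `format_elapsed` is a hypothetical helper that mimics the ELAPSED column.

```python
from datetime import datetime

# Event timestamps copied from the job JSON above
events = {
    "createTime": "2019-11-28 19:20:33",
    "startTime": "2019-11-28 19:27:03",
    "endTime": "2019-11-28 19:28:06",
}
FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def format_elapsed(seconds: int) -> str:
    """Format seconds as days+HH:MM:SS, the style used by the ELAPSED column."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}+{hours:02d}:{minutes:02d}:{secs:02d}"


start = datetime.strptime(events["startTime"], FMT)
end = datetime.strptime(events["endTime"], FMT)
elapsed = int((end - start).total_seconds())
print(elapsed, format_elapsed(elapsed))  # 63 0+00:01:03
```

The result, 0+00:01:03, matches what `prominence list` reports for jobs 25754 and 25755 after deletion, which suggests the wrong endTime is the root cause of the wrong elapsed time.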
There doesn't appear to be any end epoch listed in a job created by DAGMan when the DAG job was deleted. The routed job does have LastVacateTime, so LastVacateTime was added to PROMINENCE_ATTRS_TO_COPY.
list_jobs needs to be updated to check for LastVacateTime and use it if necessary.
Original job:
# condor_history -m 1 25754 -af EnteredCurrentStatus
1574969286
which corresponds to Thursday, 28 November 2019 19:28:06. For the routed job:
# condor_history -m 1 25760 -af EnteredCurrentStatus
1575012594
which corresponds to Friday, 29 November 2019 07:29:54, which is what we want.
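EnteredCurrentStatus is a Unix epoch in seconds, so the conversions above can be checked with Python's standard library. This sketch assumes the quoted times are UTC (interpreting the epochs as UTC reproduces them exactly).

```python
from datetime import datetime, timedelta, timezone

# EnteredCurrentStatus values returned by condor_history above
original_job = 1574969286  # job 25754 (the DAG node job)
routed_job = 1575012594    # job 25760 (the routed job)

for label, epoch in (("original", original_job), ("routed", routed_job)):
    # Interpret the epoch as seconds since 1970-01-01 00:00:00 UTC
    print(label, datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat())

# How much later the routed job's status change was recorded
print(timedelta(seconds=routed_job - original_job))  # 12:01:48
```

The original job's value is about 12 hours too early, consistent with the endTime shown in the events.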
Why is EnteredCurrentStatus updated on the routed job but not the original? Maybe adding EnteredCurrentStatus to PROMINENCE_ATTRS_TO_COPY and removing LastVacateTime will help?
Note that even with EnteredCurrentStatus in PROMINENCE_ATTRS_TO_COPY, at least sometimes it wasn't copied to the original job ClassAd, i.e. a completed job would have EnteredCurrentStatus as the time the job started running.
Added LastVacateTime back into PROMINENCE_ATTRS_TO_COPY, and implemented changes in https://github.com/prominence-eosc/prominence/commit/a5fce7ef26c32ad8bb9de4f7a8e6aaf5968ce914 to check whether LastVacateTime gives a sensible time to use as a job's endTime.
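The check in that commit isn't reproduced here. The following is only a sketch of one way to pick an endTime, assuming the relevant ClassAd values are available as epoch-second integers. The function name `choose_end_time` and the "sensible" rules (LastVacateTime must fall between the start time and now, and be later than the possibly stale EnteredCurrentStatus) are assumptions, not prominence's actual logic. Because the real LastVacateTime value isn't given above, the usage lines use the routed job's EnteredCurrentStatus (1575012594) as an illustrative stand-in.

```python
import time
from typing import Optional


def choose_end_time(entered_current_status: int,
                    start_time: int,
                    last_vacate_time: Optional[int] = None,
                    now: Optional[int] = None) -> int:
    """Pick an end time (Unix epoch seconds) for a finished job.

    Hypothetical logic: prefer LastVacateTime when present and plausible,
    otherwise fall back to EnteredCurrentStatus.
    """
    if now is None:
        now = int(time.time())
    if last_vacate_time is not None:
        # Assumed sanity checks: after the job started, not in the future,
        # and later than the EnteredCurrentStatus copied to the original job
        if start_time <= last_vacate_time <= now and last_vacate_time > entered_current_status:
            return last_vacate_time
    return entered_current_status


start = 1574969223  # 2019-11-28 19:27:03 UTC, the startTime event above
print(choose_end_time(1574969286, start, last_vacate_time=1575012594, now=1575100000))  # 1575012594
print(choose_end_time(1574969286, start))  # 1574969286 (no LastVacateTime)
```

Falling back to EnteredCurrentStatus keeps the behaviour unchanged for jobs that were never routed or never vacated.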
Of course, the status of these jobs should be "deleted", not "failed".