mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

Error message for successfully terminated jobs on LiDO3 #219

Open JonasRieger opened 5 years ago

JonasRieger commented 5 years ago

I got a confusing error message:

> getStatus()
Status for 100 jobs at 2019-02-08 16:09:42:
  Submitted    : 100 (100.0%)
  -- Queued    :   0 (  0.0%)
  -- Started   : 100 (100.0%)
  ---- Running :   0 (  0.0%)
  ---- Done    :  99 ( 99.0%)
  ---- Error   :   1 (  1.0%)
  ---- Expired :   0 (  0.0%)
> getErrorMessages()
   job.id terminated error                                                                  message
1:     28       TRUE  TRUE Error in buildMatrix(similarities = similarities) : object 'x' not found
> getLog(28)
...
[153] "### [bt]: Job terminated successfully [batchtools job.id=28]"                                   
[154] "### [bt]: Calculation finished!"  
> getJobStatus(ids = 26:30)[, c("submitted", "started", "done", "batch.id", "time.queued", "time.running")]
             submitted             started                done batch.id      time.queued  time.running
1: 2019-02-07 08:37:53 2019-02-07 08:37:57 2019-02-07 21:59:15  3460467      4.3320 secs 48077.95 secs
2: 2019-02-07 08:37:53 2019-02-07 08:37:57 2019-02-07 21:50:38  3460468      4.2805 secs 47560.80 secs
3: 2019-02-07 08:37:53 2019-02-06 16:11:19 2019-02-07 14:22:28  3460469 -59194.1809 secs 79869.04 secs
4: 2019-02-07 08:37:53 2019-02-07 08:37:57 2019-02-07 21:51:33  3460470      4.2044 secs 47615.03 secs
5: 2019-02-07 08:37:53 2019-02-07 08:37:57 2019-02-07 21:51:20  3460471      4.1385 secs 47602.88 secs

All jobs returned a valid result. The error message above (object 'x' not found) is exactly the one I got yesterday, quite rightly, because of a typo. Afterwards I deleted the whole registry (folder) by hand, created an identically named one with the typo corrected, and submitted all jobs once again.

My question is whether the error message could somehow have been cached and sneaked into my new registry.

mllg commented 5 years ago

Please make a backup of the registry; I want to have a look at it next week.

mllg commented 5 years ago

After the backup, just re-submit the job in question, along with all other jobs that have "time.queued" < 0.
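Job 28 is the third row of the getJobStatus() table above: its "started" timestamp (2019-02-06 16:11:19) lies before its "submitted" timestamp (2019-02-07 08:37:53), which is why its queue time is negative. That fits the suspicion that its status was carried over from the earlier run. The following is a minimal base-R sketch of the selection rule suggested above, applied to the timestamps copied from that table. The helper `findNegativeQueueTime` is hypothetical and not part of batchtools, and the UTC time zone is an assumption, since the output does not show one.

```r
# Timestamps copied from the getJobStatus(ids = 26:30) output above.
# The time zone is not shown there; UTC is assumed for both columns,
# which does not change the differences between them.
stat <- data.frame(
  job.id    = 26:30,
  submitted = as.POSIXct(rep("2019-02-07 08:37:53", 5), tz = "UTC"),
  started   = as.POSIXct(c("2019-02-07 08:37:57", "2019-02-07 08:37:57",
                           "2019-02-06 16:11:19", "2019-02-07 08:37:57",
                           "2019-02-07 08:37:57"), tz = "UTC")
)

# Hypothetical helper (not part of batchtools): returns the ids of jobs
# whose start time lies before their submission time, i.e. time.queued < 0.
findNegativeQueueTime <- function(stat) {
  # POSIXct values are seconds since the epoch, so the difference is in seconds
  queued <- as.numeric(stat$started) - as.numeric(stat$submitted)
  stat$job.id[!is.na(queued) & queued < 0]
}

ids <- findNegativeQueueTime(stat)
print(ids)  # [1] 28
```

On the real registry, the same check could presumably be run on the full output of `getJobStatus()`, with the resulting ids reset via `resetJobs()` and then passed to `submitJobs()`. This is an untested suggestion, not a procedure confirmed in the thread.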