oar-team / oar3

OAR: versatile resource and job manager for cluster (third generation)

Jobs in error are not displayed by oarstat #48

Open bzizou opened 3 months ago

bzizou commented 3 months ago

oar=# select state from jobs where job_id = 1905;
 state
-------
 Error

root@dahu-oar3:~# oarstat -fj 1905
root@dahu-oar3:~#

bzizou commented 3 months ago

It seems this is not always the case; it may only affect this particular job (1905).
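
To check whether other jobs are affected, something like the following sketch could be run on the frontend. It only uses the `oarstat -fj <job_id>` call shown above; the helper names `oarstat_output` and `jobs_missing_from_oarstat` are hypothetical, and the list of job ids is assumed to come from a separate query such as `select job_id from jobs where state = 'Error'`.

```python
import subprocess
from typing import Callable, Iterable, List


def oarstat_output(job_id: int) -> str:
    """Run `oarstat -fj <job_id>` (the command from the report) and return its stdout."""
    result = subprocess.run(
        ["oarstat", "-fj", str(job_id)],
        capture_output=True,
        text=True,
        check=False,  # an empty output is what we are looking for, not an exception
    )
    return result.stdout


def jobs_missing_from_oarstat(
    job_ids: Iterable[int],
    run: Callable[[int], str] = oarstat_output,
) -> List[int]:
    """Return the job ids for which oarstat printed nothing at all."""
    return [job_id for job_id in job_ids if not run(job_id).strip()]
```

Calling `jobs_missing_from_oarstat([1904, 1905])` would return the ids whose output is empty, which helps tell a general problem with jobs in `Error` apart from one specific to job 1905. The `run` parameter is only there so the check can be tried without a live OAR installation.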

bzizou commented 3 months ago

SQL row for this job:

-[ RECORD 1 ]---------+------------------------------------------------
job_id                | 1905
array_id              | 1905
array_index           | 1
initial_request       | -a 1904 -S ./apptainer_multifast_128_procs_v3.oar -n test_apptainer_multinodes -t devel -l /nodes=4/core=32,walltime=00:30:00 --project pr-test
job_name              | test_apptainer_multinodes
job_env               |
job_type              | PASSIVE
info_type             | dahu-oar3:
state                 | Error
reservation           | None
message               | Job killed by Leon directly
scheduler_info        |
job_user              | arrondeb
project               | test
job_group             |
command               | ./apptainer_multifast_128_procs_v3.oar
exit_code             |
queue_name            | default
properties            | devel = 'YES'
launching_directory   | /home/arrondeb/WORKSPACE/Codes/DNS/multifast_test
submission_time       | 1711471145
start_time            | 1711483238
stop_time             | 1711483238
file_id               |
accounted             | YES
notify                |
assigned_moldable_job | 0
checkpoint            | 0
checkpoint_signal     | 12
stdout_file           | OAR.test_apptainer_multinodes.%jobid%.stdout
stderr_file           | OAR.test_apptainer_multinodes.%jobid%.stderr
resubmit_job_id       | 0
suspended             | NO
(1 row)
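
The three time columns in this row are easier to read as dates. The sketch below assumes they are Unix epoch seconds (the thread does not say so) and converts them to UTC.

```python
from datetime import datetime, timezone

# Values copied from the row above; assumed to be Unix epoch seconds.
submission_time = 1711471145
start_time = 1711483238
stop_time = 1711483238


def to_utc(epoch_seconds: int) -> datetime:
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


waited = start_time - submission_time  # seconds between submission and start
ran = stop_time - start_time           # seconds between start and stop

print(to_utc(submission_time).isoformat())  # 2024-03-26T16:39:05+00:00
print(to_utc(start_time).isoformat())       # 2024-03-26T20:00:38+00:00
print(waited, ran)                          # 12093 0
```

Under that assumption, the job waited about 3 h 21 min after submission. Its `start_time` and `stop_time` are identical, which is consistent with the "Job killed by Leon directly" message: the job was stopped at the moment it was recorded as started.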