oar-team / oar3

OAR: versatile resource and job manager for cluster (third generation)

Jobs dependencies bugged (scheduler failing) #47

Closed bzizou closed 3 months ago

bzizou commented 3 months ago

OAR no longer schedules any jobs when a dependency is required:

[    INFO] [2024-03-27 09:47:47,941] [oar.kamelot::schedule_id_jobs_ct:431]: job(1905) in dependencies for job(1930) is in error state
[    INFO] [2024-03-27 09:47:47,942] [oar.kamelot::schedule_id_jobs_ct:461]: job(1930) can't be scheduled due to dependencies
[   DEBUG] [2024-03-27 09:47:47,942] [oar.kamelot::schedule_id_jobs_ct:422]: Schedule job:1931                             
Traceback (most recent call last):                                                                                                      
  File "/usr/lib/oar/kao", line 33, in <module>
    sys.exit(load_entry_point('oar==3.0.0.dev11', 'console_scripts', 'kao')())           
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                 
  File "/usr/lib/python3/dist-packages/oar/kao/kao.py", line 18, in main                                     
    return meta_schedule(session, config, config["METASCHEDULER_MODE"])                       
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^             
  File "/usr/lib/python3/dist-packages/oar/kao/meta_sched.py", line 954, in meta_schedule        
    call_internal_scheduler(                                                                                                            
  File "/usr/lib/python3/dist-packages/oar/kao/meta_sched.py", line 775, in call_internal_scheduler                                     
    internal_schedule_cycle(                                          
  File "/usr/lib/python3/dist-packages/oar/kao/kamelot.py", line 97, in internal_schedule_cycle                 
    schedule_id_jobs_ct(                                                                                                                
  File "/usr/lib/python3/dist-packages/oar/kao/scheduling.py", line 445, in schedule_id_jobs_ct     
    job_dep_stop_time = job_dep.start_time + job_dep.walltime                                                                           
                                             ^^^^^^^^^^^^^^^^                                                                           
augu5te commented 3 months ago

Fixed: a dependency on a job in the Error state is now ignored (same behavior as OAR2).
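
For readers who want to see the intended behavior, here is a minimal sketch of "ignore dependencies in Error" when computing the earliest start time a job's dependencies allow. It is not the actual OAR patch: the `JobDep` class, the `min_start_time_from_deps` function, and the rule that a dependency with no `start_time` or `walltime` blocks scheduling are all assumptions made for illustration. Only the `start_time + walltime` expression and the "Error" state come from the log above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class JobDep:
    """Hypothetical stand-in for a dependency record (not OAR's real class)."""
    job_id: int
    state: str
    start_time: Optional[int] = None
    walltime: Optional[int] = None


def min_start_time_from_deps(deps: Iterable[JobDep]) -> Optional[int]:
    """Return the earliest start time allowed by the dependencies.

    Returns None when the job cannot be scheduled yet (assumption: a
    dependency that has no start_time or walltime is not yet placed).
    """
    min_start = 0
    for dep in deps:
        # Behavior described in the fix: a dependency in Error is ignored.
        if dep.state == "Error":
            continue
        # Guard against missing fields before doing the arithmetic.
        if dep.start_time is None or dep.walltime is None:
            return None
        # Same expression as in the traceback: the dependency's stop time.
        job_dep_stop_time = dep.start_time + dep.walltime
        min_start = max(min_start, job_dep_stop_time)
    return min_start
```

With this sketch, a job whose only dependency is in Error gets `0` (no constraint), while a job that also depends on a running job placed at `start_time=100` with `walltime=50` gets `150`.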