oar-team / oar

OAR is a versatile resource and task manager (also called a batch scheduler) for clusters and other computing infrastructures.
http://oar.imag.fr/
GNU General Public License v2.0
44 stars 23 forks source link

DBD::Pg::db begin_work failed: Already in a transaction at /usr/share/perl5/OAR/IO.pm line 7671 #50

Open npf opened 9 years ago

npf commented 9 years ago

Race condition with judas ?

[debug] [2015-07-13 16:57:33.873] [MetaSched] Starting Meta Scheduler
[debug] [2015-07-13 16:57:33.878] [MetaSched] Retrieve information for already scheduler reservations from database before flush (keep assign resources)
[debug] [2015-07-13 16:57:33.883] [MetaSched] Initialize the gantt structure
[debug] [2015-07-13 16:57:33.888] [MetaSched] Begin processing of already scheduled reservations (accepted with resources assigned)
[debug] [2015-07-13 16:57:33.888] [MetaSched] End processing of already scheduled reservations
[debug] [2015-07-13 16:57:33.889] [MetaSched] Begin processing of current jobs
[debug] [2015-07-13 16:57:33.892] [MetaSched] [128] Add job in database
[debug] [2015-07-13 16:57:33.894] [MetaSched] [128] job is (0,u:,,)
[debug] [2015-07-13 16:57:33.894] [MetaSched] [128] add job occupation in gantt (0,,,)
[debug] [2015-07-13 16:57:33.900] [MetaSched] [125] Add job in database
[debug] [2015-07-13 16:57:33.907] [MetaSched] [125] job is (0,u:,,)
[debug] [2015-07-13 16:57:33.908] [MetaSched] [125] add job occupation in gantt (0,,,)
[debug] [2015-07-13 16:57:33.910] [MetaSched] End processing of current jobs
[debug] [2015-07-13 16:57:33.910] [MetaSched] Begin processing of waiting reservations (accepted reservations which do not have assigned resources yet)
[debug] [2015-07-13 16:57:33.910] [MetaSched] End processing of waiting reservations
[debug] [2015-07-13 16:57:33.912] [MetaSched] Queue admin: No job
[debug] [2015-07-13 16:57:33.913] [MetaSched] Queue kinovis: No job
[debug] [2015-07-13 16:57:33.914] [MetaSched] Queue default: Launching scheduler oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas at time 2015-07-13 16:57:34
[debug] [2015-07-13 16:57:34.163] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] Starting scheduler for queue default at time 1436799454
[debug] [2015-07-13 16:57:34.168] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] Begin phase 1 (running jobs)
[debug] [2015-07-13 16:57:34.182] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [125] add job occupation in gantt of container 0
[debug] [2015-07-13 16:57:34.183] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [128] add job occupation in gantt of container 0
[debug] [2015-07-13 16:57:34.183] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] End phase 1 (running jobs)
[debug] [2015-07-13 16:57:34.187] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] Begin EnergySaving phase
[debug] [2015-07-13 16:57:34.189] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] End EnergySaving phase
[debug] [2015-07-13 16:57:34.189] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] Begin phase 2 (waiting jobs)
[debug] [2015-07-13 16:57:34.195] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [127] Start scheduling (Karma note = 1.33303187813989)
[debug] [2015-07-13 16:57:34.206] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [127] find_first_hole with a timeout of 7.5
[debug] [2015-07-13 16:57:34.210] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [127] No enough matching resources (no_matching_slot)
[debug] [2015-07-13 16:57:34.210] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [127] end scheduling
[debug] [2015-07-13 16:57:34.210] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [131] Start scheduling (Karma note = 1.33303187813989)
[debug] [2015-07-13 16:57:34.221] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [131] find_first_hole with a timeout of 7.5
[debug] [2015-07-13 16:57:34.224] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [131] add job occupation in gantt of container 0
[debug] [2015-07-13 16:57:34.230] [MetaSched] Read on the scheduler output:SCHEDRUN JOB_ID=131 MOLDABLE_JOB_ID=131 RESOURCES=681,678,617,660,712,658,673,652,688,630,610,670,614,685,639,629,657,699,621,675,703,680,638,615,671,662,648,623,711,694,696,674,714,659,653,672,661,641,704,634,683,716,718,665,666,636,647,613,618,640,625,690,655,713,687,663,706,692,715,624,609,656,651,650,695,645,702,708,669,654,686,611,676,720,612,717,644,700,643,689,628,719,705,649,642,620,668,632,698,664,693,637,677,667,616,635,627,626,709,697,679,701,633,707,631,622,646,684,710,619,691,682
[debug] [2015-07-13 16:57:34.230] [MetaSched] Early launch the job 131
[debug] [2015-07-13 16:57:34.232] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [131] end scheduling
[debug] [2015-07-13 16:57:34.233] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] End phase 2 (waiting jobs)
[debug] [2015-07-13 16:57:34.240] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] End of scheduler for queue default
[debug] [2015-07-13 16:57:34.250] [MetaSched] Notify almighty to launch the job 131
[debug] [2015-07-13 16:57:34.250] [Almighty] Appendice received a connection
[debug] [2015-07-13 16:57:34.250] [Almighty] Appendice has read on the socket : OARRUNJOB_131
[debug] [2015-07-13 16:57:34.251] [Almighty][bipbip_launcher] Read on pipe: OARRUNJOB_131
[debug] [2015-07-13 16:57:34.253] [Almighty][bipbip_launcher] Run process: /usr/lib/oar/bipbip 131
[debug] [2015-07-13 16:57:34.253] [Almighty][bipbip_launcher] Nb running bipbip: 1/25; Waiting processes(0): 
[debug] [2015-07-13 16:57:34.253] [Almighty][bipbip_launcher] Check bipbip process duration: job=131, pid=16104, time=1436799454, current_time=1436799454, duration=0s
[debug] [2015-07-13 16:57:34.255] [MetaSched] Queue default: begin processing of waiting reservations
[debug] [2015-07-13 16:57:34.256] [MetaSched] Queue default: end processing of waiting reservations
[debug] [2015-07-13 16:57:34.256] [MetaSched] Queue default: begin processing of new reservations
[debug] [2015-07-13 16:57:34.257] [MetaSched] Queue default: end processing of new reservations
[debug] [2015-07-13 16:57:34.258] [MetaSched] Queue besteffort: No job
[debug] [2015-07-13 16:57:34.258] [MetaSched] Begin precessing of besteffort jobs to kill
[debug] [2015-07-13 16:57:34.260] [MetaSched] End precessing of besteffort jobs to kill
[debug] [2015-07-13 16:57:34.261] [MetaSched] Begin processing of jobs to launch (start time <= 2015-07-13 16:57:34)
[debug] [2015-07-13 16:57:34.265] [MetaSched] End processing of jobs to launch
[debug] [2015-07-13 16:57:34.287] [MetaSched] End of Meta Scheduler
[debug] [2015-07-13 16:57:34.295] [Almighty] /usr/lib/oar/oar_meta_sched terminated :
[debug] [2015-07-13 16:57:34.295] [Almighty] Exit value : 0
[debug] [2015-07-13 16:57:34.295] [Almighty] Signal num : 0
[debug] [2015-07-13 16:57:34.295] [Almighty] Core dumped : 0
[debug] [2015-07-13 16:57:34.295] [Almighty] Current state [Time update]
[debug] [2015-07-13 16:57:34.295] [Almighty] Timeouts check : 2015-7-13 16:57:34
[debug] [2015-07-13 16:57:34.295] [Almighty] Current state [Qget]
[debug] [2015-07-13 16:57:34.295] [Almighty] Got command Time, 99 remaining
[debug] [2015-07-13 16:57:34.295] [Almighty] Command queue : Time
[debug] [2015-07-13 16:57:34.295] [Almighty] Qtype = [Time]
[debug] [2015-07-13 16:57:34.295] [Almighty] Current state [Time update]
[debug] [2015-07-13 16:57:34.295] [Almighty] Timeouts check : 2015-7-13 16:57:34
[debug] [2015-07-13 16:57:34.295] [Almighty] Current state [Qget]
[debug] [2015-07-13 16:57:34.298] [Hulot] Got request 'CHECK'
[debug] [2015-07-13 16:57:34.502] [bipbip 131] JOB: 131; User: neyron; Command: sleep 12h ==> hosts :[kinovis-c1.grenoble.grid5000.fr kinovis-c2.grenoble.grid5000.fr]
Possible precedence issue with control flow operator at /usr/bin/taktuk line 940.
[TAKTUK OUTPUT] digserv: kinovis-c1.grenoble.grid5000.fr (16109): connector > ssh: connect to host kinovis-c1.grenoble.grid5000.fr port 6667: No route to host
[TAKTUK OUTPUT] digserv: kinovis-c1.grenoble.grid5000.fr (16109): state > Connection failed
[TAKTUK OUTPUT] digserv: kinovis-c2.grenoble.grid5000.fr (16110): connector > ssh: connect to host kinovis-c2.grenoble.grid5000.fr port 6667: No route to host
[TAKTUK OUTPUT] digserv: kinovis-c2.grenoble.grid5000.fr (16110): state > Connection failed
[error] [2015-07-13 16:57:37.665] [bipbip 131] /!\ Some nodes are inaccessible (CPUSET_ERROR):
kinovis-c1.grenoble.grid5000.fr kinovis-c2.grenoble.grid5000.fr
[debug] [2015-07-13 16:57:37.670] [Almighty] Appendice received a connection
[debug] [2015-07-13 16:57:37.671] [Almighty] Appendice has read on the socket : ChState
[debug] [2015-07-13 16:57:37.671] [Almighty] Got command ChState, 99 remaining
[debug] [2015-07-13 16:57:37.672] [Almighty] Got command Time, 98 remaining
[debug] [2015-07-13 16:57:37.672] [Almighty] Command queue : ChState Time
[debug] [2015-07-13 16:57:37.672] [Almighty] Qtype = [ChState]
[debug] [2015-07-13 16:57:37.672] [Almighty] Current state [Change node state]
[debug] [2015-07-13 16:57:37.672] [Almighty] Launching command : [/usr/lib/oar/NodeChangeState]
[debug] [2015-07-13 16:57:37.680] [Almighty][bipbip_launcher] Process 16104 for the job 131 ends with exit_code=2, duration=3s
[debug] [2015-07-13 16:57:37.681] [Almighty][bipbip_launcher] Nb running bipbip: 0/25; Waiting processes(0): 
[debug] [2015-07-13 16:57:37.897] [NodeChangeState] Check event for the job 131 with type CPUSET_ERROR
[debug] [2015-07-13 16:57:37.920] [Almighty] Appendice received a connection
[debug] [2015-07-13 16:57:37.921] [Almighty] Appendice has read on the socket : ChState
[info] [2015-07-13 16:57:37.979] [NodeChangeState] error (CPUSET_ERROR) on the nodes:

kinovis-c1.grenoble.grid5000.fr kinovis-c2.grenoble.grid5000.fr

So we are suspecting them
[debug] [2015-07-13 16:57:37.979] [Judas] Mail is not configured
DBD::Pg::db begin_work failed: Already in a transaction at /usr/share/perl5/OAR/IO.pm line 7671.
[info] [2015-07-13 16:57:37.997] [NodeChangeState] We resubmit the job 131 (new id = 132) because the event was CPUSET_ERROR and the job is neither a reservation nor an interactive job.
[debug] [2015-07-13 16:57:38.001] [NodeChangeState] number of resources to change state = 0
commit ineffective with AutoCommit enabled at /usr/share/perl5/OAR/IO.pm line 7687.
[debug] [2015-07-13 16:57:38.006] [Almighty] /usr/lib/oar/NodeChangeState terminated :
[debug] [2015-07-13 16:57:38.006] [Almighty] Exit value : 1
[debug] [2015-07-13 16:57:38.006] [Almighty] Signal num : 0
[debug] [2015-07-13 16:57:38.006] [Almighty] Core dumped : 0
[debug] [2015-07-13 16:57:38.006] [Almighty] Current state [Scheduler]
[debug] [2015-07-13 16:57:38.007] [Almighty] Launching command : [/usr/lib/oar/NodeChangeState]
[debug] [2015-07-13 16:57:38.231] [NodeChangeState] Check event for the job 131 with type RESUBMIT_JOB_AUTOMATICALLY
[debug] [2015-07-13 16:57:38.237] [NodeChangeState] number of resources to change state = 0
[debug] [2015-07-13 16:57:38.246] [Almighty] /usr/lib/oar/NodeChangeState terminated :
[debug] [2015-07-13 16:57:38.246] [Almighty] Exit value : 0
[debug] [2015-07-13 16:57:38.246] [Almighty] Signal num : 0
[debug] [2015-07-13 16:57:38.246] [Almighty] Core dumped : 0
[debug] [2015-07-13 16:57:38.246] [Almighty] Launching command : [/usr/lib/oar/oar_meta_sched]
[debug] [2015-07-13 16:57:38.505] [MetaSched] Starting Meta Scheduler
[debug] [2015-07-13 16:57:38.510] [MetaSched] Retrieve information for already scheduler reservations from database before flush (keep assign resources)
[debug] [2015-07-13 16:57:38.515] [MetaSched] Initialize the gantt structure
[debug] [2015-07-13 16:57:38.519] [MetaSched] Begin processing of already scheduled reservations (accepted with resources assigned)
[debug] [2015-07-13 16:57:38.519] [MetaSched] End processing of already scheduled reservations
[debug] [2015-07-13 16:57:38.519] [MetaSched] Begin processing of current jobs
[debug] [2015-07-13 16:57:38.523] [MetaSched] [128] Add job in database
[debug] [2015-07-13 16:57:38.525] [MetaSched] [128] job is (0,u:,,)
[debug] [2015-07-13 16:57:38.525] [MetaSched] [128] add job occupation in gantt (0,,,)
[debug] [2015-07-13 16:57:38.531] [MetaSched] [125] Add job in database
[debug] [2015-07-13 16:57:38.538] [MetaSched] [125] job is (0,u:,,)
[debug] [2015-07-13 16:57:38.538] [MetaSched] [125] add job occupation in gantt (0,,,)
[debug] [2015-07-13 16:57:38.539] [MetaSched] End processing of current jobs
[debug] [2015-07-13 16:57:38.540] [MetaSched] Begin processing of waiting reservations (accepted reservations which do not have assigned resources yet)
[debug] [2015-07-13 16:57:38.541] [MetaSched] End processing of waiting reservations
[debug] [2015-07-13 16:57:38.542] [MetaSched] Queue admin: No job
[debug] [2015-07-13 16:57:38.543] [MetaSched] Queue kinovis: No job
[debug] [2015-07-13 16:57:38.544] [MetaSched] Queue default: Launching scheduler oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas at time 2015-07-13 16:57:39
capitn commented 9 years ago

I already saw this warning message. This part of the code can be called from different ways and one of them is already inside a transaction so no need to create another one...

npf commented 8 years ago

The warning sounds not harmful... Changed milestone to not for now.