olopez32 / ganeti

Automatically exported from code.google.com/p/ganeti

Stalled hbal after long running replace-disks #781


GoogleCodeExporter commented 9 years ago
# gnt-cluster --version
gnt-cluster (ganeti v2.9.3) 2.9.3

# gnt-cluster version
Software version: 2.9.3
Internode protocol: 2090000
Configuration format: 2090000
OS api version: 20
Export interface: 0
VCS version: v2.9.3

# hspace --version
hspace (ganeti) version v2.9.3
compiled with ghc 7.4
running on linux x86_64

What distribution are you using? Debian Wheezy

What steps will reproduce the problem?
1. hbal -L -X (that leads to replace-disks)
2. gnt-cluster verify (might not be needed/relevant)
3. gnt-cluster verify (might not be needed/relevant)

What is the expected output? What do you see instead?
hbal gets stalled after a long-running replace-disks command. hbal should
be able to proceed and queue the remaining jobs.

Please provide any additional information below.
Relevant gnt-job list output:
148259 success INSTANCE_MIGRATE(vm1.gr)
148260 success INSTANCE_REPLACE_DISKS(vm2),INSTANCE_MIGRATE(vm2)
148261 success INSTANCE_MIGRATE(vm3.gr)
148264 success INSTANCE_REPLACE_DISKS(vm4),INSTANCE_MIGRATE(vm4)
148265 success INSTANCE_MIGRATE(vm6)
148268 success INSTANCE_MIGRATE(vm5.gr),INSTANCE_REPLACE_DISKS(vm5.gr)
148270 success CLUSTER_VERIFY
148271 success CLUSTER_VERIFY_CONFIG
148272 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
148273 success CLUSTER_VERIFY
148274 success CLUSTER_VERIFY_CONFIG
148275 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
148277 success INSTANCE_REPLACE_DISKS(vm7),INSTANCE_MIGRATE(vm7)
148285 success CLUSTER_VERIFY
148286 success CLUSTER_VERIFY_CONFIG
148287 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
148300 success CLUSTER_VERIFY
148302 success CLUSTER_VERIFY_CONFIG
148303 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)

148268
<snip>
    OP_INSTANCE_MIGRATE
      Status: success
      Processing start: 2014-03-27 12:21:28.209916
      Execution start:  2014-03-27 12:21:49.471006
      Processing end:   2014-03-27 12:22:25.294181
<snip>
    OP_INSTANCE_REPLACE_DISKS
      Status: success
      Processing start: 2014-03-27 12:22:25.464814
      Execution start:  2014-03-27 12:22:25.593059
      Processing end:   2014-03-27 12:31:05.106991

148272
    OP_CLUSTER_VERIFY_GROUP
      Status: success
      Processing start: 2014-03-27 12:25:05.371858
      Execution start:  2014-03-27 12:31:06.099407
      Processing end:   2014-03-27 12:31:13.275953

148275
    OP_CLUSTER_VERIFY_GROUP
      Status: success
      Processing start: 2014-03-27 12:30:05.412300
      Execution start:  2014-03-27 12:31:06.081994
      Processing end:   2014-03-27 12:31:13.283875

148277 was created by hbal and has finished properly:

<snip>
    OP_INSTANCE_REPLACE_DISKS
      Status: success
      Processing start: 2014-03-27 12:31:14.552945
      Execution start:  2014-03-27 12:31:29.330166
      Processing end:   2014-03-27 12:44:14.963871
<snip>
    OP_INSTANCE_MIGRATE
      Status: success
      Processing start: 2014-03-27 12:44:15.152519
      Execution start:  2014-03-27 12:44:15.434923
      Processing end:   2014-03-27 12:44:45.168525

148285
    OP_CLUSTER_VERIFY
      Status: success
      Processing start: 2014-03-27 13:00:04.963330
      Execution start:  2014-03-27 13:00:05.100860
      Processing end:   2014-03-27 13:00:06.092148

148287
    OP_CLUSTER_VERIFY_GROUP
      Status: success
      Processing start: 2014-03-27 13:00:06.090845
      Execution start:  2014-03-27 13:00:06.943139
      Processing end:   2014-03-27 13:00:16.951912

In the hbal output I can see:
Cluster score improved from 7.24502595 to 1.73841313
Solution length=12
Executing jobset for instances vm1.gr
Got job IDs148259
Executing jobset for instances vm2,vm3.gr
Got job IDs148260,148261
Executing jobset for instances vm4,vm6
Got job IDs148264,148265
Executing jobset for instances vm5.gr
Got job IDs148268
Executing jobset for instances vm7
Got job IDs148277
(after waiting more than 45 minutes I press Ctrl+C)
^CCancel request registered, will exit at the end of the current job set...
^CCancel request registered, will exit at the end of the current job set...
^CCancel request registered, will exit at the end of the current job set...
(hbal needs to be killed here)

It's been more than 45 minutes since hbal sent the last job (148277) to the
queue. Maybe hbal hit some timeout and doesn't move on?

Oh, and please add a space after "Got job IDs":
Got job IDs148264,148265 -> Got job IDs 148264,148265

--- src/Ganeti/HTools/Program/Hbal.hs   2014-03-27 13:18:35.000000000 +0200
+++ src/Ganeti/HTools/Program/Hbal.hs   2014-03-27 13:18:50.000000000 +0200
@@ -201,7 +201,7 @@
   let jobs = map (\(_, idx, move, _) ->
                     map anno $ Cluster.iMoveToJob nl il idx move) js
       descr = map (\(_, idx, _, _) -> Container.nameOf il idx) js
-      logfn = putStrLn . ("Got job IDs" ++) . commaJoin . map (show . fromJobId)
+      logfn = putStrLn . ("Got job IDs " ++) . commaJoin . map (show . fromJobId)
   putStrLn $ "Executing jobset for instances " ++ commaJoin descr
   jrs <- bracket (L.getClient master) L.closeClient $
          Jobs.execJobsWait jobs logfn
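
For reference, a minimal standalone sketch of the patched logging line
(commaJoin is reimplemented here with intercalate, plain Ints stand in for
Ganeti's JobId type; this is an illustration, not the actual Hbal.hs code):

import Data.List (intercalate)

-- Stand-in for Ganeti's commaJoin helper; assumed to join with ",".
commaJoin :: [String] -> String
commaJoin = intercalate ","

-- The patched logging function: note the trailing space in the prefix.
logfn :: [Int] -> IO ()
logfn = putStrLn . ("Got job IDs " ++) . commaJoin . map show

main :: IO ()
main = logfn [148264, 148265]   -- prints: Got job IDs 148264,148265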

Original issue reported on code.google.com by kargig.l...@gmail.com on 27 Mar 2014 at 11:46

GoogleCodeExporter commented 9 years ago
Hi, I bumped into the same issue. Here are some context details:

# gnt-cluster --version
gnt-cluster (ganeti v2.9.3) 2.9.3

# gnt-cluster version
Software version: 2.9.3
Internode protocol: 2090000
Configuration format: 2090000
OS api version: 20
Export interface: 0
VCS version: v2.9.3

# hspace --version
hspace (ganeti) version v2.9.3
compiled with ghc 7.4
running on linux x86_64

The cluster's nodes are running Debian Wheezy 7.8.

What steps will reproduce the problem?

Running 'hbal -L -X' where a 'replace-disks' job is included. That particular
job affected an instance with a 200 GB disk.

What is the expected output? What do you see instead?

The expected behavior for hbal would be to execute the whole series of jobs
calculated to rebalance the cluster. Instead, hbal stalls after successfully
executing the first job, which replaces the disks of an instance with a
200 GB disk and migrates it.

Please provide any additional information below:

775710 success INSTANCE_REPLACE_DISKS(problematic.instance),INSTANCE_MIGRATE(problematic.instance)
775711 success INSTANCE_QUERY_DATA
775714 success CLUSTER_VERIFY
775715 success CLUSTER_VERIFY_CONFIG
775716 success CLUSTER_VERIFY_GROUP(5d3aed89-4f19-4a87-8d0d-cff6159a6926)
775717 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
775718 success INSTANCE_QUERY_DATA

# gnt-job info 775710
Job ID: 775710
  Status: success
  Received:         2015-06-22 13:12:17.127187
  Processing start: 2015-06-22 13:12:17.262588 (delta 0.135401s)
  Processing end:   2015-06-22 13:39:40.079557 (delta 1642.816969s)
  Total processing time: 1642.952370 seconds
  Opcodes:
    OP_INSTANCE_REPLACE_DISKS
      Status: success
      Processing start: 2015-06-22 13:12:17.262588
      Execution start:  2015-06-22 13:12:17.431253
      Processing end:   2015-06-22 13:37:31.807807

    OP_INSTANCE_MIGRATE
      Status: success
      Processing start: 2015-06-22 13:37:32.058913
      Execution start:  2015-06-22 13:38:17.707905
      Processing end:   2015-06-22 13:39:40.079539

I had to manually kill the hbal process at 2015-06-22 14:26, and then
re-issue it to execute the rest of the jobs.

Original comment by alafento...@gmail.com on 22 Jun 2015 at 12:51

GoogleCodeExporter commented 9 years ago
How long is hbal stalled after this?

hbal backs off the interval at which it checks the status of its
submitted jobs until, for very long-running jobs, it checks only
once every 15 seconds. (Also, the next batch of jobs is only
submitted once the previous one has been recognized to have
finished successfully.)
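
For illustration, a minimal self-contained sketch of the capped back-off
polling just described (the function names, the doubling schedule, and the
exact cap handling are assumptions made for this example, not the actual
hbal implementation):

import Control.Concurrent (threadDelay)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Poll until the job set is reported finished, doubling the wait
-- between checks up to a 15-second cap.
waitForJobSet :: IO Bool -> IO ()
waitForJobSet isDone = go 1
  where
    capSeconds = 15
    go delaySec = do
      finished <- isDone
      if finished
        then return ()
        else do
          threadDelay (delaySec * 1000000)  -- threadDelay takes microseconds
          go (min capSeconds (delaySec * 2))

main :: IO ()
main = do
  -- Fake "job set" that completes after five status checks.
  checks <- newIORef (0 :: Int)
  let isDone = do
        modifyIORef' checks (+ 1)
        fmap (>= 5) (readIORef checks)
  waitForJobSet isDone
  putStrLn "job set finished; only now would the next job set be submitted"

The final putStrLn mirrors the parenthetical point above: the next batch of
jobs is only submitted after the wait on the previous one returns.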

Original comment by aeh...@google.com on 22 Jun 2015 at 1:06