Open GoogleCodeExporter opened 9 years ago
Hi, I bumped into the same issue. Here's some context details:
# gnt-cluster --version
gnt-cluster (ganeti v2.9.3) 2.9.3
# gnt-cluster version
Software version: 2.9.3
Internode protocol: 2090000
Configuration format: 2090000
OS api version: 20
Export interface: 0
VCS version: v2.9.3
# hspace --version
hspace (ganeti) version v2.9.3
compiled with ghc 7.4
running on linux x86_64
Cluster's nodes are running Debian Wheezy 7.8
What steps will reproduce the problem?
running 'hbal -L -X' where a 'replace-disks' job is included. That particular
job affected an instance with 200GB disk.
What is the expected output? What do you see instead?
Expected behavior for hbal would be to execute the whole series of jobs
calculated to rebalance the cluster. Instead, hbal stalls after successfully
executing the first job which is replacing-disks and migrating an instance with
200GB disk.
Please provide any additional information below:
775710 success
INSTANCE_REPLACE_DISKS(problematic.instance),INSTANCE_MIGRATE(problematic.instan
ce)
775711 success INSTANCE_QUERY_DATA
775714 success CLUSTER_VERIFY
775715 success CLUSTER_VERIFY_CONFIG
775716 success CLUSTER_VERIFY_GROUP(5d3aed89-4f19-4a87-8d0d-cff6159a6926)
775717 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
775718 success INSTANCE_QUERY_DATA
# gnt-job info 775710
Job ID: 775710
Status: success
Received: 2015-06-22 13:12:17.127187
Processing start: 2015-06-22 13:12:17.262588 (delta 0.135401s)
Processing end: 2015-06-22 13:39:40.079557 (delta 1642.816969s)
Total processing time: 1642.952370 seconds
Opcodes:
OP_INSTANCE_REPLACE_DISKS
Status: success
Processing start: 2015-06-22 13:12:17.262588
Execution start: 2015-06-22 13:12:17.431253
Processing end: 2015-06-22 13:37:31.807807
OP_INSTANCE_MIGRATE
Status: success
Processing start: 2015-06-22 13:37:32.058913
Execution start: 2015-06-22 13:38:17.707905
Processing end: 2015-06-22 13:39:40.079539
I had to manually kill hbal process at 2015-06-22 14:26, and then re-issue it
to execute the rest of the commands.
Original comment by alafento...@gmail.com
on 22 Jun 2015 at 12:51
How long is hbal stalled after this?
hbal is backing off the intervals at which it checks for
the status of its submitted jobs till, for very long running
jobs, it checks only once every 15 seconds. (Also, the next
batch of jobs is only submitted, once the previous was recognized
to have finished successfully.)
Original comment by aeh...@google.com
on 22 Jun 2015 at 1:06
Original issue reported on code.google.com by
kargig.l...@gmail.com
on 27 Mar 2014 at 11:46