Automatically exported from code.google.com/p/ganeti

hail: failure reasons: FailDisk: 12 #705

Closed. GoogleCodeExporter closed this issue 9 years ago.

GoogleCodeExporter commented 9 years ago
Hi,

When we run the following command:
gnt-instance add --submit -t drbd --disk 0:size=10GB --no-name-check \
  --no-ip-check --no-wait-for-sync -B memory=768MB,vcpus=1 \
  -H kvm:cdrom_image_path=/srv/ganeti/iso/debian7.iso,boot_order=cdrom \
  -o linux-image+default VMname

We get the following error:

Job 192186 failed: Failure: prerequisites not met for this operation:
error type: insufficient_resources, error details:
Can't compute nodes using iallocator 'hail': Request failed: Group default 
(preferred): No valid allocation solutions, failure reasons: FailDisk: 12

We had this a few weeks ago; it went away by itself and then came back last week.

As far as I can see, we have sufficient resources.

Creating instances with -n Pnode:Snode does work as expected.
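
For reference, a working manual placement looks roughly like this (a sketch: the node names are placeholders, and the other options mirror the failing command above):

gnt-instance add -t drbd --disk 0:size=10GB --no-name-check --no-ip-check \
  --no-wait-for-sync -B memory=768MB,vcpus=1 -n node01:node02 \
  -o linux-image+default VMname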

I'm not 100% sure this is a bug, as it used to work on the same version, but any help would be appreciated.

Kind regards,

Eadric Wildeboer

Some numbers from the time of the error:

gnt-cluster command vgdisplay:
------------------------------------------------
node: node01
  --- Volume group ---
  VG Name               ganeti
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  729
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                76
  Open LV               76
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1.08 TiB
  PE Size               4.00 MiB
  Total PE              282797
  Alloc PE / Size       129472 / 505.75 GiB
  Free  PE / Size       153325 / 598.93 GiB
  VG UUID               BrfXoa-fKR0-k52q-1xWP-W3Kw-SxYX-ERdGmF

return code = 0
------------------------------------------------
node: node03
  --- Volume group ---
  VG Name               ganeti
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1416
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                64
  Open LV               64
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1.08 TiB
  PE Size               4.00 MiB
  Total PE              282797
  Alloc PE / Size       88064 / 344.00 GiB
  Free  PE / Size       194733 / 760.68 GiB
  VG UUID               CDloX2-f338-5bmw-Bimy-QxoT-mRbZ-0STKNo

return code = 0
------------------------------------------------
node: node04
  --- Volume group ---
  VG Name               ganeti
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1437
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                66
  Open LV               66
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1.08 TiB
  PE Size               4.00 MiB
  Total PE              282797
  Alloc PE / Size       113696 / 444.12 GiB
  Free  PE / Size       169101 / 660.55 GiB
  VG UUID               FybflY-adfy-cfAA-BynT-Ogh3-CHVn-BGq3n3

return code = 0
------------------------------------------------
node: node02
  --- Volume group ---
  VG Name               ganeti
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  598
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                78
  Open LV               78
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1.08 TiB
  PE Size               4.00 MiB
  Total PE              282797
  Alloc PE / Size       155104 / 605.88 GiB
  Free  PE / Size       127693 / 498.80 GiB
  VG UUID               Uc8RsZ-RotK-ck6n-fJTP-qm4U-HGfN-tzrQrV

gnt-cluster command free:

------------------------------------------------
node: node01
             total       used       free     shared    buffers     cached
Mem:      32991304   12237592   20753712          0    5814020     428572
-/+ buffers/cache:    5995000   26996304
Swap:      7811056          0    7811056

return code = 0
------------------------------------------------
node: node03
             total       used       free     shared    buffers     cached
Mem:      32991304    4600256   28391048          0    2206940     404728
-/+ buffers/cache:    1988588   31002716
Swap:      7811040          0    7811040

return code = 0
------------------------------------------------
node: node04
             total       used       free     shared    buffers     cached
Mem:      32991304    6452304   26539000          0    3367552     229468
-/+ buffers/cache:    2855284   30136020
Swap:      5858292          0    5858292

return code = 0
------------------------------------------------
node: node02
             total       used       free     shared    buffers     cached
Mem:      32991304   32723944     267360          0   25835844     651796
-/+ buffers/cache:    6236304   26755000
Swap:      7811056       8488    7802568

return code = 0

These are the details of this cluster:
gnt-cluster (ganeti v2.8.1) 2.8.1
hspace (ganeti) version v2.8.1
compiled with ghc 7.4
running on linux x86_64
Debian 7 with backports for ganeti

Original issue reported on code.google.com by eadric.w...@gmail.com on 4 Feb 2014 at 7:36

GoogleCodeExporter commented 9 years ago
To see if it would solve the problem, we upgraded to Ganeti 2.9.3; however, it did not resolve the issue.

Greetings,

Eadric Wildeboer

Original comment by eadric.w...@gmail.com on 4 Feb 2014 at 8:55

GoogleCodeExporter commented 9 years ago
I just want to make sure that this is not a duplicate of issue 694. How did you set the spindle oversubscription ratio, and how many instances do you have per node?

Original comment by aeh...@google.com on 5 Feb 2014 at 8:55

GoogleCodeExporter commented 9 years ago
Hi,

I hope the following info helps.

node01: 21 instances
node02: 20 instances
node03: 21 instances
node04: 19 instances

All instances are DRBD, with secondaries spread through the cluster in a reasonable balance.

gnt-cluster info | grep spindle
  spindle_count: 1
        spindle-use: 12
        spindle-use: 1
    spindle-use: 1
  spindle-ratio: 32
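
For what it's worth, these values can be traced back to their sources like this (a sketch; the instance name is a placeholder, and I'm assuming spindle-use shows up under the instance's backend parameters while spindle-ratio belongs to the instance policy):

gnt-instance info VMname | grep -i spindle   # per-instance spindle-use
gnt-cluster info | grep -i spindle           # policy ratio and defaults, as above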

Original comment by eadric.w...@gmail.com on 5 Feb 2014 at 11:23

GoogleCodeExporter commented 9 years ago
So, unless I'm misunderstanding something, each node has one spindle, allowing a 32-to-1 oversubscription, i.e. a budget of 32 spindle-uses per node. With about 20 instances per node, and DRBD counting each instance against both its primary and its secondary node, that comes to about 40 spindles used per node. That would at least explain the problem.
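
A rough sketch of that accounting, using the numbers from this cluster (the exact bookkeeping hail does may differ):

# DRBD instances count against both their primary and their secondary node.
budget=$((1 * 32))    # spindle_count * spindle-ratio
demand=$((20 + 20))   # ~20 primaries + ~20 secondaries per node
echo "budget=$budget demand=$demand"   # 40 > 32: no node passes the check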

Does 'gnt-cluster modify --ipolicy-spindle-ratio=128' solve the problem?
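
Applying and re-checking would look like this (the second command is the one quoted above):

gnt-cluster modify --ipolicy-spindle-ratio=128
gnt-cluster info | grep spindle   # spindle-ratio should now read 128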

Original comment by aeh...@google.com on 5 Feb 2014 at 11:48

GoogleCodeExporter commented 9 years ago
Ok,

I've done that, and it does help. Now I get messages about CPU instead; I gather this is expected behavior.

I have two questions:

Is there documentation on best practices for tuning these settings to the available hardware? I believe we are running the default settings.

Is this behavior new since 2.5? This same cluster ran under 2.5 in the past with about twice as many instances as it has now, and I don't recall that we experienced these issues.

Original comment by eadric.w...@gmail.com on 5 Feb 2014 at 3:54

GoogleCodeExporter commented 9 years ago
Hi,

Unfortunately, we currently have no documentation on the proper values. The reason we made them tunable is precisely that we didn't know of a "correct" value that would work for everyone. The right value depends more on the workload of the instances, and on how much they interfere with each other, than on the actual hardware.

Policies, IIRC, were added in 2.6, so that one can avoid inadvertently oversubscribing a node.
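
To see how much headroom the allocator thinks is left under the current policy, hspace (shipped with Ganeti, version shown above) can be queried directly; a sketch, assuming the master daemon is running on the local node:

hspace -L   # -L talks to the local luxi daemon and reports remaining capacity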

Thanks,

Guido

Original comment by ultrot...@google.com on 5 Feb 2014 at 4:02

GoogleCodeExporter commented 9 years ago
Hi,

Fair enough.

Perhaps there could be a forum thread or similar collecting users' experiences, so that over time some trends could surface.

Should I start one?

Original comment by eadric.w...@gmail.com on 5 Feb 2014 at 4:30

GoogleCodeExporter commented 9 years ago
I think this is actually a duplicate of issue 503, although the headline of that one is misleading (I will fix that).

Original comment by hel...@google.com on 12 Feb 2014 at 3:50