openshift / origin-server

OpenShift 2 (deprecated)

Gears exceeding quota should be allowed to stop, start, move #6363

Closed sallyom closed 8 years ago

sallyom commented 8 years ago

bz 1292133 https://bugzilla.redhat.com/show_bug.cgi?id=1292133

Create a 1% quota buffer so that gears exceeding allowed quota can still be moved, stopped or started.
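The proposed buffer can be sketched roughly like this (illustrative Ruby with assumed names, not the actual origin-server code). Integer arithmetic avoids float-rounding surprises when taking the ceiling of a 1% bump:

```ruby
# Hypothetical helper: bump a quota limit by a small percentage buffer
# so a gear at 100% of quota can still be stopped, started, or moved.
QUOTA_BUFFER_PCT = 1 # assumed 1% buffer

def buffered_limit(limit, pct = QUOTA_BUFFER_PCT)
  # Integer ceiling of limit * (1 + pct/100), no float rounding.
  (limit * (100 + pct) + 99) / 100
end

buffered_limit(1_048_576) # => 1_059_062 blocks (for a 1 GB gear)
buffered_limit(80_000)    # => 80_800 inodes
```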

sallyom commented 8 years ago

output:

[fakeapp-sally.dev.rhcloud.com 56be083a97846171ea0000a4]> quota
Disk quotas for user 56be083a97846171ea0000a4 (uid 5578):
     Filesystem   blocks  quota    limit  grace  files  quota  limit  grace
     /dev/xvda1 1048576*      0  1048576           203      0  80000

Successfully moved gear with uuid '56be083a97846171ea0000a4' of app 'fakeapp' from 'ip-172-18-5-113' to 'ip-172-18-2-173'

[root@ip-172-18-5-113 ~]# rhc app show fakeapp --gears quota

Gear                      Cartridges  Used  Limit
------------------------  ----------  ----  -----
56be083a97846171ea0000a4  php-5.4     1 GB  1 GB

Successfully moved gear with uuid '56be083a97846171ea0000a4' of app 'fakeapp' from 'ip-172-18-2-173' to 'ip-172-18-5-113'

All completed successfully: rhc app restart fakeapp, rhc app stop fakeapp, rhc app start fakeapp

@Miciah, what am I missing? :) Please review

tiwillia commented 8 years ago

This seems like it should work.

Let's [test]

The only worry I have is when remove_quota is called during the gear deletion process. At that point, there /should/ be no data on disk used by the gear, so it shouldn't matter there.

sallyom commented 8 years ago

@tiwillia please check this out, thanks. @a13m, is 1% an acceptable bump in destination quota limits for gear moves?

sallyom commented 8 years ago

[test] pleeeeeeeeese

ironcladlou commented 8 years ago

Should the inode quota also be temporarily increased?

tiwillia commented 8 years ago

+1 inodes probably also need to be checked and increased.

Currently the changes are permanent. These changes should probably be temporary, right?

sallyom commented 8 years ago

Increasing the inode limit will be no problem, but first I'll check with @a13m, because the bz doesn't specify a problem with the inode quota, only the blocks. The current workaround script increases the inode quota too, though, so I think I'll just go ahead and do it :)
Yea, it should be/will soon be temporary :)

a13m commented 8 years ago

1) Yes, both inode and block quota should be checked/bumped. The inode limit is not as common to hit, but it happens.
2) Instead of only bumping gears at 100% of quota, I think you need to have some tolerance -- if the gear is already at 99.99% of quota, you should bump it, because chances are it will hit quota during the move, just from log data.
3) For scaled applications, you need to check/bump quota for all gears, because stop/start runs on all gears, and the head gear has to rewrite haproxy.cfg as well.
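Point 2) amounts to a tolerance check along these lines (a sketch with hypothetical names, not code from this PR):

```ruby
# Bump when usage is within the tolerance band of the limit, not only
# at exactly 100%; an almost-full gear will likely cross the limit
# during the move just from log writes.
TOLERANCE = 0.99 # assumed 99% threshold

def needs_bump?(used, limit)
  limit > 0 && used >= limit * TOLERANCE
end

needs_bump?(1_048_471, 1_048_576) # => true  (99.99% full)
needs_bump?(500_000, 1_048_576)   # => false
```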

sallyom commented 8 years ago

@a13m and then reset to default limits after the move (that's what I'm working on now)? Or, would it be acceptable to keep the gears with the new slightly higher quota limits (possibly less than 1.01)?

sallyom commented 8 years ago

[test] @a13m I'm looking at the issue of bumping/checking quota for all gears in scaled apps. I notice that moving a non-primary gear that is over limit succeeds as long as the head gear is moved first. Looking into how best to implement your point 3) above. Other than that, I've implemented all feedback.

sallyom commented 8 years ago

_SAMPLE OUTPUT FROM A MOVE OF A PRIMARY GEAR THAT IS OVER QUOTA_

App UUID: 56cd9713cde40cbff900001a
Gear UUID: 56cd9713cde40cbff900001a
DEBUG: Source district uuid: 56cd74bfcde40c3b4f000001
DEBUG: Destination district uuid: 869826691282967182966784
WARNING: Quota exceeds 99% of target quota limits, now using buffer. quota_blocks max with buffer: 1053819, quota_files max with buffer: 80100
DEBUG: Getting existing app 'lastone' status before moving
DEBUG: Gear component 'php-5.4' was running
DEBUG: Stopping existing app cartridge 'php-5.4' before moving
DEBUG: Stopping existing app cartridge 'haproxy-1.4' before moving
DEBUG: Force stopping existing app before moving
DEBUG: Reserved uid '2352' on district: '869826691282967182966784'
DEBUG: Gear platform is 'linux'
DEBUG: Creating new account for gear '56cd9713cde40cbff900001a' on ip-172-18-12-97
DEBUG: Moving content for app 'lastone', gear '56cd9713cde40cbff900001a' to ip-172-18-12-97
Agent pid 23525
unset SSH_AUTH_SOCK; unset SSH_AGENT_PID; echo Agent pid 23525 killed;
DEBUG: Moving system components for app 'lastone', gear '56cd9713cde40cbff900001a' to ip-172-18-12-97
Agent pid 23580
unset SSH_AUTH_SOCK; unset SSH_AGENT_PID; echo Agent pid 23580 killed;
DEBUG: Starting cartridge 'haproxy-1.4' in 'lastone' after move on ip-172-18-12-97
DEBUG: Starting cartridge 'php-5.4' in 'lastone' after move on ip-172-18-12-97
DEBUG: Fixing DNS and mongo for gear '56cd9713cde40cbff900001a' after move
DEBUG: Changing server identity of '56cd9713cde40cbff900001a' from 'ip-172-18-10-186' to 'ip-172-18-12-97'
DEBUG: Deconfiguring old app 'lastone' on ip-172-18-10-186 after move
WARNING: Gear was moved successfully but quota limits for small node profile exceeded.

sallyom commented 8 years ago

@a13m @tiwillia for your review. Please see my comments here: https://github.com/sallyom/origin-server/blob/bzgearexceedquota/plugins/msg-broker/mcollective/lib/openshift/mcollective_application_container_proxy.rb#L1771
I will continue testing to be sure my comment holds true. Thanks

sallyom commented 8 years ago

@tiwillia @a13m
1) Increased the max to 1.2 times the original limit; each bump increments by 1% of blocks or 50 inodes.
2) For gear moves, quota is bumped when usage exceeds 95% of the limits; for update-cluster, when it exceeds 99%. The threshold is higher for update-cluster to minimize double-bumps.

Looking good? :)
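A rough sketch of that increment scheme (the step sizes come from the comment above; the helper names are illustrative assumptions, not this PR's code):

```ruby
# Cap at 1.2x the original limit, using integer arithmetic.
def cap(original)
  original * 12 / 10
end

# Each bump adds 1% of the original block limit, up to the cap.
def bump_blocks(current, original)
  [current + original / 100, cap(original)].min
end

# Each bump adds 50 inodes, up to the cap.
def bump_inodes(current, original)
  [current + 50, cap(original)].min
end

bump_blocks(1_048_576, 1_048_576) # => 1_059_061
bump_inodes(80_000, 80_000)       # => 80_050
```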

tiwillia commented 8 years ago

@sallyom this LGTM.

This looks ready to merge. Should we wait for @a13m 's blessing?

sallyom commented 8 years ago

@tiwillia @a13m
I made the buffer configurable. Also, if usage that was previously over-limit drops back under the default quota limits, the limits will be reset to default on the next move/update-cluster. Bumps are in increments of 1% of blocks or 10 inodes, with warnings during move and during update-cluster, up to a max of (default * configured multiplier)... hopefully it will pass all tests and be ready for merge now. : )
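The reset behavior described here could look roughly like this (hypothetical sketch, not the PR's code):

```ruby
# On the next move/update-cluster, restore the default limit if usage
# has dropped back under it; otherwise keep the current (bumped) limit.
def next_limit(used, current_limit, default_limit)
  used < default_limit ? default_limit : current_limit
end

next_limit(700_000, 1_059_062, 1_048_576)   # => 1_048_576 (reset to default)
next_limit(1_050_000, 1_059_062, 1_048_576) # => 1_059_062 (keep the bump)
```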

tiwillia commented 8 years ago

There is going to be a bit of confusion about having this configuration on both the broker and the node. With so many nodes in the environment, an administrator could set a node's configuration differently from what is set on the broker.

Would the only result of this be that the warning that comes from the node would be shown at incorrect quota usage?

sallyom commented 8 years ago

@tiwillia I made the style change you suggested, thanks! Also, I removed the node.conf change, it wasn't necessary.

tiwillia commented 8 years ago

Alrighty, I think this is ready to [merge]

Thanks for putting this together @sallyom !

openshift-bot commented 8 years ago

Evaluated for online test up to 0ed007ec81b7e5cde517c65ec1de59ec0cb5fe15

openshift-bot commented 8 years ago

Online Merge Results: SUCCESS (https://ci.dev.openshift.redhat.com/jenkins/job/merge_pull_requests/6719/) (Image: devenv_5779)

openshift-bot commented 8 years ago

Online Test Results: SUCCESS (https://ci.dev.openshift.redhat.com/jenkins/job/test_pull_requests/9239/)

tiwillia commented 8 years ago

Merge failure appears to be unrelated. Let's try this again and see if we get different results [merge]

openshift-bot commented 8 years ago

Evaluated for online merge up to 0ed007ec81b7e5cde517c65ec1de59ec0cb5fe15