Open sbernauer opened 5 months ago
Must: The docs say "Disable the Load Balancer before Decommissioning a node". We found a solution to this by either doing so or making sure we (or our customers) are not using LBs
Can we just use readiness probes to take the pod out of service?
There need to be at least two shutdown modes:
Findings (in progress):
hbase/bin/hbase-daemon.sh can start/stop/restart etc. and already handles termination signals better than our home-grown solution.
jstack is used to do a thread dump in case shutdown takes longer than 20 mins, but jstack is not in our images.
The graceful_stop.sh script requires the hostname or ssh commands, which are not currently available in the HBase images.
The ssh requirement can be avoided by passing localhost as the name of the region server, but the script still needs hostname to find out the actual region server name.
The script also uses hbase-daemon.sh, which writes PID files for every process.
During testing it was discovered that region servers already transfer regions when shutting down. This behavior is implemented in the 2.4 and 2.6 versions.
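The stop behaviour noted in the findings (ask the process to terminate, wait, and take a diagnostic dump if it takes too long) can be illustrated with a minimal Python sketch. This is not the logic of hbase-daemon.sh itself: `stop_with_timeout`, `spawn_child` and the `on_timeout` hook are hypothetical names, the child is a stand-in process rather than an HBase service, the timeouts are shortened for illustration, and POSIX signal semantics are assumed.

```python
import subprocess
import sys


def stop_with_timeout(proc, timeout_s, on_timeout):
    """Request a clean stop; escalate if the process outlives the timeout."""
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout_s)
        return "terminated"
    except subprocess.TimeoutExpired:
        # Hypothetical diagnostic hook, e.g. a place to trigger a thread dump.
        on_timeout(proc.pid)
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()
        return "killed"


def spawn_child(ignore_sigterm):
    """Start a stand-in for a long-running service (not an HBase process)."""
    code = "import signal, time\n"
    if ignore_sigterm:
        code += "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    code += "print('ready', flush=True)\ntime.sleep(60)\n"
    proc = subprocess.Popen(
        [sys.executable, "-c", code], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # block until the child has installed its handler
    proc.stdout.close()
    return proc
```

Calling `stop_with_timeout(spawn_child(False), 10, print)` should return "terminated" without invoking the hook, while a child that ignores SIGTERM is dumped and then killed once the timeout expires. Keeping the dump as a callback means the sketch does not depend on a tool such as jstack being present in the image.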
To clarify:
Another idea: since this is the default behavior anyway, maybe in cases like rolling cluster restarts, the user would benefit more from actually disabling the region mover altogether during that period.
This will be discussed next week
Relevant docs: https://hbase.apache.org/book.html#decommission
Relevant script: graceful_stop.sh
Relevant class: org.apache.hadoop.hbase.util.RegionMover, with relevant function
In https://github.com/stackabletech/hbase-operator/issues/400 we implemented a graceful shutdown for all HBase components which is similar to ./bin/hbase-daemon.sh stop <service>. While this works in general, it has downsides, such as regions being offline for some time, resulting in (short) outages.
Instead we should try to call or mimic graceful_stop.sh. The graceful_stop.sh script will move the regions off the decommissioned RegionServer one at a time to minimize region churn. It will verify the region deployed in the new location before it moves the next region, and so on until the decommissioned server is carrying zero regions. At this point, graceful_stop.sh tells the RegionServer to stop. The master will then notice the RegionServer is gone, but all regions will have already been redeployed and, because the RegionServer went down cleanly, there will be no WAL logs to split.
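To make the sequence described above concrete, here is a small in-memory Python sketch of "move one region, verify it, repeat until empty, then stop". It makes no HBase calls and is not how graceful_stop.sh or RegionMover are implemented: the `graceful_stop` and `least_loaded` functions, the dict-of-sets cluster model, and the least-loaded target policy are all assumptions chosen for illustration.

```python
def least_loaded(assignments, exclude):
    """Pick the server with the fewest regions (assumed policy), ties by name."""
    candidates = [s for s in assignments if s != exclude]
    return min(candidates, key=lambda s: (len(assignments[s]), s))


def graceful_stop(assignments, server, stop):
    """Drain `server` one region at a time, then call `stop(server)`.

    `assignments` maps server name -> set of region names.
    """
    moved = []
    for region in sorted(assignments[server]):
        target = least_loaded(assignments, exclude=server)
        assignments[server].remove(region)
        assignments[target].add(region)
        # Verify the region is deployed at its new location before moving on.
        if region not in assignments[target]:
            raise RuntimeError(f"{region} not deployed on {target}")
        moved.append((region, target))
    # Only stop the server once it carries zero regions.
    if assignments[server]:
        raise RuntimeError(f"{server} still carries regions")
    stop(server)
    return moved


cluster = {"rs1": {"r1", "r2", "r3"}, "rs2": {"r4"}, "rs3": set()}
stopped = []
moves = graceful_stop(cluster, "rs1", stopped.append)
print(moves)    # [('r1', 'rs3'), ('r2', 'rs2'), ('r3', 'rs3')]
print(stopped)  # ['rs1']
```

In a real cluster the verification step would have to query the actual region location, and the move itself can fail or time out, so the sketch only captures the ordering guarantee: the stop is issued strictly after the last verified move.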