purestorage / helm-charts

Pure Storage Helm Charts
Apache License 2.0

Pureflex unmounting root volume #97

Closed: shawhall closed this issue 5 years ago

shawhall commented 5 years ago

I'm having an issue with the pureflex driver removing all multipath devices for my root and /var volumes. These volumes are located on the array, but they should not be touched by the pureflex driver because they are not PVCs. When this happens the machine locks up, since it can no longer reach those volumes. Does anyone know how to stop it from disconnecting our boot and /var volumes?

Here is an excerpt from the log.

May 30 18:42:41 lpul-k8sprdwrk2 pureflex[1657]: time="2019-05-31T01:42:41Z" level=warning msg="Found multipath map that should NOT be attached, adding to list for cleanup" connection="{Name: LUN:0 Volume: Hgroup:}" dev="DeviceInfo{BlockDev: dm-0, LUN: , Serial: , WWID: 3624A9370D5BC07893D61B51B00255378}" shouldBeConnected=false
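
A quick way to see which multipath maps sit under the system filesystems is a check like the one below. This is only a rough sketch assuming standard util-linux (findmnt, lsblk) on the node; nothing here is PSO-specific, and the device names in the comments are examples.

#!/usr/bin/env python3
# Diagnostic sketch (not part of PSO): for each system mountpoint, print the backing
# device and its parent devices, which is where the dm-multipath map shows up.
import subprocess

def run(*cmd):
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

for mountpoint in ("/", "/var"):
    source = run("findmnt", "-no", "SOURCE", mountpoint)     # e.g. /dev/mapper/mpatha2
    # lsblk -s prints the inverse dependency tree, i.e. the multipath map (and the
    # underlying sdX paths) beneath the filesystem device.
    parents = run("lsblk", "-s", "-no", "NAME,TYPE,SIZE", source)
    print(f"{mountpoint} -> {source}")
    print(parents)
    print()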

taherv commented 5 years ago

The WWID in your log line (3624A9370D5BC07893D61B51B00255378) belongs to a Pure FlashArray LUN, so it would make sense for PSO to try to clean it up.

Are your root and /var volumes managed by multipath?
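
For reference, FlashArray volumes show up to dm-multipath with WWIDs built from the Pure OUI, so a prefix check like the sketch below can tell whether a given map belongs to the array. The "3624a9370" prefix is inferred from the serial format in the log excerpt above, so treat it as an assumption rather than a guarantee.

# Sketch: FlashArray volume WWIDs, as seen by dm-multipath, normally start with
# "3624a9370" (NAA registered-extended designator + Pure Storage's OUI 24a937).
def looks_like_flasharray_wwid(wwid: str) -> bool:
    return wwid.lower().startswith("3624a9370")

print(looks_like_flasharray_wwid("3624A9370D5BC07893D61B51B00255378"))  # True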

shawhall commented 5 years ago

Yes, it is a Pure FlashArray LUN, but it's not a Docker volume. It's the boot volume for the whole system. If PSO cleans it up, that will be a problem; it should only be cleaning up the Docker volumes.

root and /var are both multipath FC LUNs on the Pure array.
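
To enumerate every multipath map the node has from the array (boot, /var, and PSO-provisioned volumes alike), a sysfs walk like the one below works. It is only a sketch and assumes the standard dm-multipath sysfs layout on Linux.

# Hypothetical inventory script (not a Pure tool): list every device-mapper multipath
# map with its WWID, so the boot and /var LUNs can be compared against the maps PSO
# reports as "should NOT be attached".
import glob, os

for uuid_path in sorted(glob.glob("/sys/block/dm-*/dm/uuid")):
    dm = uuid_path.split("/")[3]                  # e.g. "dm-0"
    with open(uuid_path) as f:
        uuid = f.read().strip()                   # e.g. "mpath-3624a9370..."
    if not uuid.startswith("mpath-"):
        continue                                  # skip LVM and other dm targets
    with open(os.path.join("/sys/block", dm, "dm", "name")) as f:
        name = f.read().strip()
    print(f"{dm}\t{name}\t{uuid[len('mpath-'):]}")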

shawhall commented 5 years ago

Here is a diagram of our architecture (attachment: pure_k8s), if that helps.

sdodsley commented 5 years ago

@shawhall can you detail the steps to reproduce this event? Can you also give the output from docker info?

briapi commented 5 years ago

@sdodsley We opened a ticket on this issue with support; it is case number 00580464, which should have all the environment info and how it was triggered.

sdodsley commented 5 years ago

@briapi I understand, but since you have opened an issue here, the details will help people who don't have access to the Pure internal support system, like when we are travelling and can't get into the VPN...

briapi commented 5 years ago

Sure, here is the docker info output:

Containers: 219
 Running: 156
 Paused: 0
 Stopped: 63
Images: 84
Server Version: 18.09.2
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 09c8266bf2fcf9519a651b04ae54c967b9ab86ec
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-957.10.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 56
Total Memory: 503.6GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

Looking back in the logs, PSO appears to have always been trying to delete the system multipath devices when provisioning PVCs, but it was never successful until we upgraded from 2.1.2 to 2.5.1. After that it would succeed in removing the root and other system multipath devices that were also on the Pure array, which then crashed the system. It would do this on startup, or when a cluster event triggered provisioning.
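
The safeguard we would expect here is roughly the check sketched below: never detach a map that still backs a mounted filesystem. This is only an illustration of the idea, not the actual change that went into any PSO release.

# Illustrative sketch only (not the real PSO code): refuse to clean up a dm device
# if it, or anything stacked on top of it, is currently mounted.
import subprocess

def device_is_in_use(dm_name: str) -> bool:
    # lsblk lists mountpoints for the device and everything layered on it
    out = subprocess.run(["lsblk", "-no", "MOUNTPOINT", f"/dev/{dm_name}"],
                         capture_output=True, text=True)
    return any(line.strip() for line in out.stdout.splitlines())

def safe_to_cleanup(dm_name: str) -> bool:
    return not device_is_in_use(dm_name)

print(safe_to_cleanup("dm-0"))   # dm-0 is the root map from the log excerpt above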

pure-garyyang commented 5 years ago

We have a fix for this issue in the new release, 2.5.2. Could you please try it out? https://github.com/purestorage/helm-charts/releases/tag/2.5.2

shawhall commented 5 years ago

Thanks Gary. We will try it out and report back.

dinathom commented 5 years ago

Should we close this issue since it was fixed in 2.5.2? There have been no updates in a month.

shawhall commented 5 years ago

We tested and it's working fine. Thanks for getting this fixed.