lizzieinvancouver / temporalvar


Prepare for the end of Regal #44

Closed lizzieinvancouver closed 5 years ago

lizzieinvancouver commented 5 years ago

From an 11 Feb 2019 email:

NEW SCRATCH FILESYSTEM: The new scratch filesystem to replace Regal is now online at /n/scratchlfs. Regal will remain usable until March 4th, after which it will be set read-only until May 4th, when it will be decommissioned. Please move any data you care about as soon as you are able.

lizzieinvancouver commented 5 years ago

...Regal is being taken down starting at 7am on 6 May 2019! We should move our data (using mv or cp, I think) to scratchlfs.
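For reference, a minimal sketch of what the copy might look like from a login node. The lab/project directory names under /n/regal and /n/scratchlfs are placeholders rather than our actual paths, and it uses cp rather than mv so the originals stay on Regal until we have verified the copy:

```bash
# Copy (rather than move) the project directory from Regal to the new scratch filesystem.
# "ourlab/temporalvar" is a placeholder; substitute the real directory names before running.
SRC=/n/regal/ourlab/temporalvar
DEST=/n/scratchlfs/ourlab/temporalvar

mkdir -p "$DEST"
cp -a "$SRC"/. "$DEST"/    # -a preserves permissions and timestamps

# Quick sanity check that the sizes roughly match before removing anything from Regal.
du -sh "$SRC" "$DEST"
```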

I believe scratchlfs is the replacement for Regal, so we should run jobs there. (The other options are scratch or lab storage according to this link, but with scratch "data is only accessible from the node itself so you cannot directly retrieve it after calculations are finished," so I don't think we want that.)
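A rough sketch of what a submission script pointed at the new filesystem might look like; the partition, resource requests, directory, and the runModel.R script name are all placeholders rather than our actual setup:

```bash
#!/bin/bash
#SBATCH -J temporalvar        # job name (placeholder)
#SBATCH -p shared             # partition; substitute whichever queue we actually use
#SBATCH -t 0-04:00            # time limit (placeholder)
#SBATCH --mem=4000            # memory in MB (placeholder)
#SBATCH -o temporalvar_%j.out # job log in the submission directory

# Work out of the new scratch filesystem instead of Regal.
cd /n/scratchlfs/ourlab/temporalvar || exit 1

# Placeholder for the actual analysis command (e.g. an R script from this repo).
Rscript runModel.R
```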

Also, I'm not so sure about the new system they have set up!

As most of you are aware, we had a significant failure on our new scratch file system (scratchlfs), which necessitated taking the SLURM scheduler offline for several hours. As this was a significant and unplanned outage, we wanted to follow up with you all and provide more information on what happened, what has been and is being done to address this, and some background on our early-stage reliance on the vendor of this bleeding-edge filesystem.

On Friday at around 2:30pm, one of the two redundant metadata servers (MDS) for scratchlfs failed, causing the backup to take over. However, mds02 then began to kernel panic and hang. The first MDS server, mds01, had by then restarted and so took back over. When mds01 again hung, this began a looping back-and-forth failure. We attempted firmware updates across the systems, but with no improvement. By 4pm we had contacted the vendor (DDN) for assistance, as none of the standard recovery methods were working.

The Tier 1 DDN support wrote back asking for logs but did not escalate the problem to engineering as they should have, given the severity. By this time we had stopped SLURM to try to lower the load on scratchlfs and regain control over it. Our Director, Scott Yockel, was simultaneously attempting to get our issue escalated through other channels. Around 10:30pm Tier 1 support let us know that they were escalating us to engineering, and by 11:30pm we had them online with our engineers. After some probing and testing, it was found that we had hit a previously unknown bug in Lustre, the filesystem scratchlfs uses, which would induce these looping kernel panics.

All of this was exacerbated by a failing Infiniband cable on mds01, which contributed to the need to fail over, so mds01 was taken offline during troubleshooting and we relied solely on mds02 until the cable could be replaced that (Saturday) morning. During the replacement we identified that the fibre cable connecting mds01 and mds02 to the backend storage target (MDT) also needed to be replaced, as one of the cables had been crimped during the vendor's installation. With the primary bad cable replaced, we were able to bring mds01 back online and, after some configuration changes and recovery, scratchlfs was back online and the cluster was re-opened around 4pm. The bug will remain present until Lustre is updated, but a bug report has been filed and a workaround from the Lustre filesystem group (Whamcloud) is in place, which will allow the system to function despite it.

Our scratch filesystem is obviously an integral part of the cluster, and we are taking these concerns and this outage very seriously. We are working with the vendor to address our dissatisfaction with their response time (this is our first partnership with DDN) and with the spate of smaller issues with the filesystem since installation. They are also working with us to address the mishandling of scratch during installation, which has resulted in DDN having to replace several fibre cables that were damaged but went undiscovered until this latest failure. We are looking at all options surrounding scratchlfs and leaning hard on the vendor to step up their support efforts.

We appreciate your understanding to that end. If you have any questions, please don’t hesitate to contact us at: rchelp@rc.fas.harvard.edu

We would also like to give special thanks to Paul Edmon and Luis Silva who both worked through the night and into the early hours on this, and to Mike Ethier for getting out to Holyoke early Saturday morning to replace the failing Infiniband cable.

lizzieinvancouver commented 5 years ago

Megan moved everything!