status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Upgrade storage for mainnet fleet #184

Closed: jakubgs closed this 6 months ago

jakubgs commented 6 months ago

It's about time we increase the storage available for both Docker containers (Geth) and systemd services (Beacon Nodes):

jakubgs@linux-01.ih-eu-mda1.nimbus.mainnet:~ % df -h / /docker /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       366G   62G  286G  18% /
/dev/sdc        1.5T  1.2T  174G  88% /docker
/dev/sdb        1.5T  1.2T  259G  82% /data

The current layout uses a single logical volume per physical volume (SSD), configured in the controller.

The migration to RAID0 logical volumes spanning two SSDs using the HPE Smart Array utility is documented here: https://docs.infra.status.im/general/hp_smart_array_raid.html

The steps for the migration of each host will look like this:

  1. Request attachment of a temporary 1.5 TB (or bigger) SSD on the host for migrations.
  2. Migrate /data files to the temporary migration SSD (a rough command sketch follows this list).
  3. Destroy the /data logical volume and re-create it with two physical volumes (SSDs) as one RAID0 logical volume.
  4. Migrate from the temporary SSD back to the new RAID0 /data volume.
  5. Repeat steps 2, 3, & 4 for the /docker volume.
  6. Inform support they can move the migration SSD to another host, and repeat for that host.
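
A rough sketch of what steps 2-4 might look like on a single host, assuming /data sits on logical drive 2, the temporary SSD is mounted at /mnt/migration, and the two target drives are 1I:2:1 and 1I:2:2 (all placeholders, the real drive IDs, logical drive numbers and paths differ per host):

# stop the services writing to /data first
sudo systemctl stop beacon-node-mainnet-*
# copy everything to the temporary migration SSD
sudo rsync -aP /data/ /mnt/migration/data/
# drop the old single-drive logical volume and re-create it as RAID0 over two SSDs
sudo ssacli ctrl slot=0 ld 2 delete
sudo ssacli ctrl slot=0 create type=ld raid=0 drives=1I:2:1,1I:2:2
# filesystem and mount re-creation is handled by the bootstrap playbook, then copy back
sudo rsync -aP /mnt/migration/data/ /data/
sudo systemctl start beacon-node-mainnet-*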

I would recommend creating a single support ticket to order 2 extra SSDs of the same type for all nimbus.mainnet hosts, and then managing the migration of each host in the comments of that ticket.

jakubgs commented 6 months ago

You can find more examples of me using ssacli to configure volumes here:

Just need to click the Load more... button.


yakimant commented 6 months ago

Ticket Created #351756

yakimant commented 6 months ago

We are able to connect today only 9 SSD disks of 1.6TB capacity (4 servers). If you need all with 1.6TB capacity, then it will be possible to connect the remaining in the next 2 weeks or if you want, we may connect a 3.84TB drive on each of the remaining servers.

The cost for the single additional disk is 20 euro per 1.6TB SSD drive.

Asked about the price of 4TB drives, and whether they have 3TB ones.

yakimant commented 6 months ago

A 4TB drive will cost twice that (i.e. ~40 EUR), so we will go ahead with those.

Pros:

Cons:

yakimant commented 6 months ago

ssacli installation:

echo "deb http://downloads.linux.hpe.com/SDR/repo/mcp jammy/current non-free" | sudo tee /etc/apt/sources.list.d/hp-mcp.list
wget -qO- http://downloads.linux.hpe.com/SDR/hpPublicKey1024.pub | sudo tee -a /etc/apt/trusted.gpg.d/hp-mcp.asc
wget -qO- http://downloads.linux.hpe.com/SDR/hpPublicKey2048.pub | sudo tee -a /etc/apt/trusted.gpg.d/hp-mcp.asc
wget -qO- http://downloads.linux.hpe.com/SDR/hpPublicKey2048_key1.pub | sudo tee -a /etc/apt/trusted.gpg.d/hp-mcp.asc
wget -qO- http://downloads.linux.hpe.com/SDR/hpePublicKey2048_key1.pub | sudo tee -a /etc/apt/trusted.gpg.d/hp-mcp.asc
sudo apt update
sudo apt install ssacli
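
Not in the original comment, but a quick sanity check that the install worked is to list the controllers the tool can see:

sudo ssacli ctrl all show status
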
yakimant commented 6 months ago

Disks are installed:

❯ ansible nimbus-mainnet-metal -i ansible/inventory/test -a 'sudo ssacli ctrl slot=0 pd allunassigned show'
linux-06.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>

Smart Array P420i in Slot 0 (Embedded)

   Unassigned

      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS SSD, 3.8 TB, OK)
linux-07.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>

Smart Array P420i in Slot 0 (Embedded)

   Unassigned

      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS SSD, 3.8 TB, OK)
linux-02.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>

Smart Array P420i in Slot 0 (Embedded)

   Unassigned

      physicaldrive 2I:2:6 (port 2I:box 2:bay 6, SAS SSD, 3.8 TB, OK)
linux-04.ih-eu-mda1.nimbus.mainnet | FAILED | rc=1 >>

Error: The controller identified by "slot=0" was not detected.non-zero return code
linux-01.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>

Smart Array P420i in Slot 0 (Embedded)

   Unassigned

      physicaldrive 2I:2:6 (port 2I:box 2:bay 6, SAS SSD, 3.8 TB, OK)
linux-03.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>

Smart Array P420i in Slot 0 (Embedded)

   Unassigned

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS SSD, 3.8 TB, OK)
linux-05.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>

Smart Array P420i in Slot 0 (Embedded)

   Unassigned

      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS SSD, 3.8 TB, OK)

linux-04 has a different slot:

❯ sudo ssacli ctrl slot=1 pd allunassigned show

Smart Array P222 in Slot 1

   Unassigned

      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS SSD, 3.8 TB, OK)
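
A generic way to find which slot a controller occupies (standard ssacli usage, not part of the original output) is to list all controllers:

sudo ssacli ctrl all show
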
yakimant commented 6 months ago

IH had an issue with the disks. They fixed it on linux-01 and I was able to set them up.

It was done with approximately these actions:

sudo ssacli ctrl slot=0 pd all show status
sudo ssacli ctrl slot=0 create type=ld drives=DRIVE  # (change here, e.g. 2I:2:6)
sudo ssacli ctrl slot=0 ld all show status
[local] ansible-playbook -i ansible/inventory/test ansible/bootstrap.yml --limit=linux-01.ih-eu-mda1.nimbus.mainnet -Dv -t role::bootstrap:volumes
docker-compose -f docker-compose.exporter.yml -f docker-compose.yml stop
sudo systemctl stop syslog
sudo rsync -Pa /mnt/sdc/geth-mainnet /mnt/sdd/geth-mainnet
[ansible/group_vars/ih-eu-mda1.yml] change /docker
[local] ansible-playbook -i ansible/inventory/test ansible/bootstrap.yml --limit=linux-01.ih-eu-mda1.nimbus.mainnet -Dv -t role::bootstrap:volumes
docker-compose -f docker-compose.exporter.yml -f docker-compose.yml start
[grafana] check geth graphs - syncing
sudo systemctl stop beacon-node-mainnet-*
sudo rsync -Pa sdb/beacon-node-mainnet-* sdb/era sdd/sdb/
sudo ssacli ctrl slot=0 ld all show status
sudo ssacli ctrl slot=0 ld 2 delete
sudo ssacli ctrl slot=0 ld 3 delete
sudo ssacli ctrl slot=0 pd all show status
sudo ssacli ctrl slot=0 create type=ld drives=DRIVE1,DRIVE2 raid=0 # (eg 1I:2:1,1I:2:2)
[local] ansible-playbook -i ansible/inventory/test ansible/bootstrap.yml --limit=linux-01.ih-eu-mda1.nimbus.mainnet -Dv -t role::bootstrap:volumes
sudo systemctl start beacon-node-mainnet-*
[grafana] check nimbus graphs - syncing

Need to fine-tune a bit:

yakimant commented 6 months ago

BTW, if docs or support refer to a different tool - here is the timeline:

  1. hpacucli (versions 9.10 - 9.40, 2012 - 2014, probably earlier too)
  2. hpssacli (versions 2.0 - 2.40, 2014 - 2016)
  3. ssacli (versions 3.10 - 6.30, 2017 - now)
yakimant commented 6 months ago

Done, disks are set up with these commands:

[local] ansible linux-05.ih-eu-mda1.nimbus.mainnet,linux-06.ih-eu-mda1.nimbus.mainnet,linux-07.ih-eu-mda1.nimbus.mainnet -a 'sudo systemctl stop consul'
sudo ssacli ctrl slot=0 pd all show status; sudo ssacli ctrl slot=0 ld all show status
sudo ssacli ctrl slot=0 create type=ld drives=DRIVE  # (change here, e.g. 2I:2:6)
sudo ssacli ctrl slot=0 ld all show status
[local] ansible-playbook -i ansible/inventory/test ansible/bootstrap.yml --limit=HOSTNAME -Dv -t role::bootstrap:volumes
docker-compose -f /docker/geth-mainnet/docker-compose.exporter.yml -f /docker/geth-mainnet/docker-compose.yml stop
sudo systemctl stop syslog beacon-node-mainnet-*
sudo rsync --stats -hPa --info=progress2,name0 /docker/geth-mainnet /docker/log /data/beacon* /data/era /mnt/sdd/
sudo umount /mnt/sdb /data /docker /mnt/sdc
sudo ssacli ctrl slot=0 ld all show status
sudo ssacli ctrl slot=0 ld 2 delete
sudo ssacli ctrl slot=0 ld 3 delete
sudo ssacli ctrl slot=0 pd all show status
sudo ssacli ctrl slot=0 create type=ld raid=0 drives=DRIVE1,DRIVE2  # (eg 1I:2:1,1I:2:2)
[local] ansible-playbook -i ansible/inventory/test ansible/bootstrap.yml --limit=HOSTNAME -Dv -t role::bootstrap:volumes
sudo rsync --stats -hPa --info=progress2,name0 /docker/beacon* /docker/era /data/
docker-compose -f /docker/geth-mainnet/docker-compose.exporter.yml -f /docker/geth-mainnet/docker-compose.yml start
sudo systemctl start beacon-node-mainnet-* syslog
[grafana] check geth graphs - syncing
[grafana] check nimbus graphs - syncing
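
As a final sanity check (a hypothetical addition, not part of the original command list), the resulting RAID0 logical volumes and the free space they provide could be inspected with:

sudo ssacli ctrl slot=0 ld all show status
df -h /data /docker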

Something is missing before the second ansible-playbook run, causing issues:

I didn't research it much and just put the commands above.

One finding - linux-02.ih-eu-mda1.nimbus.mainnet has some weird disk attached, which the other hosts do not:

❯ lsblk /dev/sdd

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sdd      8:48   0  256M  1 disk
`-sdd1   8:49   0  251M  1 part

❯ sudo fdisk -l /dev/sdd

Disk /dev/sdd: 256 MiB, 268435456 bytes, 524288 sectors
Disk model: LUN 00 Media 0
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000046

Device     Boot Start    End Sectors  Size Id Type
/dev/sdd1          63 514079  514017  251M  c W95 FAT32 (LBA)

❯ ls -l /dev/disk/by-id /dev/disk/by-path/ | grep sdd
lrwxrwxrwx 1 root root  9 Feb 22 19:05 usb-HP_iLO_LUN_00_Media_0_000002660A01-0:0 -> ../../sdd
lrwxrwxrwx 1 root root 10 Feb 22 19:05 usb-HP_iLO_LUN_00_Media_0_000002660A01-0:0-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  9 Feb 22 19:05 pci-0000:00:1d.0-usb-0:1.3.1:1.0-scsi-0:0:0:0 -> ../../sdd
lrwxrwxrwx 1 root root 10 Feb 22 19:05 pci-0000:00:1d.0-usb-0:1.3.1:1.0-scsi-0:0:0:0-part1 -> ../../sdd1

Looks like it's some 256 MB USB drive.

ChatGPT says it can have something to do with HP iLO (Integrated Lights-Out) LUN (Logical Unit Number). I have no idea what that is.

jakubgs commented 5 months ago

Thanks for getting this done.