pelias / terraform-elasticsearch

Terraform scripts for running an Elasticsearch cluster

udev block device rule #26

Closed missinglink closed 4 days ago

missinglink commented 2 years ago

add udev rule to ensure block device symlinks exist for modern nvme EBS mappings

for a while now I've noticed intermittent startup failures on smaller elasticsearch machines. I was never able to put my finger on the issue, but I suspected a race between kernel tasks and cloud-init.

today I made some progress, mainly because upgrading to ubuntu focal seems to make it fail more consistently.

what I think is going on is that there is either a race between cloud-init and udev... or that udev isn't working properly due to python2.7 not being available on modern Ubuntu releases.

but backing up a second, what's the issue?

well, with some more modern AWS machines you request an EBS block device mapping of something like /dev/sdb but when you boot it's actually available as /dev/nvme2n1 or something similar 🤷‍♂️

it's kind of odd, but I believe that this is due to the 'Nitro' system using an NVMe driver for EBS volumes.

so to get around this glaring issue AWS encodes some 'vendor info' in the vendor-specific area of the NVMe controller identify data, which records the mapping you actually requested.

there is then a udev rule (the last line below) which is responsible for detecting all this and creating a symlink:

cat /etc/udev/rules.d/10-aws.rules
KERNEL=="xvd*", PROGRAM="/sbin/ec2udev-vbd %k", SYMLINK+="%c"
KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/sbin/ebsnvme-id -u /dev/%k", SYMLINK+="%c"

when this symlink isn't created or isn't created YET, then things break:

mke2fs 1.45.5 (07-Jan-2020)
The file /dev/sdb does not exist and no size was specified.
waiting for elasticsearch service to come up
..............................

Elasticsearch did not come up, check configuration
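
If the failure really is a race (the symlink is created, just not YET), one mitigation is to poll for the path before formatting. This is only a minimal sketch: wait_for_path is a hypothetical helper name, the 30 second default is an arbitrary assumption, and it does nothing for the case where the udev rule never fires at all.

```shell
# Hypothetical helper: poll until a path exists, or give up after a timeout.
# Usage: wait_for_path <path> [timeout_seconds]
wait_for_path() {
  local path="$1" timeout="${2:-30}" waited=0
  # -e follows symlinks, so a dangling /dev/sdb symlink still counts as missing
  while [ ! -e "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for ${path}" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Example (assumed call site, placed before the existing format step):
#   wait_for_path /dev/sdb 60 || exit 1
```

Failing with an explicit timeout message at this point is easier to diagnose than the mke2fs error followed by elasticsearch never coming up.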

the udev rule installed by default seems to be broken on modern Ubuntu: it runs /sbin/ebsnvme-id, which doesn't work because it requires python2.7, which isn't installed 😢

I tried installing python2.7 and triggering the rules and it still doesn't work so 🤷‍♂️

that's when I found this article https://opensource.creativecommons.org/blog/entries/2020-04-03-nvmee-on-debian-on-aws/ pointing me to https://github.com/oogali/ebs-automatic-nvme-mapping

orangejulius commented 2 years ago

Nice, yeah, the current system here is a bit brittle and complicated.

In some Geocode Earth infra we use a simpler method, which is basically to look at the symlinks in /dev/disk/by-id:

$ ls -lh /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root 13 Feb  8 15:28 nvme-Amazon_EC2_NVMe_Instance_Storage_AWS222D6E08AD542C2D4 -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Feb  8 15:28 nvme-Amazon_Elastic_Block_Store_vol0d00dbaf264800849 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb  8 15:28 nvme-Amazon_Elastic_Block_Store_vol0d00dbaf264800849-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 13 Feb  8 15:28 nvme-nvme.1d0f-4157533232324436453038414435343243324434-416d617a6f6e20454332204e564d6520496e7374616e63652053746f72616765-00000001 -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Feb  8 15:28 nvme-nvme.1d0f-766f6c3064303064626166323634383030383439-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb  8 15:28 nvme-nvme.1d0f-766f6c3064303064626166323634383030383439-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001-part1 -> ../../nvme0n1p1

That doesn't require any tools, seems to be populated instantly, and makes it quite clear which volumes are EBS, which are NVMe, etc. Maybe we simplify and use that?
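
As a rough sketch of that idea: the by-id symlink names embed the EBS volume ID with its hyphen removed, so a known volume ID can be resolved to a device without extra tooling. ebs_device_for_volume is a hypothetical helper name, and the optional second argument exists only so the lookup directory can be swapped out for testing.

```shell
# Hypothetical helper: resolve an EBS volume ID to its block device using
# the symlinks in /dev/disk/by-id.
# Usage: ebs_device_for_volume <volume-id> [by-id-dir]
ebs_device_for_volume() {
  local volume_id="$1" by_id_dir="${2:-/dev/disk/by-id}" link
  # symlink names carry the volume ID without the hyphen (vol-0abc -> vol0abc)
  link="${by_id_dir}/nvme-Amazon_Elastic_Block_Store_$(printf '%s' "$volume_id" | tr -d '-')"
  if [ ! -e "$link" ]; then
    echo "no by-id symlink found for ${volume_id}" >&2
    return 1
  fi
  # print the canonical device path the symlink points at
  readlink -f "$link"
}

# Example, using a volume ID from the listing above:
#   ebs_device_for_volume vol-0d00dbaf264800849
```

This only helps when the volume ID is known in advance; it does not recover the /dev/sdb-style name that was requested in the block device mapping.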

missinglink commented 2 years ago

I had a look at that and unfortunately it doesn't seem possible. The AWS docs say there's no guarantee that the NVMe device numbers (nvme0n1, nvme1n1, ...) correspond to the order the volumes were defined in, or to anything else really. The only consistent way seems to be to check the block device binary header, where the requested mapping path is encoded.

This is what that command looks like on a t4g.xlarge ARM instance:

ls -lh /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root 13 Feb  8 16:15 nvme-Amazon_Elastic_Block_Store_vol0a332d2f708ad23f8 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb  8 16:15 nvme-Amazon_Elastic_Block_Store_vol0a332d2f708ad23f8-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 16 Feb  8 16:15 nvme-Amazon_Elastic_Block_Store_vol0a332d2f708ad23f8-part15 -> ../../nvme0n1p15
lrwxrwxrwx 1 root root 13 Feb  8 16:15 nvme-Amazon_Elastic_Block_Store_vol0eeb851c145fc2b4d -> ../../nvme2n1
lrwxrwxrwx 1 root root 13 Feb  8 16:15 nvme-Amazon_Elastic_Block_Store_vol0fd2cdbaccc423f58 -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Feb  8 16:15 nvme-nvme.1d0f-766f6c3061333332643266373038616432336638-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb  8 16:15 nvme-nvme.1d0f-766f6c3061333332643266373038616432336638-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 16 Feb  8 16:15 nvme-nvme.1d0f-766f6c3061333332643266373038616432336638-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001-part15 -> ../../nvme0n1p15
lrwxrwxrwx 1 root root 13 Feb  8 16:15 nvme-nvme.1d0f-766f6c3065656238353163313435666332623464-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001 -> ../../nvme2n1
lrwxrwxrwx 1 root root 13 Feb  8 16:15 nvme-nvme.1d0f-766f6c3066643263646261636363343233663538-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001 -> ../../nvme1n1

the scripts in this repo create the symlinks below, and that mapping doesn't seem to be derivable from the by-id listing above:

lrwxrwxrwx  1 root root           7 Feb  8 16:15 sda1 -> nvme0n1
lrwxrwxrwx  1 root root           7 Feb  8 16:15 sdb -> nvme2n1
lrwxrwxrwx  1 root root           7 Feb  8 16:15 sdc -> nvme1n1

missinglink commented 2 years ago

The script we have, which selects the first available disk matching a pattern, would be susceptible to error: there are multiple matching disks, and there's no guarantee that head -n1 selects the correct one.
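
To illustrate the concern, a selection step can refuse to guess instead of taking whatever head -n1 happens to return. This is a sketch only; pick_single_device is a hypothetical name and not something that exists in this repo.

```shell
# Hypothetical helper: read candidate device paths on stdin (one per line)
# and print the candidate only when there is exactly one.
pick_single_device() {
  local candidates count
  # drop blank lines; grep exits non-zero on empty input, which is fine here
  candidates="$(grep -v '^$' || true)"
  if [ -z "$candidates" ]; then
    echo "no candidate devices found" >&2
    return 1
  fi
  count="$(printf '%s\n' "$candidates" | wc -l | tr -d ' ')"
  if [ "$count" -ne 1 ]; then
    echo "ambiguous: ${count} candidate devices, refusing to guess" >&2
    return 1
  fi
  printf '%s\n' "$candidates"
}

# Example: replace `... | head -n1` with `... | pick_single_device`
```

On the t4g.xlarge above, two non-root EBS volumes match the same pattern, so this would fail loudly rather than format the wrong disk.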

[image: Screenshot 2022-02-08 at 17 21 40]

missinglink commented 2 years ago

I've tested this on a t4g.xlarge running an AMI tagged dev-es7.16-arm and after a few iterations it's working great 🎉

Before we consider merging this we should replace the cURL commands I'm using to fetch the scripts from GitHub with actual files committed to this repo, for security reasons.

missinglink commented 2 years ago

for reference, this is what the binary-encoded header looks like (note: sdb is encoded in the first bytes of the vs[] area)

sudo nvme id-ctrl --vendor-specific /dev/nvme2n1
NVME Identify Controller:
vid     : 0x1d0f
ssvid   : 0x1d0f
sn      : vol0eeb851c145fc2b4d
mn      : Amazon Elastic Block Store
fr      : 1.0
rab     : 32
ieee    : a002dc
cmic    : 0
mdts    : 6
cntlid  : 0
ver     : 10000
rtd3r   : 0
rtd3e   : 0
oaes    : 0x100
ctratt  : 0
oacs    : 0
acl     : 4
aerl    : 0
frmw    : 0x3
lpa     : 0
elpe    : 63
npss    : 0
avscc   : 0x1
apsta   : 0
wctemp  : 343
cctemp  : 0
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
edstt   : 0
dsto    : 0
fwug    : 0
kas     : 0
hctma   : 0
mntmt   : 0
mxtmt   : 0
sanicap : 0
hmminds : 0
hmmaxd  : 0
sqes    : 0x66
cqes    : 0x44
maxcmd  : 0
nn      : 1
oncs    : 0
fuses   : 0
fna     : 0
vwc     : 0
awun    : 0
awupf   : 0
nvscc   : 0
acwu    : 0
sgls    : 0
subnqn  :
ioccsz  : 0
iorcsz  : 0
icdoff  : 0
ctrattr : 0
msdbd   : 0
ps    0 : mp:0.01W operational enlat:1000000 exlat:1000000 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
vs[]:
       0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0000: 73 64 62 20 20 20 20 20 20 20 20 20 20 20 20 20 "sdb............."
0010: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 "................"
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0250: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0260: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0270: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0290: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0310: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0320: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0350: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0360: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0370: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0390: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
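
The dump above suggests a way to read the requested name without python: the vendor-specific area is the last 1024 bytes of the 4096-byte identify structure, and in this dump its first 32 bytes hold the name padded with spaces. A minimal sketch follows, assuming that layout holds for every EBS volume; requested_device_name is a hypothetical helper name, and stripping a leading /dev/ is a guess that some volumes store the full path.

```shell
# Hypothetical helper: read a raw Identify Controller structure on stdin and
# print the device name stored at the start of the vendor-specific area.
requested_device_name() {
  local name
  # vs[] starts at byte offset 3072; take its first 32 bytes and drop the
  # space padding and any NUL bytes
  name="$(dd bs=1 skip=3072 count=32 2>/dev/null | tr -d ' \0')"
  # assumption: some volumes may store "/dev/sdb" rather than "sdb"
  printf '%s\n' "${name#/dev/}"
}

# Example (needs root and nvme-cli):
#   sudo nvme id-ctrl --raw-binary /dev/nvme2n1 | requested_device_name
```

For the device dumped above, the first three bytes 73 64 62 decode to sdb, which is the name a udev rule would then use as the symlink.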