turnkeylinux / tracker

TurnKey Linux Tracker
https://www.turnkeylinux.org

Cloud build - 25ec2-userdata and other "cloud" inithooks failing on some OpenStack infrastructure #1385

Open JedMeister opened 4 years ago

JedMeister commented 4 years ago

@deutrino - FWIW this is a consolidated TurnKey issue tracker. Please post all TurnKey related issues here.

Reposting what you posted in the forums, re the PostgreSQL appliance:

[...] it hung for a while on first boot. Error message:

[  802.762617] inithooks[336]: Traceback (most recent call last):
[  802.784061] inithooks[336]:   File "/usr/lib/inithooks/firstboot.d/25ec2-userdata", line 40, in <module>
[  802.791844] inithooks[336]:     main()
[  802.802152] inithooks[336]:   File "/usr/lib/inithooks/firstboot.d/25ec2-userdata", line 28, in main
[  802.818551] inithooks[336]:     userdata = ec2metadata.get('user-data')
[  802.828055] inithooks[336]:   File "/usr/lib/python2.7/dist-packages/ec2metadata/__init__.py", line 85, in get
[  802.838756] inithooks[336]:     m = EC2Metadata()
[  802.852056] inithooks[336]:   File "/usr/lib/python2.7/dist-packages/ec2metadata/__init__.py", line 32, in __init__
[  802.871975] inithooks[336]:     raise Error("could not establish connection to: %s" % self.addr)
[  802.893493] inithooks[336]: ec2metadata.Error: could not establish connection to: 169.254.169.254

I assume that this was an AWS server? The address 169.254.169.254 is a static IP which should be bonded to all AWS instances and is how you access instance metadata. In this particular case, it appears to have been unable to connect for some reason? Could be a race condition, or it could be some intermittent AWS fault. I recommend that you retry launching again, but this time if possible, in a different zone to the server that failed.

JedMeister commented 4 years ago

@deutrino - I just discovered your unanswered post on the forums (I thought I'd already responded, but obviously not).

I then noticed that I had opened this issue (wish I'd noticed it sooner as I could have saved myself a little time...).

Anyway, I'm wondering if you had any further info to add to this? Or whether it's "resolved itself"?

deutrino commented 4 years ago

Hi, this was not on Amazon so I'm not sure why it was contacting EC2. I only get to work on this project every couple weeks (though that will likely change soon) so it's tough to remember offhand but it was either in VMware player on my laptop, or on a VPS from a smallish provider.

JedMeister commented 4 years ago

Thanks for the additional info @deutrino.

Actually, I stand corrected, it's not just AWS. OpenStack also uses that same IP (for the same purpose).

So I'm guessing it may have been a third-party provider that is using OpenStack?!

As some additional info, the /usr/lib/inithooks/firstboot.d/25ec2-userdata inithook is added by buildtasks (the tool that we use to build the alternate build types from the ISOs that are built by default). It's specifically provided by the cloud patch - you can find the file itself in the overlay - here.

deutrino commented 4 years ago

I am experiencing this now on the Mattermost appliance as well. Same provider, this time I know it's running OpenStack because I just created a new instance before trying to install the TKL app. I uploaded the .qcow2 image file to my provider. So this is the same problem on a different Turnkey app, same small provider (not AWS or DigitalOcean).

Also, /usr/lib/inithooks/firstboot.d/25ec2-userdata takes a LONG time to time out on first boot.

Edit: it might be good to change the title of this issue to something a bit more accurate too.

deutrino commented 4 years ago

Looks like this is also happening in

/usr/lib/inithooks/firstboot.d/40ec2-sshkeys
/usr/lib/inithooks/everyboot.d/25ec2-userdata-idchange

and taking ~10 minutes to time out and fail in each case.

I'm not really familiar with OpenStack so I don't understand what's happening under the covers, but I do recall settings in the control panel having to do with user data & ssh keys. It would appear that when attempting to talk to my provider's OpenStack infrastructure, some scripts are failing and taking over 10 minutes to time out each time they do so!

As one would expect, the root password I set in my provider's control panel when deploying the instance is not applied. I'm not trying to init user data thru the control panel although the UI is there to do so. But I would assume that's not working either. I haven't investigated for further effects.

JedMeister commented 4 years ago

Yes, if you are using our OpenStack builds on OpenStack infrastructure which does not provide the OpenStack Metadata Service (i.e. provide metadata via that static bonded IP address) then this issue will occur on all of our appliances. As such, I'll change the subject of this issue and tag it 'openstack'.

Re getting it up and running, here are a few suggestions OTTOMH:

chmod -x /PATH/TO/ROOT/OF/QCOW2/usr/lib/inithooks/firstboot.d/25ec2-userdata

I doubt that the other inithook scripts included in our "cloud" builds will cause any issues, although they possibly don't bring any value either, so they could also be removed/disabled.

Unless you can confirm that this is a bug in the infrastructure you're using, it might be best if we check for that IP at the start of that inithook, so it will fail gracefully if not present. That won't happen until our next release though (which will be v16.0).
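As a rough illustration of that kind of early check, here is a minimal sketch. It assumes a plain TCP probe of port 80 with a short timeout is a reasonable stand-in for a real metadata request; the function name `metadata_service_reachable` and the 2 second default are hypothetical and not part of the existing inithook or the ec2metadata library.

```python
import socket

# Address taken from the traceback in this issue.
METADATA_ADDR = "169.254.169.254"


def metadata_service_reachable(addr=METADATA_ADDR, port=80, timeout=2.0):
    """Return True if a TCP connection to addr:port succeeds within timeout.

    Hypothetical helper: it only checks that something accepts connections,
    not that the service returns valid metadata.
    """
    try:
        conn = socket.create_connection((addr, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, or timed out: treat the service as absent.
        return False
    conn.close()
    return True
```

A hook could call this first and exit cleanly when it returns False, rather than waiting roughly ten minutes for the connection attempt to give up.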

JedMeister commented 4 years ago

Oops, sorry, I missed your most recent message when I replied.

Thanks for the updated info on those other scripts failing.

I suspect that your current provider uses cloud-init exclusively to provide SSH keys, etc. and isn't providing the metadata service which our inithooks require when run on infrastructure like that. FWIW whilst cloud-init is really common these days, strictly speaking it's not an OpenStack requirement. We don't include cloud-init and it does not play nice with our inithooks.

It's a known issue and one we'd love to address, but as we already have a ton of infrastructure that does rely on our inithooks, including cloud-init isn't quite so simple...

In the meantime, you may need to do something like pre-include your public key in the image before upload (as per my note on mounting the qcow2 and disabling/removing the problematic inithook). Those other troublesome inithooks may also need a tweak as per my suggestions above.
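For anyone scripting that, a minimal sketch of the `chmod -x` approach applied to all three hooks reported in this thread follows. It assumes the qcow2 is already mounted somewhere; the `image_root` argument and the function name `disable_hooks` are hypothetical.

```python
import os
import stat

# The hooks reported as failing earlier in this thread,
# relative to the root of the mounted image.
CLOUD_HOOKS = (
    "usr/lib/inithooks/firstboot.d/25ec2-userdata",
    "usr/lib/inithooks/firstboot.d/40ec2-sshkeys",
    "usr/lib/inithooks/everyboot.d/25ec2-userdata-idchange",
)


def disable_hooks(image_root, hooks=CLOUD_HOOKS):
    """Clear the executable bits (like chmod -x) on each hook that exists.

    Returns the list of paths whose mode was changed.
    """
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    changed = []
    for rel in hooks:
        path = os.path.join(image_root, rel)
        if not os.path.isfile(path):
            continue  # hook not present in this image; nothing to do
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode & exec_bits:
            os.chmod(path, mode & ~exec_bits)
            changed.append(path)
    return changed
```

Calling `disable_hooks("/PATH/TO/ROOT/OF/QCOW2")` would leave the files in place but non-executable, which makes the change easy to revert later.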

deutrino commented 4 years ago

Hey, so, here is what support had to say. I don't really know OpenStack so I'm just going to quote:

... the solution might be as simple as specifying the img_config_drive key with property mandatory for your image properties [link to docs]

This didn't work, so he played with the TurnkeyLinux image and came up with this:

Looks like the problem is a route is not being set to the meta-data IP. I can connect to it once I do ip route add 169.254.169.254 dev eth0. However, without cloud-init (which our cloud images have by default), I'm not sure how you'd do that at launch.

So it sounds like this might be a bug (or simply an oversight) in TKL? And if so, the answer to his question would be to update the build script in such a way that the route is set?
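One way to picture setting that route at launch is a small extra hook that runs the quoted `ip route add` command before the ec2 hooks. This is only a sketch: the helper names are hypothetical, `eth0` is simply the interface support mentioned, and it assumes such a hook would be ordered ahead of 25ec2-userdata and run as root.

```python
import subprocess

# Metadata address from the traceback in this issue.
METADATA_ADDR = "169.254.169.254"


def route_command(addr=METADATA_ADDR, dev="eth0"):
    """Build the 'ip route add' command quoted by the provider's support."""
    return ["ip", "route", "add", addr, "dev", dev]


def add_metadata_route(addr=METADATA_ADDR, dev="eth0"):
    """Run the command; return True on success, False otherwise.

    Needs root privileges and the iproute2 'ip' tool on the system.
    """
    try:
        return subprocess.call(route_command(addr, dev)) == 0
    except OSError:
        # 'ip' is missing or could not be executed.
        return False
```

Keeping the command construction separate from the call makes it easy to log or adjust the interface name on systems where it is not `eth0`.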

deutrino commented 4 years ago

I'm willing to build my own images and test, and I still probably have TKLDev installed and everything, but I've completely forgotten the process since I last played with it. If there's a specific file I can edit after cloning a given appliance (currently, for me, Mattermost and PostgreSQL), and then just feed that to TKLDev and produce a .qcow2, I can give it a shot once I read the docs again.

deutrino commented 4 years ago

I guess what I don't understand how to do is have TKLdev use a particular cloud patch which I've altered. It seems like fixing that patch would be the best way to fix this, yes?

JedMeister commented 4 years ago

Do you mean TKLDev? (TKLBAM is our built in backup tool).

Assuming so, log in via SSH, and:

cd buildtasks

Then remove (or disable) the relevant inithooks. All the ones you've noted are in the "cloud patch" overlay (i.e. within buildtasks, patches/cloud/overlay/usr/lib/inithooks/firstboot.d - note there is also one in patches/cloud/overlay/usr/lib/inithooks/everyboot.d too).

Alternatively, you could create your own custom buildtasks script by editing/copying the existing bt-openstack and removing the line that calls the "cloud" patch altogether. Although the conf script does enable eth1 which you may still want?

Once you're done there, run the OpenStack buildtask for the relevant appliance. You'll need to explicitly note the full TurnKey appliance name and version (FWIW we often refer to that as the "appver"). I.e. to build an OpenStack image of the latest for Postgres:

./bt-openstack postgresql-15.0-stretch-amd64

And Mattermost:

./bt-openstack mattermost-15.2-stretch-amd64

That should download the ISO from our mirror, pull it apart and apply the relevant patches (but not the files I note above - assuming you remove them; as I say disabling them should work too).
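To make the appver format concrete, here is a small sketch that splits one into its parts. It assumes the four-field pattern seen in the two examples above (name, version, codename, architecture); the function name `split_appver` is hypothetical and not part of buildtasks.

```python
def split_appver(appver):
    """Split an appver such as 'postgresql-15.0-stretch-amd64' into fields.

    Splits from the right so that an appliance name containing hyphens
    would stay intact. Raises ValueError if there are fewer than four fields.
    """
    name, version, codename, arch = appver.rsplit("-", 3)
    return {
        "name": name,
        "version": version,
        "codename": codename,
        "arch": arch,
    }
```

For example, `split_appver("mattermost-15.2-stretch-amd64")["version"]` gives `"15.2"`.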


Some additional notes:

You still won't have cloud-init, so you'll still need to access a root terminal for the first boot scripts somehow. Your previous posts suggest that you may still be able to get access, although perhaps that's just read only (or from the logs or something)? Regardless, assuming these are only for your personal usage, one easy way to allow easy root access via SSH would be to include your public SSH key to the root user on your builds. To do that, add it to the relevant file in one of the patch overlays. I'd suggest that you add it to the OpenStack one. E.g. assuming that I was adding my key (obviously don't actually add my key!) it'd go something like this (assuming still in the top level of buildtasks):

SSH_PUB_KEY="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCvBup9/16HsrT4a+WxkjNScbYCueCJVCkxHtLRnUEOYH7HTQfybmy4JSRkPiyS4ig7Jk9lAoFLiLtEfI4l93iQsr/gp3SAjKftxso5Wf/BtOVinkzue66tq+4EZcji+cezg3JfqKz2NlCnmcFh3/8ohTk+i8EM10fzaN41cYGL4/x/G0I2wA1bfPR4dZe2ybUIoHzfzYJzOqgv9knNIxXBACQyXiN801hNdKhq7Snk8kBrc+IxoLZIvqTYXIFeSDLSn5L9n8plLQnhWhEg72cUyoJGBHofy+RGpHcaKBnuEXtLrXpqzpg23X+L2zwry9a/GMeMS9u+sXB2PD5fym4N jeremy@turnkeylinux.org"

mkdir -p patches/openstack/overlay/root/.ssh
echo "$SSH_PUB_KEY" >> patches/openstack/overlay/root/.ssh/authorized_keys

Alternatively, you could try pre-installing cloud-init, although I can't guarantee that will work smoothly. Although if you try it, please let me know how it goes.


Also please note that we are midway through our transition to a new major release, v16.x (based on Debian 10/Buster). I jumped the gun a bit and have merged some v16.x stuff into some master branches.

If you're using an old (so long as it's v15.0) TKLDev server and/or are only building OpenStack builds (from ISOs from the mirror, as buildtasks does by default when you run any ./bt-... script other than ./bt-iso) you should be good. But if you are setting up a new TKLDev and building anything from scratch (i.e. building local ISO etc.) you'll need to check out the 15.x branch of common (and perhaps some other stuff?):

cd common
git fetch origin
git checkout --track origin/v15.x

Buildtasks itself will be fine (we won't be touching that until we at least have a v16.0RC of Core), so assuming that you are only building your OpenStack images, you should be fine too.

deutrino commented 4 years ago

This is so helpful, thanks. I do indeed have a TKLDev VM from a couple months ago. And there's VNC so I can log in directly. As long as removing those inithooks is no problem I'll stick to that for the time being. I did just use the ISO in one case (works fine) but it would be nice to have native OpenStack images that work with my provider until 16.x comes out.

OnGle commented 2 years ago

@JedMeister ping on this?

JedMeister commented 2 years ago

TBH, I'm not really sure... We really need someone with OpenStack to help. But not having a v16.x OpenStack build doesn't really help...