lavabit / robox

The tools needed to robotically create/configure/provision a large number of operating systems, for a variety of hypervisors, using packer.

generic/ubuntu1804 default locale is Latin-1 #78

Closed mgedmin closed 5 years ago

mgedmin commented 5 years ago

I've discovered that /etc/default/locale on generic/ubuntu1804 contains

LANG="en_US"
LANGUAGES="en_US:"

This causes problems, e.g. I cannot create PostgreSQL databases using UTF-8 because the system locale uses Latin-1.

(Some of the problems are masked when you use vagrant ssh because SSH copies LANG and LC_* from your host system.)

I believe stock Ubuntu defaults to UTF-8 locales and I think the generic boxes should too.

ladar commented 5 years ago

Seems reasonable. I think the relevant line is:

d-i debian-installer/locale string en_US

Which should change to what?

d-i debian-installer/locale string en_US.UTF-8

Which is what I see on 18.04. Then the question becomes which Debian/Ubuntu versions should this change be made to? I'd hate to apply the update to every Ubuntu/Debian config, only to have a bunch of the boxes fail during the next run.
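A minimal sketch of how that one-line change could be applied to several preseed files at once, assuming GNU sed and that the files contain the exact line quoted above. The function name set_preseed_locale_utf8 is hypothetical, and the http/generic*.cfg path is the one mentioned later in this thread. Files without the line are left untouched.

```bash
# Hypothetical helper: append ".UTF-8" to the preseed locale line in each file.
# Assumes GNU sed (-i without a backup suffix argument).
set_preseed_locale_utf8() {
  for cfg in "$@"; do
    # Only a bare "en_US" at the end of the line matches,
    # so running this a second time changes nothing.
    sed -i 's|^\(d-i debian-installer/locale string en_US\)[[:space:]]*$|\1.UTF-8|' "$cfg"
  done
}

# Example, run from the repository root:
#   set_preseed_locale_utf8 http/generic*.cfg
```

Reviewing the result with git diff before starting any builds would show which configs were actually touched.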

mgedmin commented 5 years ago

Ubuntu has used UTF-8 by default nearly since the very beginning: https://wiki.ubuntu.com/UTFEightByDefault mentions this being a release goal for Hoary Hedgehog aka Ubuntu 5.04, released in 2005, the second ever Ubuntu release.

ladar commented 5 years ago

I know UTF-8 support has been around for a while. What I'm asking is whether the change above results in setting the collation to en_US, but the character set to UTF-8, and thus fixes your issue.

And if this is the fix, will all of the different installers accept it, given that it appears on the surface to use a different format? Specifically, it goes from COLLATION to COLLATION.CHARSET with a period between them.

I can't make the change without testing first. Any chance you can run a test?

mgedmin commented 5 years ago

The workaround I applied in the meantime was rewriting /etc/default/locale in my Vagrant provisioning scripts to contain LANG="en_US.UTF-8". For Ubuntu's PostgreSQL specifically this needs to be done before apt installing the postgresql-server package (because that package's install script creates a cluster configuration that remembers the system locale used during creation, for I don't know what reason -- I suppose the on-disk data structures assume a particular collation order or something).
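The workaround described above might look roughly like this in a shell provisioner. This is a sketch, not the provisioning script from the comment: the function name write_default_locale is hypothetical, and the package name in the usage comment is an assumption.

```bash
# Hypothetical helper: overwrite a locale file with a single LANG setting.
write_default_locale() {
  locale_file=$1
  locale_name=$2
  printf 'LANG="%s"\n' "$locale_name" > "$locale_file"
}

# In a provisioner running as root, the ordering matters:
#   write_default_locale /etc/default/locale en_US.UTF-8
#   apt-get install -y postgresql   # package name assumed; install only after the locale is fixed
```

The locale file is written first because, as noted above, the cluster created at package install time remembers the locale that was in effect.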

I'm not entirely sure I understand what you mean by COLLATION -- the strings passed to debian-installer/locale are glibc locale names. The format for them (documented in the setlocale manual page) is language[_TERRITORY][.charset][@modifier] with the territory, charset, and modifier parts being optional. You can run locale -a to see a list of locale names supported by the system. (The charset part is passed through a normalization step, so that both UTF-8 and utf8 mean the same thing. Modifiers were used by things like adding the Euro symbol into an otherwise Latin-1 locale in the bad old days when 8-bit charsets reigned supreme.)
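To make the language[_TERRITORY][.charset][@modifier] format concrete, here is a small sketch that splits a locale name into its parts using shell parameter expansion. Both function names are hypothetical, and normalize_charset is only a simplified approximation of the normalization step (lowercase, drop non-alphanumeric characters), not glibc's actual code.

```bash
# Hypothetical helper: split a locale name into its four components.
parse_locale_name() {
  name=$1
  modifier='' charset='' territory=''
  case $name in *@*) modifier=${name#*@}; name=${name%%@*} ;; esac
  case $name in *.*) charset=${name#*.}; name=${name%%.*} ;; esac
  case $name in *_*) territory=${name#*_}; name=${name%%_*} ;; esac
  printf 'language=%s territory=%s charset=%s modifier=%s\n' \
    "$name" "$territory" "$charset" "$modifier"
}

# Simplified approximation of charset normalization: "UTF-8" and "utf8" compare equal.
normalize_charset() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]\n'
}

# parse_locale_name en_US.UTF-8   prints: language=en territory=US charset=UTF-8 modifier=
# parse_locale_name en_US         prints: language=en territory=US charset= modifier=
# normalize_charset UTF-8         prints: utf8
```

The bare en_US name has an empty charset field, which is why the system falls back to the locale's default 8-bit encoding.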

So, I'm reasonably certain that changing en_US to en_US.UTF-8 would work for all ubuntu and debian boxes, but I agree that it would be irresponsible to push new images without testing. I'd appreciate some help with that: I read the README, ran res/providers/packer.sh, and it failed for me:

# github.com/hashicorp/packer/common/net
../../hashicorp/packer/common/net/configure_port.go:61:2: undefined: net.ListenConfig
error: pathspec 'except_post_processor_tests' did not match any file(s) known to git
res/providers/packer.sh: line 50: scripts/build.sh: No such file or directory

mgedmin commented 5 years ago

The failure confuses me because ~/go/src/github.com/hashicorp/packer/scripts/build.sh most definitely exists. It fails, though:

$ cd ~/go/src/github.com/hashicorp/packer 
$ export GOPATH=$HOME/go/
$ PATH=$GOPATH/bin:$PATH
$ XC_ARCH=amd64 XC_OS="windows darwin linux" scripts/build.sh
3 errors occurred:
--> linux/amd64 error: exit status 2
Stderr: # github.com/hashicorp/packer/common/net
common/net/configure_port.go:61:2: undefined: net.ListenConfig

--> darwin/amd64 error: exit status 2
Stderr: # github.com/hashicorp/packer/common/net
common/net/configure_port.go:61:2: undefined: net.ListenConfig

--> windows/amd64 error: exit status 2
Stderr: # github.com/hashicorp/packer/common/net
common/net/configure_port.go:61:2: undefined: net.ListenConfig

==> Copying binaries for this platform...
find: ‘./pkg/linux_amd64’: No such file or directory

==> Results:
total 0

ladar commented 5 years ago

Looks like an issue with the packer build script. If I were to guess, it's because you're missing a required golang dependency or have an incompatible version of go. You could open an issue with packer if you're interested.

That said, at the moment, you shouldn't need any patches to build the Robox config files, so the current released version 1.4.2 should work just fine. Just grab it here.

If you want to verify things before kicking it off, you can run ./robox.sh validate which will validate that the packer version accepts the current JSON. You can also run ./robox.sh links to make sure all the ISO URLs are still good. In this case you can ignore any non-Debian/non-Ubuntu problems.

Once you have packer set up, you can edit the auto install files in the http directory and run ./robox.sh box generic-BOX-PROVIDER ... for a specific config. To build all the boxes for the Debian and Ubuntu configs, you can run the command:

./robox.sh box generic-debian8-libvirt,generic-debian9-libvirt,generic-debian10-libvirt,generic-ubuntu1604-libvirt,generic-ubuntu1610-libvirt,generic-ubuntu1704-libvirt,generic-ubuntu1710-libvirt,generic-ubuntu1804-libvirt,generic-ubuntu1810-libvirt,generic-ubuntu1904-libvirt

Which will build the libvirt version of all the relevant boxes. For this type of change, one provider should be sufficient, so feel free to swap libvirt for virtualbox or vmware as appropriate. Note you cannot run libvirt and virtualbox at the same time.

The command above will build the boxes one at a time to avoid having any fail due to network, CPU, or disk load. Depending on your system (notebook vs. desktop vs. server, SSDs, RAM, etc.) you can probably break up the list of boxes and run jobs in parallel. My 4 year old workstation class Thinkpad (W540) can handle 2-4 builds at the same time. A desktop/server with SSDs and gigabit networking could probably handle significantly more.

P.S. I don't use Postgres much, but for my MariaDB/MySQL config scripts, I usually generate/install my own /etc/my.cnf file (path changes as appropriate), and you can set the default collation/character set via that file.

As for collation, it refers to how the system does character comparisons, aka upper case vs. lower case vs. straight binary comparisons. And it becomes important when you switch to UTF-8. I believe the locale dictates how the OS handles this, but it's far more important to get right in your database.

ladar commented 5 years ago

@mgedmin I should also mention, that with MariaDB/MySQL you can also set a default collation/charset for a database schema and/or table, which is what I do for magma ... like so.

mgedmin commented 5 years ago

FWIW

$ ./robox.sh links
...
Link Failure:  https://mirror.leaseweb.net/ubuntu-cdimage/releases/18.10/release/ubuntu-18.10-server-amd64.iso

because Ubuntu 18.10 is End of Life, I suppose. (There are also alpine and a couple of FreeBSD failures.)

./robox.sh box fails for me because

Build 'generic-ubuntu1904-libvirt' errored: Failed creating Qemu driver: exec: "/usr/libexec/qemu-kvm": stat /usr/libexec/qemu-kvm: no such file or directory

My laptop has Ubuntu 19.04, which does not have /usr/libexec/qemu-kvm. There is a /usr/bin/kvm. libvirt itself works fine -- I've switched to generic/ubuntuNNNN boxes just so that I could use vagrant-libvirt instead of virtualbox.

Should I edit generic-libvirt.json and try again? (Trick question: I tried it already, and now I'm looking at the script downloading debian ISO files.)

mgedmin commented 5 years ago

Ha ha I forgot to actually change the locale strings in http/generic*.cfg before kicking off the box builds. Restarting.

What sort of tests would you like me to perform on the built .box files?

mgedmin commented 5 years ago

Ouch, the box names must be separated with commas, not spaces, so I cannot do things like ./robox.sh box generic-debian{8,9,10}-libvirt :(
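One way around this without changing robox.sh is to let the shell expand the braces and then join the results with commas. A minimal sketch, assuming bash for the brace expansion; the helper name join_by_comma is hypothetical.

```bash
# Hypothetical helper: join all arguments with commas.
join_by_comma() {
  local IFS=,
  # "$*" joins the positional parameters using the first character of IFS.
  printf '%s\n' "$*"
}

# Brace expansion produces separate words, which the helper then joins:
#   ./robox.sh box "$(join_by_comma generic-debian{8,9,10}-libvirt)"
# join_by_comma a b c   prints: a,b,c
```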

mgedmin commented 5 years ago

Why does the ./robox.sh script exit with a non-zero status code after successfully building a couple of boxes?

What are the differences between output/generic-*.box and the corresponding output/roboxes-*.box?

ladar commented 5 years ago

Lots of questions. Lots of answers, mostly.

I updated the 18.10 URL yesterday. Or at least I thought I did. I just realized my find/replace missed the URL in the generic-virtualbox.json file.

I'll need to keep an eye out, as the cosmic packages will disappear from the mirrors soon, and that will require a similar tweak to the installer file config. I'm actually thinking about adding a check for that to the ./robox.sh links function. It will be a pain though, because I won't be able to generate the list of URLs to check dynamically.

As of this second, all of the URLs are working. But if you keep helping out, know that the Gentoo URL breaks often (sometimes daily, as it gets rebuilt automatically), and the Arch URL changes monthly. If you need to find/update those URLs, ./robox.sh isos will supply the correct URL/SHA values. At some point I might figure out how to use jq to update the JSON, but for now it's a manual find/replace.

As for the qemu-kvm question, that is the path to qemu-kvm on CentOS. Naturally it's different on Ubuntu. Just update the path in the JSON file to match the location on your system.
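To find out which path applies on a given host before editing the JSON by hand, something like the following could be used. This is a sketch: the function name find_qemu_binary is hypothetical, and the two candidate paths are simply the ones mentioned in this thread.

```bash
# Hypothetical helper: print the first candidate path that is an executable file.
find_qemu_binary() {
  for candidate in "$@"; do
    if [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1   # none of the candidates exist on this host
}

# Example, listing the CentOS path first and the Ubuntu path second:
#   find_qemu_binary /usr/libexec/qemu-kvm /usr/bin/kvm
```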

As for tests, I don't have a good answer. The biggest test is whether or not it goes through the install and all the config modules without throwing any (unexpected) errors. Having an auto-install config hang is a pretty big deal, as packer will wait 1 to 4 hours (depending on platform/box) before it times out and moves on to the next box. So if the installer doesn't like the value for all of the Ubuntu boxes, that adds (at a minimum) 86 hours to the build. Having a config script fail later on is also bad, because it means I'll need to retry that box config, and if it fails a second time, build it locally and troubleshoot. A painful process.

Naturally making sure the resulting box file will also work with vagrant up && vagrant ssh is a critical test. That's at a minimum. Beyond those things, I don't have any good answers, as I mostly use latin1, so I can't suggest any additional test cases you can do with latin1 vs utf8. Naturally seeing if it fixes your Postgres issue is worth checking, but that is an isolated issue. I actually have a set of test scripts that I use to pull down the boxes and do the above, plus run a few commands, but the process is primitive, and I haven't been able to work on it since I started my trip in May. The internet is too slow to pull down all the box files and test. My Jenkins server is virtualized, which makes testing the images difficult. I'm hoping to have access to a Jenkins server with physical nodes soon, so I can test more regularly. My test cases right now are:

  # (testcase vagrant upload Vagrantfile) &>> $1-$2-$3.txt; error $1 $2 $3
  # # (testcase vagrant upload .vagrant) &>> $1-$2-$3.txt; error $1 $2 $3
  # (testcase vagrant ssh -- exit 0) &>> $1-$2-$3.txt; error $1 $2 $3
  # (testcase vagrant ssh -- echo "\$SHELL" | grep -q bash) &>> $1-$2-$3.txt; error $1 $2 $3
  # # (testcase vagrant ssh --command "if [ ! -f Vagrantfile ] || [ ! -d .vagrant ]; then exit 1; fi") &>> $1-$2-$3.txt; error $1 $2 $3
  # (testcase vagrant ssh -- "if [ ! -f Vagrantfile ]; then exit 1; fi") &>> $1-$2-$3.txt; error $1 $2 $3
  # (testcase vagrant ssh -- "ping -c 4 lavabit.com") &>> $1-$2-$3.txt; error $1 $2 $3
  # (testcase vagrant ssh -- "curl --silent --user-agent \"Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\" --output /dev/null --url https://lavabit.com") &>> $1-$2-$3.txt; error $1 $2 $3
  # (testcase vagrant ssh -- "sudo -- touch /test.dat && sudo touch /etc/test.dat && sudo -- bash -c 'echo TestOption no >> /etc/ssh/sshd_config'") &>> $1-$2-$3.txt; error $1 $2 $3
  # (testcase vagrant ssh -- "(which grep && which curl && which cat && which date && which ping && which awk && which sed && which ssh && which man && which ps && which vim) > /dev/null; exit \$?") &>> $1-$2-$3.txt; error $1 $2 $3

The arguments are ORG BOX PROVIDER.
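For the locale change specifically, a check along these lines could sit next to the test cases above. It is only a sketch: default_locale_is_utf8 is a hypothetical name and is not part of the existing test scripts. It reads the file rather than the session environment because, as noted earlier in the thread, SSH copies LANG and LC_* from the host.

```bash
# Hypothetical check: succeed only if the LANG line in the given file names a UTF-8 charset.
default_locale_is_utf8() {
  grep -Eiq '^LANG="?[^"]*\.utf-?8"?[[:space:]]*$' "$1"
}

# Inside a running box this could be invoked in the same style as the cases above, e.g.:
#   vagrant ssh -- "grep -q 'UTF-8' /etc/default/locale"
```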

The robox.sh script started as a simple tool to config the environment variables and run packer against the growing number of JSON files. It's grown into a massive bash script since then, and I have several things I'd like to add. Namely integrating my standalone upload/release/verify/add scripts as functions in robox.sh, along with the box testing logic I currently have in a separate project. I'll also add logic that will auto-retry failed builds when running one of the meta functions (generic/magma/lineage or vmware/libvirt/parallels/hyperv/docker/virtualbox or some combination of the two) at some point. But I tend to only add features when doing things manually gets painful enough, and I manage to find the time.

Globbing support right now feels like more effort than it's worth. But with that said, the commas are only needed to easily parse the list into a bash array. If you're so inclined you could update the box() function to support both methods and submit a pull request.
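As a rough illustration of what supporting both methods could mean, the sketch below accepts box names separated by commas, spaces, or separate arguments and prints one name per line. The function name split_box_list is hypothetical; this is not the actual box() code, which is not shown in this thread.

```bash
# Hypothetical helper: accept names separated by commas and/or spaces,
# across any number of arguments, and print one name per line.
split_box_list() {
  printf '%s\n' "$@" | tr ', ' '\n\n' | sed '/^$/d'
}

# In bash (4 or later) the result can be loaded into an array:
#   mapfile -t BOXES < <(split_box_list "$@")
# split_box_list a,b "c d"   prints a, b, c and d on separate lines
```

With this shape, both ./robox.sh box a,b and ./robox.sh box a b would yield the same array.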

Personally, I like to run ./robox.sh generic-virtualbox (or ./robox.sh generic or ./robox.sh all) to build machines, so I only need the box target to rebuild failures. What I do is run ./robox.sh missing | grep "Box - " which gives me a list I can massage via a text editor.

I'm not sure why robox.sh isn't returning a proper status code. I think I fixed this issue once upon a time, but ended up switching it back because I think it broke something else. I don't recall precisely what, but I believe the issue is with how bash forks or doesn't fork, but you're right, it should return the right status code.

If you look at the box() function you'll see that it basically runs the box list against all the different JSON files. The issue could be that it's running the config against a JSON file with no matching box names, and that attempt happens last, so the error code is what bubbles up. Just a guess though.

As for generic vs robox they are virtually identical. When I started building these images ~3 years ago, they were all called "generic" ... which is why the generic images have more downloads. When I finally put everything on GitHub (I had to sanitize my initially internal repo before I could make it public), I needed a unique moniker, because I couldn't call the repo packer anymore. I came up with Robot Boxes or roboxes for short. Eventually I started releasing the images under both names. On other platforms, namely Docker Hub, the images are only in the roboxes namespace (I couldn't register generic).

Initially packer built each config twice, which got very painful. I finally spent a few hours and figured out how I could write out the same artifact under two different names, and that is what we have today.

mgedmin commented 5 years ago

Thank you for all the answers!

Here are the changes I've tested: https://github.com/mgedmin/robox/commits/mg. Three commits:

I tested box builds for all the boxes I touched, using the libvirt backend. All builds completed successfully. (Except the one where I connected a vnc viewer to the printed vnc:// URL out of curiosity and discovered that the build scripts also want a vnc connection to navigate the boot menus, and, well, when you connect a new vnc client, the old one gets kicked out. The build succeeded when I retried it.)

I haven't tested whether vagrant up works yet because I don't remember how to do that when you have a box file on disk rather than a name to fetch from Vagrant Cloud. (I will look it up.)

I think I see where the exit code comes from -- the last command in the box() function is

[[ condition ]] && do something

and when condition is false, well, that leaves a stale exit code. I would suggest adding a return 0 at the very end of the box() function but yeah, if you want the exit status to indicate whether there were any errors, that'd be a more involved process.
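The effect is easy to reproduce in isolation. In this sketch both function names and the failed variable are made up for illustration; only the shape of the last line follows the pattern quoted above.

```bash
# A function ending in "[[ ... ]] && ..." returns the status of the failed test.
box_stale_status() {
  local failed=0
  [[ $failed -gt 0 ]] && echo "some boxes failed"
}

# Same body, but with an explicit return so the caller always sees 0.
box_explicit_status() {
  local failed=0
  [[ $failed -gt 0 ]] && echo "some boxes failed"
  return 0
}

# box_stale_status;    echo $?   prints: 1
# box_explicit_status; echo $?   prints: 0
```

Rewriting the last line as an if statement would also return 0 when the condition is false, since an if with no branch taken has a zero exit status.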

ladar commented 5 years ago

Can you submit a PR with just the locale change? The KVM change doesn't apply to CentOS, and the URL update is already done.

> I haven't tested whether vagrant up works yet because I don't remember how to do that when you have a box file on disk rather than a name to fetch from Vagrant Cloud. (I will look it up.)

vagrant box add PATH

Then vagrant init NAME && vagrant up && vagrant ssh and it should use the local file and not the cloud version. You might need to import it with a higher version number. If it's older than the cloud it might ask whether you want to auto update.
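Spelled out, the sequence could look like the sketch below. The helper only prints the commands so they can be reviewed first; its name, the box name local/ubuntu1804, and the box file path in the usage comment are all hypothetical. The --name and --provider options are standard Vagrant flags.

```bash
# Hypothetical helper: print the commands for trying a locally built box.
# Arguments: a local box name, the path to the .box file, and the provider.
local_box_commands() {
  box_name=$1
  box_file=$2
  provider=$3
  printf 'vagrant box add --name %s %s\n' "$box_name" "$box_file"
  printf 'vagrant init %s\n' "$box_name"
  printf 'vagrant up --provider=%s\n' "$provider"
  printf 'vagrant ssh\n'
}

# Example (names and path are placeholders; run in an empty scratch directory):
#   local_box_commands local/ubuntu1804 /path/to/built.box libvirt
```

Using a name that does not exist on Vagrant Cloud avoids the version comparison and update prompt mentioned above.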

> and when condition is false, well, that leaves a stale exit code. I would suggest adding a return 0 at the very end of the box() function but yeah, if you want the exit status to indicate whether there were any errors, that'd be a more involved process.

Done. I suppose always getting 0 is better than a useless error code.