saltstack / salt


lxc.init fails to bootstrap container #23772

Closed: cheuschober closed this issue 6 years ago

cheuschober commented 9 years ago

Happy to be proven wrong here, but from my best understanding, calls to lxc.init will always fail as long as it attempts to remove the seed marker prior to starting the container.

remove_seed_marker is set to True at: https://github.com/saltstack/salt/blob/develop/salt/modules/lxc.py#L1197

The bit that looks like it could use some reordering is: https://github.com/saltstack/salt/blob/develop/salt/modules/lxc.py#L1230-L1259

Steps to reproduce:

$ salt-call lxc.init testbuntu template=ubuntu options="{'release':'trusty', 'arch': 'amd64'}"

First failure is in attachable()

[ERROR   ] Command 'lxc-attach --clear-env -n testbuntu -- /usr/bin/env' failed with return code: 1

Versions report:

$ salt --versions-report
                  Salt: 2015.5.0
                Python: 2.7.6 (default, Mar 22 2014, 22:59:56)
                Jinja2: 2.7.2
              M2Crypto: 0.21.1
        msgpack-python: 0.3.0
          msgpack-pure: Not Installed
              pycrypto: 2.6.1
               libnacl: Not Installed
                PyYAML: 3.10
                 ioflo: Not Installed
                 PyZMQ: 14.0.1
                  RAET: Not Installed
                   ZMQ: 4.0.4
                  Mako: 0.9.1
 Debian source package: 2015.5.0+ds-1trusty1

kiorky commented 9 years ago

You are totally wrong:

kiorky commented 9 years ago

(The seeding is meant to be used from the lxc runner) ...

cheuschober commented 9 years ago

@kiorky, Ok. I get that -- what about the steps to reproduce? If the error is misidentified, that makes sense -- still can't init a container.

kiorky commented 9 years ago

See the runner part of http://docs.saltstack.com/en/latest/topics/tutorials/lxc.html#initializing-a-new-container-as-a-salt-minion

kiorky commented 9 years ago

I think you may also have been misled by false-positive debug output.

For example, the one you cite comes from here:
https://github.com/saltstack/salt/blob/develop/salt/modules/lxc.py#L2853

This tests whether the container is attachable yet; it is not a failure, even if it writes something to your logs.

cheuschober commented 9 years ago

@kiorky Still failed...

lxc.sls

lxc.container_profile:
  ubuntu:
    template: ubuntu
    options:
      release: trusty
      arch: amd64

cheuschober commented 9 years ago

The runner in the above scenario can't even find the profile in pillar. I noticed, however, that the documentation diverges in how it names the profiles between the examples at:

http://docs.saltstack.com/en/latest/topics/tutorials/lxc.html#creating-a-container-on-the-cli

and

http://docs.saltstack.com/en/latest/topics/tutorials/lxc.html#tutorial-lxc-profiles-container

cheuschober commented 9 years ago

Minion error log below:

2015-05-15 17:57:59,315 [salt.loaded.int.module.cmdmod][ERROR   ][28404] Command 'lxc-attach --clear-env -n testbuntu-01 -- /usr/bin/env' failed with return code: 1
2015-05-15 17:57:59,316 [salt.loaded.int.module.cmdmod][ERROR   ][28404] output: lxc-attach: attach.c: lxc_attach: 635 failed to get the init pid
2015-05-15 17:58:00,148 [salt.loaded.int.module.cmdmod][ERROR   ][28404] Command 'lxc-attach --clear-env  --set-var PATH=/bin:/usr/bin:/sbin:/usr/sbin:/opt/bin:/usr/local/bin:/usr/local/sbin -n testbuntu-01 -- command -v salt-minion' failed with return code: 255
2015-05-15 17:58:00,149 [salt.loaded.int.module.cmdmod][ERROR   ][28404] stderr: lxc_container: attach.c: lxc_attach_run_command: 1094 No such file or directory - failed to exec 'command'
2015-05-15 17:58:00,149 [salt.loaded.int.module.cmdmod][ERROR   ][28404] retcode: 255
2015-05-15 17:58:00,170 [salt.loaded.int.module.cmdmod][ERROR   ][28404] Command "lxc-attach --clear-env  --set-var PATH=/bin:/usr/bin:/sbin:/usr/sbin:/opt/bin:/usr/local/bin:/usr/local/sbin -n testbuntu-01 -- test -e '/lxc.initial_seed'" failed with return code: 1
2015-05-15 17:58:00,170 [salt.loaded.int.module.cmdmod][ERROR   ][28404] retcode: 1

cheuschober commented 9 years ago

@kiorky If the attachable() issue is easily ignored, then should I assume the command -v call is really the problem here?

I did a manual attach and ran that command and it executed flawlessly but it won't run as above.

kiorky commented 9 years ago

just paste somewhere a full execution log if you are not able to interpret it.

I doubt there is a bug, i use it daily (salt/develop branch) but we also made sure with @terminalmage to have something stabilized in stable/2015.*.

kiorky commented 9 years ago

run with -lall (loglevel: all)

kiorky commented 9 years ago

I personally call the exec mod lxc.init from the runner lxc.cloud_init function. Let me check (that's not how I use it personally; I have another internal interface to set up settings, which I feed the function with).

kiorky commented 9 years ago

removed comment: useless

kiorky commented 9 years ago

Forget what I said, I'm a bit tired :). You can use both, but the correct scheme is lxc.container_profile.

terminalmage commented 9 years ago

Yep, lxc.container_profile is checked first, and if not found then lxc.profile is checked.
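
As a rough illustration of the lookup order described above, here is a minimal sketch. It assumes the profiles are available as a plain nested dict; `lookup_container_profile` is a hypothetical helper for illustration, not Salt's actual implementation.

```python
def lookup_container_profile(config, name):
    """Return the named profile, trying the newer key before the legacy one.

    `config` is assumed to be a plain dict of merged config/pillar data;
    this helper is hypothetical and only mirrors the order described above.
    """
    for key in ("lxc.container_profile", "lxc.profile"):
        profiles = config.get(key) or {}
        if name in profiles:
            return profiles[name]
    # Neither key defines the profile.
    return {}


config = {
    "lxc.profile": {"ubuntu": {"template": "legacy"}},
    "lxc.container_profile": {"ubuntu": {"template": "ubuntu"}},
}
profile = lookup_container_profile(config, "ubuntu")
```

With both keys present, the entry under `lxc.container_profile` is the one returned; the `lxc.profile` entry is only consulted when the first lookup finds nothing.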

cheuschober commented 9 years ago

@kiorky, Thank you for the documentation update. The profile has remained unchanged.

I called it as the following (from an Ubuntu 14.04 instance):

$ sudo salt 'salt' lxc.get_container_profile ubuntu
salt:
    ----------
    options:
        ----------
        arch:
            amd64
        release:
            trusty
    template:
        ubuntu

Running:

$ sudo salt-run lxc.init -lall --out yaml --out-file /vagrant/run-log.txt testbuntu-01 host=salt profile=ubuntu

Output

event:
  message:
    comment: ''
    done: []
    errors:
      salt:
        testbuntu-01:
        - changes:
            init:
            - create: Container created
            - config: Container configuration updated
            - state:
                new: running
                old: null
          comment: Bootstrap failed, see minion log for more information
          name: testbuntu-01
          result: false
    ping_status: false
    result: false
    salt: []
suffix: progress
comment: ''
done: []
errors:
  salt:
    testbuntu-01:
    - changes:
        init:
        - create: Container created
        - config: Container configuration updated
        - state:
            new: running
            old: null
      comment: Bootstrap failed, see minion log for more information
      name: testbuntu-01
      result: false
ping_status: false
result: false
salt: []

Relevant minion log:

2015-05-15 19:00:41,351 [salt.loaded.int.module.cmdmod][ERROR   ][796] Command 'lxc-attach --clear-env -n testbuntu-01 -- /usr/bin/env' failed with return code: 1
2015-05-15 19:00:41,352 [salt.loaded.int.module.cmdmod][ERROR   ][796] output: lxc-attach: attach.c: lxc_attach: 635 failed to get the init pid
2015-05-15 19:00:42,174 [salt.loaded.int.module.cmdmod][ERROR   ][796] Command 'lxc-attach --clear-env  --set-var PATH=/bin:/usr/bin:/sbin:/usr/sbin:/opt/bin:/usr/local/bin:/usr/local/sbin -n testbuntu-01 -- command -v salt-minion' failed with return code: 255
2015-05-15 19:00:42,174 [salt.loaded.int.module.cmdmod][ERROR   ][796] stderr: lxc_container: attach.c: lxc_attach_run_command: 1094 No such file or directory - failed to exec 'command'
2015-05-15 19:00:42,175 [salt.loaded.int.module.cmdmod][ERROR   ][796] retcode: 255
2015-05-15 19:00:42,198 [salt.loaded.int.module.cmdmod][ERROR   ][796] Command "lxc-attach --clear-env  --set-var PATH=/bin:/usr/bin:/sbin:/usr/sbin:/opt/bin:/usr/local/bin:/usr/local/sbin -n testbuntu-01 -- test -e '/lxc.initial_seed'" failed with return code: 1
2015-05-15 19:00:42,198 [salt.loaded.int.module.cmdmod][ERROR   ][796] retcode: 1

terminalmage commented 9 years ago

For the pillar data, keep in mind that it is only compiled in certain cases (minion start, when state.highstate or saltutil.refresh_pillar is called, maybe a couple others I am missing). You must refresh manually with saltutil.refresh_pillar if you make changes to your pillar if the next thing you do is an lxc.init which needs that pillar data.

Regarding your error, lxc-attach: attach.c: lxc_attach: 635 failed to get the init pid is the error you would get if you tried to attach to a container that is not running. However, the runner says that the container is running. Perhaps it is not yet running when lxc.init moves on to perform the bootstrap, though I can't see how that would happen since I believe the lxc.start function (which is used to start up the container) blocks until the container is up. Try adding a bootstrap_delay=5 to your lxc.init command and see if that changes anything.

cheuschober commented 9 years ago

@terminalmage

Ok, so I just found this little note in the documentation:

Warning

Many shell builtins do not work, failing with stderr similar to the following:

lxc_container: No such file or directory - failed to exec 'command'

The same error will be displayed in stderr if the command being run does not exist. If the retcode is nonzero and not what was expected, try using lxc.run_stderr or lxc.run_all.

http://docs.saltstack.com/en/latest/ref/modules/all/salt.modules.lxc.html#salt.modules.lxc.retcode

That is, however, what appears to be called at:

https://github.com/saltstack/salt/blob/develop/salt/modules/lxc.py#L2644

Bug?
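
The warning quoted above is consistent with the log line "failed to exec 'command'": the command line is executed directly rather than through a shell, so a shell builtin has no program to run unless the system happens to ship a standalone one. A minimal sketch of one general workaround, handing the whole string to `sh -c`, is below. The helper name is hypothetical, it runs on the local host rather than inside a container, and it is not necessarily how Salt addresses the problem.

```python
import subprocess


def wrap_in_shell(command):
    """Build an argv that lets a POSIX shell resolve builtins like `command`."""
    return ["sh", "-c", command]


# Executing the argv ["command", "-v", "sh"] directly depends on a standalone
# `command` binary existing on the system. Wrapped in `sh -c`, the shell
# handles the builtin itself.
result = subprocess.run(
    wrap_in_shell("command -v sh"),
    capture_output=True,
    text=True,
)
```

Here `result.returncode` is 0 and `result.stdout` holds the resolved path of `sh` on a host with a POSIX shell.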

cheuschober commented 9 years ago

@terminalmage, I tried it with the delay but had the same result.

cheuschober commented 9 years ago

@terminalmage, I still think the first retcode 1 was from:

https://github.com/saltstack/salt/blob/develop/salt/modules/lxc.py#L1230-L1259

I did a little log peppering before I first reported this and when run in the above scenario, it hit the run statement at L1231 before it hit the start at L1259. Since run() (and _run()) call attachable(), the first error appears to be related to that. @kiorky has corrected me to note that it does have fallback to chroot so it appears to be relatively benign (though it does complicate debugging a bit to have any spurious ERROR messages like that). Why the rest of it fails is a bit confusing.

terminalmage commented 9 years ago

@cheuschober It does look like command -v does not work in lxc-attach, I've replaced it with which.

# salt saltmine lxc.run ubuntu1404 'command -v salt-minion'
saltmine:
    lxc-attach: attach.c: lxc_attach_run_command: 1107 No such file or directory - failed to exec 'command'
# salt saltmine lxc.run ubuntu1404 'which salt-minion'
saltmine:
    /usr/bin/salt-minion

terminalmage commented 9 years ago

Pull request for that change is https://github.com/saltstack/salt/pull/23782

cheuschober commented 9 years ago

@terminalmage, Thanks Erik.

Still can't build a container successfully, though:

2015-05-15 19:46:04,493 [salt.loaded.int.module.cmdmod][ERROR   ][4312] Command 'lxc-attach --clear-env -n testbuntu-01 -- /usr/bin/env' failed with return code: 1
2015-05-15 19:46:04,494 [salt.loaded.int.module.cmdmod][ERROR   ][4312] output: lxc-attach: attach.c: lxc_attach: 635 failed to get the init pid
2015-05-15 19:46:09,820 [salt.loaded.int.module.cmdmod][ERROR   ][4312] Command 'lxc-attach --clear-env  --set-var PATH=/bin:/usr/bin:/sbin:/usr/sbin:/opt/bin:/usr/local/bin:/usr/local/sbin -n testbuntu-01 -- which salt-minion' failed with return code: 1
2015-05-15 19:46:09,821 [salt.loaded.int.module.cmdmod][ERROR   ][4312] retcode: 1
2015-05-15 19:46:09,842 [salt.loaded.int.module.cmdmod][ERROR   ][4312] Command "lxc-attach --clear-env  --set-var PATH=/bin:/usr/bin:/sbin:/usr/sbin:/opt/bin:/usr/local/bin:/usr/local/sbin -n testbuntu-01 -- test -e '/lxc.initial_seed'" failed with return code: 1
2015-05-15 19:46:09,842 [salt.loaded.int.module.cmdmod][ERROR   ][4312] retcode: 1

cheuschober commented 9 years ago

I can confirm the container is running and accepts the attach commands directly as above.

terminalmage commented 9 years ago

@cheuschober I added a commit to that pull request which will get rid of the spurious "failed to get the init pid" message. I'll keep looking here.

cheuschober commented 9 years ago

@terminalmage Thank you so much!

terminalmage commented 9 years ago

@cheuschober It looks like all the log messages you posted above are spurious (in other words, a nonzero return code does not necessarily mean that there is a problem). We can use ignore_retcode=True to suppress these and keep them from being seen as errors.
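
To illustrate why a nonzero return code can be a normal answer, here is a small sketch of a probe helper. The names `run_probe` and `path_exists` are hypothetical, the commands run on the local host, and the `ignore_retcode` parameter only mimics the idea of the option mentioned above; Salt's real handling may differ.

```python
import logging
import subprocess

log = logging.getLogger("probe_sketch")


def run_probe(argv, ignore_retcode=False):
    """Run argv and return its exit code.

    A nonzero code is logged at DEBUG when ignore_retcode is set,
    and at ERROR otherwise.
    """
    retcode = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode
    if retcode != 0:
        level = logging.DEBUG if ignore_retcode else logging.ERROR
        log.log(level, "Command %r exited with return code: %d", argv, retcode)
    return retcode


def path_exists(path):
    # `test -e` exits 1 when the path is absent: an answer, not a failure.
    return run_probe(["test", "-e", path], ignore_retcode=True) == 0
```

A check such as `path_exists("/lxc.initial_seed")` returning False then stays out of the ERROR log, which is the effect described for the seed-marker and salt-minion probes in this thread.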

cheuschober commented 9 years ago

@terminalmage Thanks. I'm just realizing that now. There's a lot more interesting output to be had with salt-call lxc.init -- I want to look into something I'm seeing that could indicate a networking problem. I didn't realize this called out to the saltstack website to bootstrap, I (incorrectly) assumed the bootstrapper was shipped from the master to the cloud target.

Still it's good to get those spurious messages out of the way since they clearly took me down a path to chase the wrong goose.

terminalmage commented 9 years ago

Yeah, really sorry about that. I thought we caught all of these, but some of them slipped through. Thanks for helping us to track down the remaining ones.

cheuschober commented 9 years ago

No worries! It's how software gets better! :)

terminalmage commented 9 years ago

I totally agree! :smile:

terminalmage commented 9 years ago

@cheuschober OK, I've gotten rid of the remaining spurious log messages you posted.

lxc.init does rely on networking because it executes salt-bootstrap

cheuschober commented 9 years ago

@terminalmage: Awesome. So here's the networking thing: I would have expected the above profile to just work out of the box since Ubuntu does create an lxcbr0 bridge by default. Interestingly, lxc.create ubutest-01 profile=ubuntu behaves as expected, but lxc.init ubutest-02 profile=ubuntu appears to blow out the networking profile. Similarly, if I init a pre-created container, it also blows out the default networking profile.

Is that intentional?

cheuschober commented 9 years ago

Here are some sample configuration differences between the two:

Created via salt-call lxc.init testbuntu-01 profile=ubuntu:

# Template used to create this container: /usr/share/lxc/templates/lxc-ubuntu
# Parameters passed to the template: --release trusty --arch amd64
# For additional config options, please look at lxc.container.conf(5)

# Common configuration
lxc.include = /usr/share/lxc/config/ubuntu.common.conf

# Container specific configuration
lxc.rootfs = /var/lib/lxc/testbuntu-01/rootfs
lxc.mount = /var/lib/lxc/testbuntu-01/fstab
lxc.utsname = testbuntu-01
lxc.arch = amd64

# Network configuration
lxc.start.auto = 1

Created via salt-call lxc.create testbuntu-02 profile=ubuntu

# Template used to create this container: /usr/share/lxc/templates/lxc-ubuntu
# Parameters passed to the template: --release trusty --arch amd64
# For additional config options, please look at lxc.container.conf(5)

# Common configuration
lxc.include = /usr/share/lxc/config/ubuntu.common.conf

# Container specific configuration
lxc.rootfs = /var/lib/lxc/testbuntu-02/rootfs
lxc.mount = /var/lib/lxc/testbuntu-02/fstab
lxc.utsname = testbuntu-02
lxc.arch = amd64

# Network configuration
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = lxcbr0
lxc.network.hwaddr = 00:16:3e:2c:88:c7
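
The difference between the two files can also be checked mechanically. The sketch below parses `key = value` lines and lists only the `lxc.network.*` entries; both helper names are hypothetical and the parser ignores everything except simple assignments.

```python
def parse_lxc_config(text):
    """Collect `key = value` pairs in order, skipping comments and blanks."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            entries.append((key.strip(), value.strip()))
    return entries


def network_settings(text):
    """Return only the lxc.network.* entries from a config file's text."""
    return [(k, v) for k, v in parse_lxc_config(text) if k.startswith("lxc.network.")]


init_config = """
# Network configuration
lxc.start.auto = 1
"""
create_config = """
# Network configuration
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = lxcbr0
"""
init_net = network_settings(init_config)
create_net = network_settings(create_config)
```

For the excerpts above, `init_net` is empty while `create_net` carries the veth settings, which matches the observation that the container built by lxc.init has no network configuration.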

kiorky commented 9 years ago

What do you mean by blowing out the profile? In the LXC config?

Or in the output you get from the logs?

For the networking, be sure to configure something to use 'lxcbr0', which is the default bridge on Ubuntu. Also take a look at your template's /etc/network/interfaces to check whether it uses DHCP or not, and verify that there is a DHCP server serving your bridge. Normally on Ubuntu, the lxcnet service gives you a dnsmasq configured for that purpose.

kiorky commented 9 years ago

Seems our comments crossed. You'll need to attach a network profile; did you configure one? See https://github.com/terminalmage/salt/blob/0f6f239052074eed083202acf9db77d72a77c6f2/salt/modules/lxc.py#L446

kiorky commented 9 years ago

We had a discussion with @terminalmage about whether to configure the network profile by default, but it is impossible to do as it is distribution- and user-specific: DHCP? Which bridge? Which gateway? And so on.

kiorky commented 9 years ago

I just realized from one of the comments above that you are using the runner (I didn't read back through the whole discussion at first).

kiorky commented 9 years ago

This should be enough:

lxc.network_profile.ubuntu:
  eth0:
    link: lxcbr0
    type: veth
    flags: up

then use

salt-run lxc.init -lall testbuntu-xx host=salt profile=ubuntu network_profile=ubuntu

cheuschober commented 9 years ago

@kiorky, That worked. Thank you so much for your direction and time. (I had just tested that independently). I guess what threw me was that lxc.create does not have the same behavior in regards to network profiles as lxc.init. Would it be reasonable to have an explicit warning of that fact in the documentation (that network_profile is required for init and not for create)?

kiorky commented 9 years ago

Well, that is a hard question to answer. lxc-create is a different beast and much more raw than lxc.init, so we can make some shortcuts.

lxc.init is more of a building block for tools that assemble their containers from configuration options, and yes, it is a bit tough the first time you use it.

It is versatile and usable in many ways; e.g., I do not use it like you do :).

I think we should at least complete the documentation to state correctly that lxc.init should really be used from the runner, and indicate clearly that a profile and a network profile must be configured as prerequisites.

cheuschober commented 9 years ago

Indeed. Thank you again.

terminalmage commented 9 years ago

Network profiles are not required for lxc.create either, it's just that if one isn't provided, Salt uses a fallback profile. The reasoning for this is that many LXC images come with networking set to "empty".

cheuschober commented 9 years ago

@terminalmage Yup. It was just the difference in how lxc.create would use that fallback and lxc.init would not that threw me. This was all a long road of trying to debug some problems with an lxc-cloud issue for me and as I decomposed the problem into what I thought of as its component parts I realized there were some fairly magical things happening in init

terminalmage commented 9 years ago

If init is not using the fallback, we should make it do so. What do you think, @kiorky?

kiorky commented 9 years ago

I'm not sure at all that it can be done easily without breaking compatibility and behavior.

init is made to work with explicit settings and overwrites the LXC config with its own set of LXC settings, as said in another comment.

By contrast, lxc.create only applies a network profile on top of what's created by the stock LXC utils.

Thus, what we could do is to try to find a bridge to attach to (search lxcbr, then br) and setup that if no profile was selected, instead of assuming no network.
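
A minimal sketch of the bridge search proposed above, assuming the list of interface names is already known. Both helpers are hypothetical illustrations of the idea, not existing Salt functions, and the `eth0` settings simply reuse the profile values shown earlier in this thread.

```python
def pick_default_bridge(interfaces):
    """Prefer interfaces named lxcbr*, then br*; return None if neither exists."""
    for prefix in ("lxcbr", "br"):
        matches = sorted(name for name in interfaces if name.startswith(prefix))
        if matches:
            return matches[0]
    return None


def fallback_network_profile(interfaces):
    """Build a one-NIC profile on the detected bridge, or {} if none is found."""
    bridge = pick_default_bridge(interfaces)
    if bridge is None:
        return {}
    return {"eth0": {"link": bridge, "type": "veth", "flags": "up"}}


profile = fallback_network_profile(["lo", "eth0", "br0", "lxcbr0"])
```

On a host with both `br0` and `lxcbr0`, the sketch picks `lxcbr0`; with no bridge at all it returns an empty profile, which corresponds to the current "no network" behavior.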

cheuschober commented 9 years ago

@kiorky, So I was still having some trouble with my cloud init and came back to the networking issue.

When I last stated it as working I must have called

salt-call lxc.init testbuntu-01 profile=ubuntu network_profile=ubuntu

This worked, as hoped. But I just tried the runner version and found that network_profile appeared to be ignored.

salt-run lxc.init testbuntu-01 host=salt profile=ubuntu network_profile=ubuntu

In the above case, the network profile was not applied. I tried the network profile in both the format you suggested as well as the hierarchical form below:

lxc.network_profile:
  ubuntu:
    eth0:
      link: lxcbr0
      type: veth
      flags: up

lxc.container_profile:
  ubuntu:
    template: ubuntu
    options:
      release: trusty
      arch: amd64

kiorky commented 9 years ago

When you have an error, please attach the FULL log from the start of your experiments; without it, we can't help you much.

kiorky commented 9 years ago

@terminalmage the patches you made are on 2015.5, in a portion which conflicts with develop as it was migrated to container_resources. I'm checking whether we also need to make the same patches in container_resources.

kiorky commented 9 years ago

@cheuschober I'm currently testing your use case, from end to end, to see what's going on.