lxc / lxc

LXC - Linux Containers
https://linuxcontainers.org/lxc

Oracle RAC 18c Grid Infrastructure in LXC Containers #2929

Closed gstanden closed 5 years ago

gstanden commented 5 years ago


Required information

ubuntu@ubuntu-ThinkPad-P72:~/Downloads$ lxc-checkconfig
Kernel configuration not found at /proc/config.gz; searching...
Kernel configuration found at /boot/config-4.18.0-17-generic
--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled

--- Control groups ---
Cgroups: enabled

Cgroup v1 mount points:
/sys/fs/cgroup/systemd
/sys/fs/cgroup/freezer
/sys/fs/cgroup/perf_event
/sys/fs/cgroup/pids
/sys/fs/cgroup/rdma
/sys/fs/cgroup/net_cls,net_prio
/sys/fs/cgroup/blkio
/sys/fs/cgroup/cpu,cpuacct
/sys/fs/cgroup/memory
/sys/fs/cgroup/cpuset
/sys/fs/cgroup/devices
/sys/fs/cgroup/hugetlb

Cgroup v2 mount points:
/sys/fs/cgroup/unified

Cgroup v1 clone_children flag: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled

--- Misc ---
Veth pair device: enabled, loaded
Macvlan: enabled, not loaded
Vlan: enabled, not loaded
Bridges: enabled, loaded
Advanced netfilter: enabled, not loaded
CONFIG_NF_NAT_IPV4: enabled, loaded
CONFIG_NF_NAT_IPV6: enabled, loaded
CONFIG_IP_NF_TARGET_MASQUERADE: enabled, not loaded
CONFIG_IP6_NF_TARGET_MASQUERADE: enabled, not loaded
CONFIG_NETFILTER_XT_TARGET_CHECKSUM: enabled, loaded
CONFIG_NETFILTER_XT_MATCH_COMMENT: enabled, not loaded
FUSE (for use with lxcfs): enabled, not loaded

--- Checkpoint/Restore ---
checkpoint restore: enabled
CONFIG_FHANDLE: enabled
CONFIG_EVENTFD: enabled
CONFIG_EPOLL: enabled
CONFIG_UNIX_DIAG: enabled
CONFIG_INET_DIAG: enabled
CONFIG_PACKET_DIAG: enabled
CONFIG_NETLINK_DIAG: enabled
File capabilities:

Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig

ubuntu@ubuntu-ThinkPad-P72:~/Downloads$ uname -a
Linux ubuntu-ThinkPad-P72 4.18.0-17-generic #18-Ubuntu SMP Wed Mar 13 14:34:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@ubuntu-ThinkPad-P72:~/Downloads$ cat /proc/self/cgroup
12:hugetlb:/
11:devices:/user.slice
10:cpuset:/
9:memory:/user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service
8:cpu,cpuacct:/user.slice
7:blkio:/user.slice
6:net_cls,net_prio:/
5:rdma:/
4:pids:/user.slice/user-1000.slice/user@1000.service
3:perf_event:/
2:freezer:/user/ubuntu/0
1:name=systemd:/user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service
0::/user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service
ubuntu@ubuntu-ThinkPad-P72:~/Downloads$

Issue description

I realize this is somewhere between nothing and crumbs to go on, but that's why I'm reaching out. The output below is from the Oracle Grid Infrastructure install log. I did not have these problems when I installed on Ubuntu 15.04 with LXC 2.x. Any ideas for digging further would be appreciated; at this point I don't know what the next debugging steps would be. It seems to be some kind of permission error, but permission errors can be misleading, and in any case I don't know which permissions it refers to. Sorry for such an ambiguous error stack. If you can suggest something I can work on myself to debug this, please do.

2019-04-06 12:18:14.896 : CLSCEVT:779613952: (:CLSCE0047:)clsce_publish_internal 0xa8dc30 EvmConnCreate failed with status = 13, try = 1
2019-04-06 12:18:14.907 : USRTHRD:951110720: clsncssd_logose: slos [-2], SLOS depend-msg [Permission denied], SLOS error-msg [13]
2019-04-06 12:18:14.907 : USRTHRD:951110720: clsncssd_logose: SLOS other info is [invalid permission].
2019-04-06 12:18:14.907 : USRTHRD:951110720: clsncssd_main: failed to init node reboot.
2019-04-06 12:18:14.907 : AGENT:951110720: Agent is exiting with exit code: -1


Installation of Oracle Grid Infrastructure in Oracle Linux 7.6 LXC containers running on an Ubuntu 18.10 host OS.

Information to attach

Oracle Support has only one bug report that comes close to matching this error stack. Here it is:

"This bug (22393909) is only relevant when using Oracle Real Application Clusters (RAC)

A different container name than the hostname that it is hosted on results in stack not coming up.

Rediscovery Notes
Example trace from this issue
2015-12-18 16:03:12.605418 : USRTHRD:650585696: clsncssd_logose: slos [-2], SLOS depend-msg [2], SLOS error-msg [No such f]
2015-12-18 16:03:12.605479 : USRTHRD:650585696: clsncssd_logose: SLOS other info is [invalid permission].
...
2015-12-18 16:03:12.605531 : USRTHRD:650585696: clsncssd_main: failed to init node reboot.
2015-12-18 16:03:12.605646 : AGFW:650585696: Agent is exiting with exit code: -1

This bug may be suspected if the freezer path returns an "invalid permission" error:
/.../crs/trace/ohasd_cssdagent_root.trc
2016-01-21 08:42:26.212 : USRTHRD:3660403264: clsncssd_logose: SLOS other info is [invalid permission].

Workaround
Set container and hostname to be same string"

Setting the LXC hostname to be the same as the LXC container name is not really practical. Moreover, I do not run into this problem when building on an Oracle Linux 7.x LXC host. Since the main mission of Orabuntu-LXC is to allow users to run "any oracle on any linux", in particular on Ubuntu Linux, I'm clawing and scratching to get at what this error stack actually means, so what I'm looking for is ideas on how to diagnose this myself. The error stack from Oracle suggests no next debugging steps to me, but as you can see, the author of the bug report indicates that the "freezer path" is involved. Thanks!
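Since the bug note points at the freezer path, one starting point is to print the freezer cgroup path that a process actually sees, on the host and again inside the container, and compare the two. The sketch below is only an illustration: `freezer_path` is a made-up helper name, and it assumes the cgroup v1 line format `ID:controllers:path` shown in the `/proc/self/cgroup` listing above.

```shell
#!/bin/sh
# Hypothetical helper (not part of LXC or Oracle tooling).
# Prints the freezer cgroup path from a cgroup v1 listing.
# $1: path to a cgroup file; defaults to /proc/self/cgroup.
freezer_path() {
    cgroup_file="${1:-/proc/self/cgroup}"
    # Lines look like "2:freezer:/user/ubuntu/0".
    # Select the line whose controller field is exactly "freezer",
    # then strip the first two colon-separated fields to leave the path.
    awk -F: '$2 == "freezer" { sub(/^[^:]*:[^:]*:/, ""); print }' "$cgroup_file"
}
```

Run against the host listing above, this would print `/user/ubuntu/0`. Running it inside a container shows the path as that container's processes see it.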

gstanden commented 5 years ago

I solved this myself (after thinking about it a bit) and am posting the solution, or workaround, or whatever we want to call it. I executed these steps on the Ubuntu 18.10 LXC host:

/etc/init.d/apparmor stop
/etc/init.d/apparmor teardown

Then I re-launched the root.sh execution and "Hey Mikey! Look! Oracle Grid Infrastructure 18c likes it!" Oracle Grid Infrastructure 18c is now running happily in Oracle Linux 7.6 LXC containers on top of the Ubuntu Linux 18.10 LXC host.

Now my question to you is more specific: any suggestions on how I can avoid shutting down AppArmor entirely, and instead identify the specific AppArmor profile(s) that were blocking the Oracle Grid Infrastructure process(es), so that I can keep running AppArmor on the Ubuntu 18.10 LXC host?

TIA HTH

gstanden commented 5 years ago

Continuing to drill down on this, the more granular solution I am using now is to set the following in the config file of each LXC container that hosts an Oracle 18c GI node:

lxc.apparmor.allow_incomplete = 1
lxc.apparmor.profile=unconfined
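With several GI node containers, the two settings above have to be added to each config file. A minimal sketch of doing that idempotently follows; `set_lxc_key` is a hypothetical helper name, and the config file path is passed in by the caller rather than assumed.

```shell
#!/bin/sh
# Hypothetical helper (not an LXC command).
# Appends "key = value" to an LXC config file unless the key is already set.
# An existing setting is left untouched, even if its value differs.
# $1: config file, $2: key, $3: value
set_lxc_key() {
    config_file="$1"
    key="$2"
    value="$3"
    # Escape dots so "lxc.apparmor.profile" is matched literally.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${key_re}[[:space:]]*=" "$config_file"; then
        echo "already set: ${key}"
    else
        printf '%s = %s\n' "$key" "$value" >> "$config_file"
    fi
}
```

For example, assuming a container named `ora18c1` (a made-up name) stored under the default `/var/lib/lxc` path, `set_lxc_key /var/lib/lxc/ora18c1/config lxc.apparmor.profile unconfined` would add the second setting once and report `already set` on later runs.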

The unconfined setting can possibly (?) be made more granular by working with the rules in the profiles in the /etc/apparmor.d/lxc directory, so that it might not be necessary to run the containers fully unconfined. The errors that seem to be the issue (harking back to the "SLOS" message passed back to the GI installer, e.g. "SLOS error-msg [13]") are similar to the following, taken from dmesg when running the containers confined:

[ 25.442749] audit: type=1400 audit(1554911510.861:53): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/dev/shm/" pid=3437 comm="mount" flags="rw, remount"
[ 26.123805] audit: type=1400 audit(1554911511.537:54): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/dev/shm/" pid=3574 comm="mount" flags="rw, remount"
[ 26.170179] audit: type=1400 audit(1554911511.589:55): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/dev/shm/" pid=3641 comm="mount" flags="rw, remount"
[ 26.671124] audit: type=1400 audit(1554911512.085:56): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/dev/shm/" pid=3819 comm="mount" flags="rw, remount"
[ 26.825530] RPC: Registered named UNIX socket transport module.
[ 26.825531] RPC: Registered udp transport module.
[ 26.825532] RPC: Registered tcp transport module.
[ 26.825532] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 26.830562] audit: type=1400 audit(1554911512.245:57): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/var/lib/nfs/rpc_pipefs/" pid=3837 comm="mount" fstype="rpc_pipefs" srcname="sunrpc"
[ 26.832281] audit: type=1400 audit(1554911512.245:58): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/var/lib/nfs/rpc_pipefs/" pid=3837 comm="mount" fstype="rpc_pipefs" srcname="sunrpc" flags="ro"
[ 27.345164] audit: type=1400 audit(1554911512.757:59): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=4115 comm="(ntpd)" flags="rw, rslave"
[ 27.628040] audit: type=1400 audit(1554911513.045:60): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/var/lib/nfs/rpc_pipefs/" pid=4266 comm="mount" fstype="rpc_pipefs" srcname="sunrpc"
[ 27.639026] audit: type=1400 audit(1554911513.053:61): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/var/lib/nfs/rpc_pipefs/" pid=4266 comm="mount" fstype="rpc_pipefs" srcname="sunrpc" flags="ro"
[ 27.754923] audit: type=1400 audit(1554911513.173:62): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/var/lib/nfs/rpc_pipefs/" pid=4321 comm="mount" fstype="rpc_pipefs" srcname="sunrpc"
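To see which mounts are being denied without reading every audit line, the denials can be grouped by operation and target. The following is a rough sketch that assumes the `key="value"` audit format shown in the dmesg output above; `summarise_denials` is a hypothetical name, and values containing spaces (such as `flags`) are deliberately ignored.

```shell
#!/bin/sh
# Hypothetical helper (not part of AppArmor).
# Reads kernel log lines on stdin and prints "count operation name"
# for each distinct AppArmor denial, most frequent first.
summarise_denials() {
    awk '
        /apparmor="DENIED"/ {
            op = ""; name = ""
            # Fields are whitespace-separated; pick operation="..." and name="...".
            # Anchoring on ^name= avoids matching srcname="...".
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^operation="/) op = $i
                if ($i ~ /^name="/) name = $i
            }
            # Strip the leading key=" and the trailing quote.
            gsub(/^[a-z]+="|"$/, "", op)
            gsub(/^[a-z]+="|"$/, "", name)
            count[op " " name]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -k1,1nr -k2
}
```

A typical invocation would be `dmesg | summarise_denials` on the LXC host (reading the kernel log may require root). For the excerpt above, the targets it would list are `/dev/shm/`, `/var/lib/nfs/rpc_pipefs/` and `/`.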

When running the containers unconfined, the dmesg output has zero DENIED messages (as expected), and the following is also reported:

12 processes are unconfined but have a profile defined.
/sbin/dhclient (927)
/usr/bin/lxc-start (4057)
/usr/bin/lxc-start (5021)
/usr/bin/lxc-start (6243)
/usr/bin/lxc-start (27414)
/usr/sbin/cups-browsed (741)
/usr/sbin/cupsd (714)
/usr/sbin/ntpd (1212)
/usr/sbin/ntpd (4930)
/usr/sbin/ntpd (6321)
/usr/sbin/ntpd (7661)
/usr/sbin/ntpd (27901)

Based on other similar errors that I saw in dmesg when running confined, I experimented with adding directives (now commented out) such as those shown below:

ubuntu@ubuntu1810A:/etc/apparmor.d/lxc$ cat lxc-default-cgns
# Do not load this file. Rather, load /etc/apparmor.d/lxc-containers, which
# will source all profiles under /etc/apparmor.d/lxc

profile lxc-container-default-cgns flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/lxc/container-base>

  # the container may never be allowed to mount devpts. If it does, it
  # will remount the host's devpts. We could allow it to do it with
  # the newinstance option (but, right now, we don't).
  deny mount fstype=devpts,
  mount fstype=cgroup -> /sys/fs/cgroup/**,
  mount fstype=cgroup2 -> /sys/fs/cgroup/**,
  # allow mount fstype=rpc_pipefs,
  # allow mount fstype=tmpfs,
  # mount options=(rw, bind, ro, remount, rslave),
}

Although this resolved some of the error=-13 denials, it did not resolve all of them, and so far Oracle GI still won't start even with these additional mount directives uncommented and in effect. So for now I went with running unconfined as the smallest reduction in AppArmor coverage (it applies only to the Oracle GI containers, not to the host). It would be even better, I think, to get more granular than that and find a way to achieve GI startup with the profiles in /etc/apparmor.d/lxc instead of running the containers unconfined, but I am still experimenting with what can be done there.

So for now, the best solution is to run these containers unconfined. My concern there is that running containers unconfined can have results like this: https://github.com/lxc/lxd/issues/3096

gstanden commented 5 years ago

Since Oracle GI also needs NTP, it seemed prudent to also set ntpd to complain mode instead of enforce mode on the Ubuntu LXC host (reference: https://github.com/lxc/lxc/issues/2108):

sudo aa-complain /usr/sbin/ntpd

9 processes are in complain mode.
/usr/bin/lxc-start (2402)
/usr/bin/lxc-start (10797)
/usr/bin/lxc-start (11972)
/usr/bin/lxc-start (13012)
/usr/sbin/ntpd (1119)
/usr/sbin/ntpd (3279)
/usr/sbin/ntpd (11564)
/usr/sbin/ntpd (13550)
/usr/sbin/ntpd (15050)
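To confirm that every ntpd instance (host and containers) ended up in complain mode, the process list can be filtered by binary. This is a sketch under the assumption that the input has the section layout shown above, which is the layout `aa-status` prints; `pids_in_complain_mode` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper (not part of AppArmor).
# Reads aa-status style output on stdin and prints the PIDs of the
# given binary that are listed in the "complain mode" process section.
# $1: full path of the binary, e.g. /usr/sbin/ntpd
pids_in_complain_mode() {
    binary="$1"
    awk -v bin="$binary" '
        # The section of interest starts at this header line.
        /^[0-9]+ processes are in complain mode/ { in_section = 1; next }
        # Any other line starting with a count is a new section header.
        /^[0-9]+ / { in_section = 0 }
        # Process lines look like "/usr/sbin/ntpd (1119)".
        in_section && $1 == bin { gsub(/[()]/, "", $2); print $2 }
    '
}
```

Usage would be along the lines of `sudo aa-status | pids_in_complain_mode /usr/sbin/ntpd`. Comparing the result with the PIDs of all running ntpd processes shows whether any instance is still enforced.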