rpm-software-management / mock

Mock is a tool for a reproducible build of RPM packages.
GNU General Public License v2.0

mock fails to work in NFS Rootfs mounted Linux #768

Open changyp6 opened 3 years ago

changyp6 commented 3 years ago

Short description of the problem

I have set up a koji-based aarch64 build server that uses an NFS mount as its rootfs. I also mounted /var/lib/mock and /var/cache/mock on a hard disk.

When I run mock, the mock hw_info plugin reports

"DEBUG: lscpu: failed to determine number of CPUs: /sys/devices/system/cpu/possible: No such file or directory"
"DEBUG: /bin/df: Warning: cannot read table of mounted file systems: No such file or directory"
"DEBUG: Error: /proc must be mounted"

and mock finally fails with the error message:

ERROR: Command failed: 
 # /bin/mount -n -t tmpfs -o rprivate tmpfs /var/lib/mock/dist-myos-bootstrap/root/proc

After seeing these errors, I searched for the "results" folder and found that the "dist-myos-bootstrap" folder is NOT in the mounted "/var/lib/mock" folder; instead, it is in the /var/lib/mock folder of the NFS rootfs.

This explains why lscpu failed with "/sys/devices/system/cpu/possible: No such file or directory": /sys has no contents in the NFS rootfs itself; it is only populated once it is mounted in the running system.

I'm confused: why does mock run commands on top of the "physical" root device instead of the logical root device?

My logical rootfs has /sys and /proc mounted, and /var/lib/mock and /var/cache/mock are both mounted on a hard disk. My physical rootfs is an NFS rootfs.
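One way to confirm which filesystem actually backs a path, as seen by the current process, is to match the path against the mount table. The following is a minimal sketch, not part of mock: the helper names `parse_mounts` and `backing_mount` are hypothetical, and it assumes mount points without escaped characters (such as `\040` for spaces) in /proc/self/mounts.

```python
import posixpath


def parse_mounts(text):
    """Parse /proc/self/mounts-style text into (mount_point, fstype, source) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        source, mount_point, fstype = fields[0], fields[1], fields[2]
        entries.append((mount_point, fstype, source))
    return entries


def backing_mount(path, entries):
    """Return the entry whose mount point is the longest prefix of path, or None."""
    path = posixpath.normpath(path)
    best = None
    for mount_point, fstype, source in entries:
        mp = posixpath.normpath(mount_point)
        if path == mp or mp == "/" or path.startswith(mp + "/"):
            # ">=" lets a later mount on the same mount point win over an earlier one
            if best is None or len(mp) >= len(best[0]):
                best = (mp, fstype, source)
    return best


if __name__ == "__main__":
    with open("/proc/self/mounts") as fh:
        table = parse_mounts(fh.read())
    for p in ("/var/lib/mock", "/var/cache/mock", "/proc", "/sys"):
        print(p, "->", backing_mount(p, table))
```

Running this both on the host and from inside the process that has called unshare would show whether the two see the same mount table.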

Output of rpm -q mock

mock-2.12

Steps to reproduce issue

  1. boot linux into nfs root file system with boot parameters "console=ttyS0 root=/dev/nfs rw rootfstype=nfs nfsroot=NFS_SERVER_IP:/path/to/nfsroot,nolock,vers=4.2 ip=dhcp"
  2. mock -r YOUR_CONFIG --init --trace

Additional Information

  1. /var/lib/mock and /var/cache/mock are both set up correctly with 02775 permissions and root:mock ownership.
  2. Running lscpu manually gives correct results.
  3. All files in the NFS rootfs have correct file permissions.
  4. Running python3 -c "import subprocess; subprocess.Popen('/usr/bin/lscpu')" manually in a console gives the correct result.

changyp6 commented 3 years ago

After five days of investigating this issue, I have finally found the root cause. Line 816 of /usr/libexec/mock/mock calls unshare_namespace(config_opts), and that function uses util.CLONE_NEWNS to unshare the mount namespace.

After commenting out line 816, mock works in my NFS rootfs linux system.

I'm not sure whether this is a bug or whether unsharing all mount points is intended.

If /var/lib/mock is a mount point, then after "unshare_namespace" is called, data is no longer written to the device mounted at /var/lib/mock; it is written to the original /var/lib/mock folder underneath.

I hope this helps others with the same issue.

Mock developers, please help investigate this issue and suggest a better solution.

praiskup commented 3 years ago

Mock intentionally uses a separate namespace for mountpoints. This is expected.

I fail to see what is happening with the /var/lib/mock folders in mock, though. Doing unshare - when the mount points are already mounted on the host - should have no effect on the unshared namespace, as the mountpoint list should be copied.

praiskup commented 3 years ago

> ERROR: Command failed: $ /bin/mount -n -t tmpfs -o rprivate tmpfs /var/lib/mock/dist-myos-bootstrap/root/proc

Why is this command failing for you? It is obvious it fails, but not why.

changyp6 commented 3 years ago

> Mock intentionally uses a separate namespace for mountpoints. This is expected.
>
> I fail to see what is happening with the /var/lib/mock folders in mock, though. Doing unshare - when the mount points are already mounted on the host - should have no effect on the unshared namespace, as the mountpoint list should be copied.

My builder uses an NFS-mounted rootfs. Mock does all of its work after unshare_namespace(), but once the original mount namespace has been unshared, the hw_info plugin fails: /proc is also unshared in the new namespace, so lscpu, df, and free all fail there because /proc is empty. With /proc unshared as well, the original mount points are evidently not copied into the new namespace.

I don't know if this issue is related to NFS.

changyp6 commented 3 years ago

> Mock intentionally uses a separate namespace for mountpoints. This is expected.
>
> I fail to see what is happening with the /var/lib/mock folders in mock, though. Doing unshare - when the mount points are already mounted on the host - should have no effect on the unshared namespace, as the mountpoint list should be copied.

My understanding of this reply is: 'Mock should still see the originally mounted mount points after "unshare_namespace", because /proc is not affected by unshare_namespace.' That is: unshare_namespace --> read mount points from /proc. I think this is how mock behaves on a hard-drive rootfs, and there it does work.

In my case the rootfs is NFS: 'After unshare_namespace, /proc is also unshared in the separate namespace, so none of the originally mounted mount points can be obtained.' That is: unshare_namespace --> /proc is unshared too --> cannot get mount points or read system information.

changyp6 commented 3 years ago

My system is running kernel 5.14.32 with glibc 2.34 and Python 3.10.0 rc1.

praiskup commented 3 years ago

> My understanding of this reply is that 'Mock should get the originally mounted mountpoints after "unshare_namespace", that's because /proc won't be affected by unshare_namespace'

Yes, that's what I meant. The effect of unsharing should be that we don't propagate mount events from the new namespace up to the parent namespace, but the list of mounts shouldn't be affected IMO, per `mount_namespaces(7)`:

  *  If the namespace is created using unshare(2), the mount  point  list
     of  the  new  namespace  is  a  copy  of the mount point list in the
     caller's previous mount namespace.

I'm not sure what is going on, but it isn't trivial for me to set up a rootfs on NFS to test this scenario. What propagation options are set on your /proc (etc.), and what happens when you unshare manually (with /usr/bin/unshare)?
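The propagation type of each mount can be read from the optional fields of /proc/self/mountinfo (tags such as `shared:N`, `master:N`, or `unbindable`; a mount with no tag is private). Below is a minimal sketch for collecting them; the function name `parse_propagation` is hypothetical and not part of mock.

```python
def parse_propagation(mountinfo_text):
    """Map each mount point to its propagation tags from mountinfo-style text.

    A mountinfo line looks like:
        ID PARENT MAJ:MIN ROOT MOUNTPOINT OPTIONS [optional fields...] - FSTYPE SOURCE SUPEROPTS
    The optional fields sit between the mount options and the "-" separator.
    """
    result = {}
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # the separator can only appear at index 6 or later
            sep = fields.index("-", 6)
        except ValueError:
            continue  # blank or malformed line
        mount_point = fields[4]
        optional = fields[6:sep]
        # no optional field means the mount is private
        result[mount_point] = optional if optional else ["private"]
    return result


if __name__ == "__main__":
    with open("/proc/self/mountinfo") as fh:
        tags = parse_propagation(fh.read())
    for mp in ("/", "/proc", "/sys", "/var/lib/mock", "/var/cache/mock"):
        print(mp, tags.get(mp, "not a mount point"))
```

Comparing the output on the host with the output of the same script run under `unshare -m` would answer both questions at once.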

changyp6 commented 3 years ago

> > My understanding of this reply is that 'Mock should get the originally mounted mountpoints after "unshare_namespace", that's because /proc won't be affected by unshare_namespace'
>
> Yes, that's what I meant. The effect of unsharing should be that we don't propagate mount events from the new namespace up to the parent namespace, but the list of mounts shouldn't be affected IMO, per `mount_namespaces(7)`:
>
>   *  If the namespace is created using unshare(2), the mount  point  list
>      of  the  new  namespace  is  a  copy  of the mount point list in the
>      caller's previous mount namespace.
>
> I'm not sure what is going on, but it isn't trivial for me to set up a rootfs on NFS to test this scenario. What propagation options are set on your /proc (etc.), and what happens when you unshare manually (with /usr/bin/unshare)?

I tried modifying /usr/libexec/mock/mock:

  1. If I remove CLONE_NEWUTS from extended_unshare_flags and only unshare the mount namespace (CLONE_NEWNS), everything works.
  2. If I unshare NEWNS and NEWUTS separately (calling unshare(CLONE_NEWNS) and unshare(CLONE_NEWUTS)), NEWUTS always fails no matter which call comes first; however, /proc still exists and mock works.
  3. If I unshare CLONE_NEWNS first, unshare reports OK; if I then unshare CLONE_NEWNS | CLONE_NEWUTS, unshare fails, /proc no longer exists, and mock fails to work.
  4. If I unshare CLONE_NEWNS, then CLONE_NEWNS | CLONE_NEWUTS, then CLONE_NEWNS again, /proc is still missing.

After this testing, I found that once unshare(CLONE_NEWNS | CLONE_NEWUTS) has failed, the mount points no longer exist in the resulting environment; calling unshare(CLONE_NEWNS) afterwards doesn't help and won't bring /proc back.

Then I ran another test on my system with the following command:

$ sudo unshare -u
unshare: unshare failed: Invalid argument

My system reports an invalid argument for the UTS unshare operation.

These are my findings; I hope they help you locate the issue.

My theory is that when unshare(CLONE_NEWNS | CLONE_NEWUTS) is called and fails, the process is nevertheless already in a new environment, but this environment has neither the mount namespace nor the UTS namespace of the parent system. It seems that when unshare fails, everything fails together, yet you still get a new environment. In this environment, calling unshare(CLONE_NEWNS) again won't help, because the mount namespace doesn't exist there; that's why everything in /proc is empty.

In the example program in the unshare() man page, the program simply exits if unshare() fails. The man page does not explain how to handle an unshare failure.
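For reference, here is how an unshare failure can be detected and reported from Python. This is a minimal sketch under stated assumptions, not mock's actual code: it calls the libc `unshare` function through ctypes (the system described here runs Python 3.10, which has no `os.unshare`), the `libc` parameter exists only so the wrapper can be exercised without privileges, and the real call needs root (CAP_SYS_ADMIN). The sketch reports the failure and exits, like the man page example; it makes no claim about the state of the process after a failed call.

```python
import ctypes
import os

# Flag values from <linux/sched.h>
CLONE_NEWNS = 0x00020000   # new mount namespace
CLONE_NEWUTS = 0x04000000  # new UTS (hostname) namespace


def unshare(flags, libc=None):
    """Call unshare(2); raise OSError carrying errno if the call fails."""
    if libc is None:
        # use_errno=True makes ctypes keep a copy of errno for get_errno()
        libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    try:
        unshare(CLONE_NEWNS | CLONE_NEWUTS)
    except OSError as exc:
        # Stop here instead of continuing in an unknown state
        raise SystemExit(f"unshare failed: {exc}")
    print("now running in new mount and UTS namespaces")
```

On a kernel where the requested namespace type is unavailable, the error would be the same "Invalid argument" that `unshare -u` printed above.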

changyp6 commented 3 years ago

I have finally found the root cause of this issue!!!!

If unshare(CLONE_NEWUTS | CLONE_NEWNS) fails, the new mount namespace is empty even if a subsequent unshare(CLONE_NEWNS) succeeds.

The question is why unshare(CLONE_NEWUTS | CLONE_NEWNS) fails. The answer is that the IPC namespace is NOT enabled in the kernel.

To solve this mock issue, just make sure that the IPC namespace and the mount namespace are both enabled in the kernel.

However, the logic in the mock code is still wrong: mock should exit immediately if unshare(CLONE_NEWUTS | CLONE_NEWNS) fails. Any unshare operation that succeeds after a failed one does not work for mock.

I also suggest that mock's documentation record this requirement: the kernel must have the IPC namespace and mount namespace features enabled.
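A quick way to check this requirement before running mock is to look at /proc/self/ns, where the kernel exposes one entry per namespace type it was built with. The sketch below is an assumption-laden illustration, not something mock does: the function name `missing_namespaces` is hypothetical, and it assumes a kernel recent enough to provide /proc/self/ns at all. It also checks `uts`, since that is the flag used in the failing unshare call.

```python
import os


def missing_namespaces(required=("mnt", "uts", "ipc"), ns_dir="/proc/self/ns"):
    """Return the namespace types from `required` that have no entry in ns_dir."""
    return [
        name
        for name in required
        # lexists: the entries are special symlinks, so do not follow them
        if not os.path.lexists(os.path.join(ns_dir, name))
    ]


if __name__ == "__main__":
    missing = missing_namespaces()
    if missing:
        raise SystemExit("kernel lacks namespace support for: " + ", ".join(missing))
    print("mnt, uts and ipc namespaces are available")
```

A non-empty result suggests the corresponding kernel configuration option was disabled when the kernel was built.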

praiskup commented 3 years ago

Thank you for the info, @changyp6 - I'll keep this open.

> However, the logic in mock code is still wrong, mock should exit immediately if unshare(CLONE_NEWUTS | CLONE_NEWNS) fails.

We really need to take a look at the unshare logic.