ArtemOnanko commented 1 year ago

Hi @sfjro,

Could you please take a look a the strange behavior I've encountered when trying to use aufs5.19 vs Kernel 5.19.15?

I applied AUFS patches on kernel5.19.15：

patch "aufs5-base.patch" patch "aufs5-mmap.patch" patch "aufs5-standalone.patch" patch "aufs5-loopback.patch" patch "aufs5-tmpfs-idr.patch" patch "aufs5-vfs-ino.patch" patch "aufs5-lockdep-debug.patch" (also applied aufs5-loopback.patch to aufs sources)

Then operate in initrd:

insmod /modules/aufs.ko … mkdir /initrd mkdir /initrd/root.ro mkdir /initrd/root.rw mkdir /initrd/fixedram

echo " Mounting /initrd/root.ro" mount -t ext3 -o noatime,rw,data=ordered /dev/sda5 /initrd/root.ro (sda5 is a system partition)

echo " Mounting tmpfs /initrd/fixedram /initrd/fixedram, and preallocated loopback root.rw" mount -t tmpfs /initrd/fixedram /initrd/fixedram dd if=/dev/zero of=/initrd/fixedram/root.rw bs=1024 count=1024 mke2fs -b 1024 -F /initrd/fixedram/root.rw 1G > /dev/null

mknod /dev/loop9 b 7 9 echo " attempting to lostup /dev/loop9" losetup /dev/loop9 /initrd/fixedram/root.rw

echo " attemping mount of /dev/loop to /initrd/root.rw" mount -t ext2 /dev/loop9 /initrd/root.rw

echo " Mounting -t aufs root.union ." mkdir /newroot

mount -t aufs -o "udba=none,dirs=/initrd/root.rw=rw:/initrd/root.ro=ro" root.union /newroot

Everything looks fine:

Now we need to switch to a new root. Originally the problem was noticed after the busybox switch_root command:

exec switch_root -c /dev/console /newroot /sbin/init $@

however, the problematic behavior could be seen after chroot command as well:

chroot /newroot

I’d tried to check the lsmod and noticed a strange number in the “Used by” field for aufs:

After another lsmod the “Used by” number had increased:

It looks like the /sys/module/aufs/refcnt increases after a process try to access shared libraries on the disk:

On the real system this leads to the refcnt overflow after switch_root, causing Kernel “VFS: file-max limit 813078 reached” messages.

This behavior looks unexpected and was not observed when using aufs4.14 (with kernel4.14) with the same settings.

More info: aufs_info.tar.gz

Thanks, Artem

sfjro commented 1 year ago

Hello Artem,

ArtemOnanko:

On the real system this leads to the refcnt overflow after switch_root, c= ausing Kernel =E2=80=9CVFS: file-max limit 813078 reached=E2=80=9D messag= es. =0D =0D =0D This behavior looks unexpected and was not observed when using aufs4.14 (= with kernel4.14) with the same settings.=0D

Thanx for the report. I need more info. Can you try these?

lsof(1) It will show you which files are in use. Can you see anythin unusual?
Magic SysRq key Enable CONFIG_AUFS_MAGIC_SYSRQ, which requires both of CONFIG_AUFS_DEBUG and CONFIG_MAGIC_SYSRQ. And press SysRq + A. It will print many kernel logs about aufs including which files are in use. (Accordint to dmesg in your aufs_info.tar.gz, your kernel seems to disable CONFIG_AUFS_DEBUG)

And you have mounted aufs with udba=notify, but your /proc/mounts shows udba=none. Do you know why?

J. R. Okajima

ArtemOnanko commented 1 year ago

@sfjro, thanks for the quick response

Regarding the udba=none. Initially, I tried to use udba=notify, like it was for the older aufs4.14, but later I changed the setting to less strict (as I understood from the man page) udba=none trying to resolve the problem. This didn't help so I kept using udba=none in my tests. I've edited my first comment.

I've recompiled aufs with CONFIG_AUFS_DEBUG = y and enabled debugging during insmod.

I tried lsof at the beginning of my investigation but did notice nothing suspicious. However, I've captured 3 more lsof outputs before capturing dmesg

and attached the result (chroot was called on the 60-th second in dmesg.txt): dmesg.txt lsof1.txt lsof2.txt lsof3.txt

Is there anything I can add additionally?

Regards, Artem

sfjro commented 1 year ago

ArtemOnanko:

Regarding the udba=none. Initially, I tried to use udba=notify, like it was for the older aufs4.14, but later I changed the setting to less strict (as I understood from the man page) udba=none trying to resolve the problem. This didn't help so I kept using udba=none in my tests. I've edited my first comment.

So there is no unmatch between your command line and the output, right?

I could see another unmatching thing in your first mail.

mount -t ext3 -o noatime,rw,data=3Dordered /dev/sda5 /initrd/root.ro
mount -t ext2 /dev/loop9 /initrd/root.rw

The paths are /initrd/root.r[ow], but the output (image file) says they are /newroot/aufs/root.r[ow]. If you already have done "mount --move", it is ok. But I'd suggest you to "mount --move /initrd/fixedram /newroot/aufs/fixedram" too before switch_root. I am not sure this is releated to the problem though.

I've recompiled aufs with CONFIG_AUFS_DEBUG = y and enabled debugging during insmod. ::: Is there anything I can add additionally?

As you wrote, lsof doesn't help. I'd ask you to try MagicSysRq + A.

J. R. Okajima

ArtemOnanko commented 1 year ago

Hi, @sfjro,

Yes, I have the next "mount --move" commands coming after mount -t aufs -o "udba=none,dirs=/initrd/root.rw=rw:/initrd/root.ro=ro" root.union /newroot but before switch_root in my init script:

mount --move /initrd/root.ro /newroot/aufs/root.ro mount --move /initrd/root.rw /newroot/aufs/root.rw mount --move /sys /newroot/sys mount --move /proc /newroot/proc mount --move /dev /newroot/dev

I've added "mount --move /initrd/fixedram /newroot/aufs/fixedram" as you suggested too: No changes to behavior were noticed.

I've recompiled aufs with CONFIG_AUFS_MAGIC_SYSRQ = y, and triggered aufs sysrq: The results are attached: dmesg_sysrq.txt

Thanks, Artem

sfjro commented 1 year ago

ArtemOnanko:

I've recompiled aufs with CONFIG_AUFS_MAGIC_SYSRQ = y, and triggered aufs sysrq: The results are attached: dmesg_sysrq.txt

OK, thanks. I know you did lsmod in the very early stage and the command exited successfully. So 'ps' command never show you 'lsmod'. And when you push MagicSysRq + A keys, you didn't run lsmod, right?

But I can see a few lsmod in your dmesg_sysrq.txt, which means the binary file is still opened and stays on memory. This is obviously wrong. When the process exits, all the opened files including the command file itself should be closed (except the shared ones) and all the related resources should be freed. Now I understand this is the first clue for my investigation. Gimme some time.

And let me make sure one more thing. Your kernel is vanilla v5.19.15, not ubuntu (or any other distribution's) kernel, right? Did you apply some other patches than aufs?

J. R. Okajima

ArtemOnanko commented 1 year ago

OK, thanks. I know you did lsmod in the very early stage and the command exited successfully. So 'ps' command never show you 'lsmod'. And when you push MagicSysRq + A keys, you didn't run lsmod, right?

Yes, that's right.

And let me make sure one more thing. Your kernel is vanilla v5.19.15, not ubuntu (or any other distribution's) kernel, right? Did you apply some other patches than aufs?

Yes, vanilla v5.19.15, however, I have a few more patches applied. I'm going to disable them and provide you with the result asap.

Artem

ArtemOnanko commented 1 year ago

@sfjro

I can confirm the same behavior with vanilla Kernel 5.19.15 and only aufs patches mentioned in my first comment applied.

Regards, Artem

sfjro commented 1 year ago

ArtemOnanko:

I can confirm the same behavior with vanilla Kernel 5.19.15 and only aufs patches mentioned in my first comment applied.

Thanx for the test. On my side I had tested linux-v5.19.15 + aufs5.19 too, and it passed all my local tests... So there must be some important differeces from your environment and mine. What I currently am aware is

mine is latest aufs5.15
my root filesystem is not aufs. I mean
- real machine
- normal boot
- no initramfs
- no busybox
- aufs module has more debug print
- test using chroot (instead of switch_root)

I will try more on my side. If you can, I'd ask you to test again without udba=none.

J. R. Okajima

sfjro commented 1 year ago

I will try more on my side. If you can, I'd ask you to test again without udba=none.

I'm still testing but cannot reproduce the problem. One very strange thing in your report is "VFS: file-max limit 813078 reached" log and "fs.file-nr = 64 0 813078" from sysctl fs.file-nr.

64 here is the number of current allocated (opened) files. the number is incremented in allocating a file handle, and decremented in freeint the file handle. so your 64 is not so crazy.
813078 is the max number of file handles. generally it is much larger than 813078. on my test system it is 9223372036854775807 and the ram size is 4g. but I am not sure this is the cause of the problem. Did your old v4.14 system has such small number too?

The kernel compares these two numbers and the log is produced when the number of open files exceeds the max number. Obviously 64 is smaller than 813078. I'm wondering why did your system hit the maximum...

J. R. Okajima

ArtemOnanko commented 1 year ago

Hi @sfjro,

mine is latest aufs5.15 - my root filesystem is not aufs. I mean + real machine + normal boot + no initramfs + no busybox + aufs module has more debug print + test using chroot (instead of switch_root)

Should I try the latest aufs5.15 too?

I have a real server that behaves the same, however, I conduct my tests in ESXi VM for convenience.

I'm using a very old busybox 1.19, I'll try a newer one (maybe the mount and chroot wrappers had been updated)

I'm also using chroot in my tests.

If you can, I'd ask you to test again without udba=none.

I've changed the mount command to mount -t aufs -o "dirs=/initrd/root.rw=rw:/initrd/root.ro=ro" root.union /newroot dmesg_no_udba_sysrq.txt

One very strange thing in your report is "VFS: file-max limit 813078 reached" log and "fs.file-nr = 64 0 813078" from sysctl fs.file-nr.

I crafted this line for a better understanding. On the real server, the file-max is bumped to "fs.file-max = 3273086" so the Kernel warning message looks like this: Dec 22 10:42:00 kernel: [167225.737588] VFS: file-max limit 3273083 reached

On the real server, the file-nr is increasing fast (due to the number of working processes), so this warning could be seen after a day or two. This is to demonstrate how fast the file-nr and refcnt is increasing:

[root@ELX-20-HSW ~]# sysctl fs.file-nr fs.file-nr = 643328 0 3273086 [root@ELX-20-HSW ~]# sysctl fs.file-nr fs.file-nr = 643664 0 3273086 [root@ELX-20-HSW ~]# sysctl fs.file-nr fs.file-nr = 643888 0 3273086 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 323187 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 323238 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 323266

There was approximately 1 second between the commands.

Did your old v4.14 system has such small number too?

Yes, but refcnt and consequently file-nr is not growing all the time with Kernel 4.14:

[root@ELX-20-HSW ~]# sysctl fs.file-nr fs.file-nr = 17920 0 3233848 [root@ELX-20-HSW ~]# sysctl fs.file-nr fs.file-nr = 17808 0 3233848 [root@ELX-20-HSW ~]# sysctl fs.file-nr fs.file-nr = 17808 0 3233848 [root@ELX-20-HSW ~]# sysctl fs.file-nr fs.file-nr = 18032 0 3233848 [root@ELX-20-HSW ~]# sysctl fs.file-nr fs.file-nr = 18144 0 3233848 [root@ELX-20-HSW ~]# sysctl fs.file-nr fs.file-nr = 18144 0 3233848 [root@ELX-20-HSW ~]# sysctl fs.file-nr fs.file-nr = 18144 0 3233848 [root@ELX-20-HSW ~]# sysctl fs.file-nr fs.file-nr = 18144 0 3233848 [root@ELX-20-HSW ~]# sysctl fs.file-nr fs.file-nr = 18032 0 3233848 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 6835 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 6845 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 6835 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 6835 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 6835 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 6835 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 6835 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 6872 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 6872 [root@ELX-20-HSW ~]# cat /sys/module/aufs/refcnt 6835

Should I attach the magic sysrq test for aufs 4.14, maybe?

Thanks, Artem

sfjro commented 1 year ago

ArtemOnanko:

Should I try the latest aufs5.15 too?

Currently I don't think it necessary because the change between latest and your version is rather minor, and I don't think it related.

I'm using a very old busybox 1.19, I'll try a newer one (maybe the mount and chroot wrappers had been updated)

Hmm, I'd love to hear the good news after your trial.

If you can, I'd ask you to test again without udba=none.

I've changed the mount command to mount -t aufs -o "dirs=/initrd/root.rw=rw:/initrd/root.ro=ro" root.union /newroot dmesg_no_udba_sysrq.txt

Still some exited lsmod appears... sigh.

I crafted this line for a better understanding. On the real server, the file-max is bumped to "fs.file-max = 3273086" so the Kernel warning message looks like this: Dec 22 10:42:00 kernel: [167225.737588] VFS: file-max limit 3273083 reached

On the real server, the file-nr is increasing fast (due to the number of working processes), so this warning could be seen after a day or two. This is for the demo of how fast the file-nr is increasing:

I understand about file-max, and I don't think it a problem. The problem is still why all(?) files are not closed properly.

Are there any unusal file handling on your server, such like a caching daemon, LSM, or anything intercept the file access and share the files? If some daemon issue mmap(MAP_SHARED) to the opened files, then such problem may happen.

Or manupilating mount tree and making the RO branch recursive, such operation MAY cause a problem like yours. But your /proc/mounts doesn't show such irregular mount tree. Your is pretty normal.

Should I attach the magic sysrq test for aufs 4.14, maybe?

Currently I don't think it necessary.

I appreciate your repeated tests. Still I am testing on my side, but cannot reproduce the problem. I will think more.

J. R. Okajima

sfjro commented 1 year ago

I'm using a very old busybox 1.19, I'll try a newer one (maybe the mount and chroot wrappers had been updated)

Hmm, I'd love to hear the good news after your trial.

Re-reading your report, I've found something weird.

Now we need to switch to a new root. Originally the problem was noticed a= fter the busybox switch_root command:=0D =0D exec switch_root -c /dev/console /newroot /sbin/init $@=0D =0D however, the problematic behavior could be seen after chroot command as w= ell:=0D =0D chroot /newroot=0D

Does your script invoke both of switch_root and chroot? If so, it is wrong. The last command in the script should be "exec switch_root ..." and the command should never return to the script. If it returns, then it means switch_root failed.

J. R. Okajima

sfjro commented 1 year ago

Does your script invoke both of switch_root and chroot? If so, it is wrong. The last command in the script should be "exec switch_root ..." and the command should never return to the script. If it returns, then it means switch_root failed.

If your case is

the last command in the script is switch_root.
chroot is executed manually. then what will happen exiting the chroot-ed shell?

J. R. Okajima

ArtemOnanko commented 1 year ago

Does your script invoke both of switch_root and chroot? If so, it is wrong.

No-no, in the script I have only exec switch_root as the last command. I meant that the issue could be reproduced by calling chroot manually as well.

then what will happen exiting the chroot-ed shell?

This is a good question, I'll check the behavior, I suppose I'll need to mount --move /newroot/sys /sys back in order to check the refcnt in this case?

Regards, Artem

sfjro commented 1 year ago

ArtemOnanko:

This is a good question, I'll check the behavior, I suppose I'll need to mount --move /newroot/sys /sys back in order to check the refcnt in this case?

Generally, no you don't have to mount --move back after chroot. Even before chroot, mount --move is unnecessary. Instead try mount -o bind. Here is a scenario I am assuming.

(init script in initramfs)

mount branches
mount aufs (at /newroot)
mount --move everything
switch_root (to /newroot)

(after logging-in, sessionA)

some commands if you want

(no new aufs mount at all.)

for i in /sys /proc /dev /anything_you_want

do mount -o bind $i /another_newroot$i done

chroot /another_newroot

(in the chroot-ed shell)

some commands if you want

sysctl fs.file-nr

exit

(back to sessionA)

for i in /sys /proc /dev /anything_you_want

do umount /another_newroot$i done

sysctl fs.file-nr

J. R. Okajima

ArtemOnanko commented 1 year ago

Hi @sfjro,

I've built and tried the latest busybox 1.36 (I can share .config and inittab if required), however, this didn't change the aufs behavior after "chroot /newroot" command which was executed instead of "exec switch_root" as the last command in the busybox init script - the refcnt continue to grow after each "cat /sys/modules/aufs/refcnt" call.

Then I tried your scenario (with exec switch_root and a new busybox):

(after logging-in, sessionA) # some commands if you want (no new aufs mount at all.) # for i in /sys /proc /dev /anything_you_want do mount -o bind $i /another_newroot$i done # chroot /another_newroot

(in the chroot-ed shell) # some commands if you want # sysctl fs.file-nr # exit

(back to sessionA) # for i in /sys /proc /dev /anything_you_want do umount /another_newroot$i done # sysctl fs.file-nr

Also attaching sysrq dmesg logs from the chrooted shell: dmesg_another_newroot.txt

And from the sessionA after return and umount (/bin and /lib64 were unable to unmount): dmesg_return_sessionA.txt

The refcnt continues to grow even faster within the new chroot-ed shell with each "cat /sys/modules/aufs/refcnt" command.

Thanks, Artem

sfjro commented 1 year ago

ArtemOnanko:

I've built and tried the latest busybox 1.36 (I can share .config and inittab if required), however, this didn't change the aufs behavior after "chroot /newroot" command which was executed instead of "exec switch_root" as the last command in the busybox init script - the refcnt continue to grow after each "cat /sys/modules/aufs/refcnt" call.

No. The last command in your init script has to be "exec switch_root".

On my side, I cannot reproduce the issue. After "exit" from the chroot-ed shell, the first number of fs.file-nr is back (to the smaller number).

J. R. Okajima

sfjro commented 1 year ago

The last command in your init script has to be "exec switch_root".

Please try this patch.

apply the patch and rebuild your kernel
boot with "exec switch_root"
run "sysctl fs.file-nr; sleep 10; sysctl fs.file-nr"

The purpose of this patch is to find out which process keeps opening a file.

The patch produces many kernel logs containing a PID and a filename per a line. "sysctl fs.file-nr" is the trigger to produce the log, and after printing, the internal state is cleared. So the second "sysctl fs.file-nr" will print the PIDs who has opened any file during "sleep 10". I need to see what the second "sysctl fs.file-nr" prints. Only the second one. Send me the second log. If the log is too small, try "sleep 30" or longer.

Because this patch allocates memory every time any file is opened and frees the memory when "sysctl fs.file-nr" is executed (and the file is closed), long lived system without "sysctl fs.file-nr" MAY consume the memory too much and you MAY see another problem. So the debugging purpose only.

J. R. Okajima

sfjro commented 1 year ago

Please try this patch.

apply the patch and rebuild your kernel

boot with "exec switch_root"

run "sysctl fs.file-nr; sleep 10; sysctl fs.file-nr"

Again github ate the attachment. This time I include the patch as a plain text.

J. R. Okajima

diff --git a/fs/aufs/i_op.c b/fs/aufs/i_op.c index dd59340637cf..7c33fdb7466e 100644 --- a/fs/aufs/i_op.c +++ b/fs/aufs/i_op.c @@ -347,6 +347,7 @@ static int aufs_atomic_open(struct inode dir, struct dentry dentry, struct aopen_node aopen_node = { .file = file }; +extern void DBGopened(struct dentry *dentry);

IMustLock(dir);
AuDbg("open_flag 0%o\n", open_flag);

@@ -403,6 +404,7 @@ static int aufs_atomic_open(struct inode dir, struct dentry dentry, err = PTR_ERR(args.file); if (IS_ERR(args.file)) goto out_parent; +DBGopened(dentry);

bindex = au_dbtop(dentry);
err = au_aopen_or_create(dir, dentry, &args);

diff --git a/fs/file_table.c b/fs/file_table.c index d2fe19f88894..4b8d901c5d3a 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -33,6 +33,43 @@

include "internal.h"

+struct DBGpair {

struct hlist_bl_node hlist;
pid_t pid;
struct dentry *dentry; +}; +static struct hlist_bl_head DBGpairs;
+void DBGopened(struct dentry *dentry) +{
struct DBGpair *p;
p = kmalloc(sizeof(*p), GFP_NOFS);
if (p) {
//INIT_HLIST_BL_NODE(p->hlist);
p->dentry = dentry;
p->pid = current->pid;
hlist_bl_lock(&DBGpairs);
hlist_bl_add_head(&p->hlist, &DBGpairs);
hlist_bl_unlock(&DBGpairs);
} else
pr_err_once("kmalloc failed\n"); +} +EXPORT_SYMBOL_GPL(DBGopened); +void DBGclosed(struct dentry *dentry) +{
struct DBGpair *cur;
struct hlist_bl_node pos, tmp;
hlist_bl_lock(&DBGpairs);
hlist_bl_for_each_entry_safe(cur, pos, tmp, &DBGpairs, hlist) {
if (cur->dentry != dentry || cur->pid != current->pid)
continue;
hlist_bl_del(&cur->hlist);
kfree(cur);
}
hlist_bl_unlock(&DBGpairs); +}
/ sysctl tunables... / static struct files_stat_struct files_stat = { .max_files = NR_FILE @@ -85,6 +122,19 @@ static int proc_nr_files(struct ctl_table table, int write, void buffer, size_t lenp, loff_t ppos) { files_stat.nr_files = get_nr_files();
{
struct DBGpair *cur;
struct hlist_bl_node pos, tmp;
hlist_bl_lock(&DBGpairs);
hlist_bl_for_each_entry_safe(cur, pos, tmp, &DBGpairs, hlist) {
//pr_info("pid: %d\n", cur->pid);
pr_info("pid=%d, %pd2\n", cur->pid, cur->dentry);
hlist_bl_del(&cur->hlist);
kfree(cur);
}
hlist_bl_unlock(&DBGpairs);
} return proc_doulongvec_minmax(table, write, buffer, lenp, ppos); }

@@ -231,6 +281,7 @@ static struct file alloc_file(const struct path path, int flags, if (IS_ERR(file)) return file;

+DBGopened(path->dentry); file->f_path = path; file->f_inode = path->dentry->d_inode; file->f_mapping = path->dentry->d_inode->i_mapping; @@ -328,6 +379,7 @@ static void __fput(struct file file) put_write_access(inode); mnt_drop_write(mnt); } +DBGclosed(dentry); dput(dentry); if (unlikely(mode & FMODE_NEED_UNMOUNT)) dissolve_on_fput(mnt); @@ -415,6 +467,7 @@ void init files_init(void) filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL); percpu_counter_init(&nr_files, 0, GFP_KERNEL); +INIT_HLIST_BL_HEAD(&DBGpairs); }

/ diff --git a/fs/namei.c b/fs/namei.c index 7a5992805583..82d7095249d0 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3639,6 +3639,7 @@ static struct file path_openat(struct nameidata nd, { struct file file; int error; +extern void DBGopened(struct dentry *dentry);

file = alloc_empty_file(op->open_flag, current_cred());
if (IS_ERR(file))

@@ -3658,6 +3659,7 @@ static struct file path_openat(struct nameidata nd, terminate_walk(nd); } if (likely(!error)) { +DBGopened(file->f_path.dentry); if (likely(file->f_mode & FMODE_OPENED)) return file; WARN_ON(1); diff --git a/fs/open.c b/fs/open.c index 50d04ab40175..0cdd4c5a78f2 100644 --- a/fs/open.c +++ b/fs/open.c @@ -982,6 +982,7 @@ int vfs_open(const struct path path, struct file file) return do_dentry_open(file, d_backing_inode(path->dentry), NULL); }

+extern void DBGopened(struct dentry dentry); struct file dentry_open(const struct path path, int flags, const struct cred cred) { @@ -1000,6 +1001,7 @@ struct file dentry_open(const struct path path, int flags, fput(f); f = ERR_PTR(error); } +else DBGopened(f->f_path.dentry); } return f; } @@ -1043,6 +1045,7 @@ struct file dentry_create(const struct path path, int flags, umode_t mode, fput(f); return ERR_PTR(error); } +else DBGopened(f->f_path.dentry); return f; } EXPORT_SYMBOL(dentry_create); @@ -1060,6 +1063,7 @@ struct file open_with_fake_path(const struct path path, int flags, fput(f); f = ERR_PTR(error); } +else DBGopened(f->f_path.dentry); } return f; }

ArtemOnanko commented 1 year ago

Hi @sfjro,

Please try this patch. - apply the patch and rebuild your kernel - boot with "exec switch_root" - run "sysctl fs.file-nr; sleep 10; sysctl fs.file-nr"

here is the log attached: dmesg_file-nr_patch.txt

Thanks, Artem

sfjro commented 1 year ago

ArtemOnanko:

here is the log attached: dmesg_file-nr_patch.txt

Thanx.

[ 69.311428] pid=201, fs/file-nr [ 69.311437] pid=201, lib64/libc-2.17.so [ 69.311439] pid=201, lib64/libc-2.17.so [ 69.311440] pid=201, lib64/libresolv-2.17.so [ 69.311442] pid=201, lib64/libresolv-2.17.so [ 69.311444] pid=201, lib64/libm-2.17.so [ 69.311446] pid=201, lib64/libm-2.17.so [ 69.311447] pid=201, lib64/ld-2.17.so [ 69.311449] pid=201, lib64/ld-2.17.so [ 69.311450] pid=201, bin/busybox [ 69.311451] pid=201, bin/busybox [ 69.311453] pid=200, lib64/libc-2.17.so [ 69.311455] pid=200, lib64/libc-2.17.so [ 69.311456] pid=200, lib64/libresolv-2.17.so [ 69.311458] pid=200, lib64/libresolv-2.17.so [ 69.311459] pid=200, lib64/libm-2.17.so [ 69.311461] pid=200, lib64/libm-2.17.so [ 69.311462] pid=200, lib64/ld-2.17.so [ 69.311464] pid=200, lib64/ld-2.17.so [ 69.311465] pid=200, bin/busybox [ 69.311466] pid=200, bin/busybox

According to the log,

in this 10 seconds, two processes opened some files.
pid 201 must be "sysctl fs.file-nr" because it opens fs/file-nr. And these pid=201 lines shoule be all gone when you run "sysctl fs.file-nr" next time. Similar lines should appear, but pid must be different from 201.
what is pid 200? Is your "sysctl" implemented as busybox applet? If so, it is OK. It means that in this 10 seconds, there is no mysterious file open. The first number of fs.file-nr should not be increased.

Your two outputs from fs.file-nr (before and after 10 seconds) are identical, right? If so, I'd ask you to one more test, sysctl fs.file-nr; sleep 30; sysctl fs.file-nr; sysctl fs.file-nr

J. R. Okajima

ArtemOnanko commented 1 year ago

@sfjro ,

Yes, sysctl is a busybox applet. I'm not sure what is pid 200, I'll track it down. The two outputs for the 10 seconds case were the same, however, the outputs for 30 seconds were different :

The log is attached: dmesg_file-nr_patch_30.txt

Artem

sfjro commented 1 year ago

ArtemOnanko:

Yes, sysctl is a busybox applet. I'm not sure what is pid 200, I'll track it down.

I guess pid 200 busybox forks itself and creates pid 201. It is to run sysctl applet. I don't think it a problem.

The two outputs for the 10 seconds case were the same, however, the outputs for 30 seconds were different :

?? But the debug log doesn't show that no files were opened in 30 seconds. Why the file-nr number increased?? Who touched fs/file-table.c:nr_files counter? Is your kernel really plain linux-v5.19.15 + aufs5.19?

J. R. Okajima

ArtemOnanko commented 1 year ago

Yes, I have plain linux 5.19.15+ busybox 1.36+ aufs5.19.

I don't know how to explain fs/file-table.c:nr_files counter behavior.

I've tried calling sysctl fs.file-nr more than 2 times and it looks like there is some pattern - fs/file-table.c:nr_files counter increases after a few sysctl calls: dmesg_file-nr_10_mult.txt

However, this pattern could be broken by calling something else:

This behavior is the reason why I opened this issue. My initial guess was that the reason for refcnt incrementation is that somewhere in Kernel sources aufs kobject is incremented and this leads (somehow, indirectly) to file-nr growth ( lsof was not showing any new files opened so that growth should come from inside the Kernel), but I'm not a kernel developer.

Any ideas, on what else can I do to narrow down the issue?

Here attaching "strace sysctl" from 4.14 and 5.19 for comparison, just in case: strace_sysctl_4.14.txt strace_sysctl_5.19.txt

Thanks for your help, Artem

sfjro commented 1 year ago

ArtemOnanko:

Any ideas, on what else can I do to narrow down the issue?

Please try inserting one line to fs/file_table.c.

121 static int proc_nr_files(struct ctl_table table, int write, void buffer, 122 size_t lenp, loff_t ppos) 123 { 124 files_stat.nr_files = get_nr_files(); 125 { 126 struct DBGpair cur; 127 struct hlist_bl_node pos, *tmp; 128
-->129 files_stat.nr_files = percpu_counter_sum_positive(&nr_files); 130 hlist_bl_lock(&DBGpairs);

By adding percpu_counter_sum_positive(), the counter value becomes accurate. and sysctl fs.file-nr shows the accurate number. And I hope the number stop increasing mysteriously. If it stops, then it means you need to increase fs.file-max manually.

J. R. Okajima

ArtemOnanko commented 1 year ago

Hi @sfjro,

The counter doesn't stop, unfortunately: dmesg.txt

Artem

sfjro commented 1 year ago

ArtemOnanko:

The counter doesn't stop, unfortunately:

Hmm, but the number increases by 20 precisely. Is it a good sign? I don't know.

I'd suggest you to increase fs.file-max much larger, and try your normal operation as a test. Let's see when the file-nr number decreases, shall we?

J. R. Okajima

ArtemOnanko commented 1 year ago

Hi @sfjro,

I've increased file-max on my test server to 9223372036854775807 and wait for the result.

BTW, sysctl, when called manually without sleep, results in the increase of file-nr by 10:

The increase by 20 was because both sysctl and sleep are the busybox calls:

Calling any other process results in a different file-nr increase, for example:

Instead of 535 we have 531 here.

It looks like /sbin/module/aufs/refcnt (or aufs Used by column value in lsmod output) number is increased after each new process is called by the number of the process's open syscalls:

lsmod process

aufs refcnt is increased by 3

sysctl process

aufs refcnt is increased by 5 (for sysctl) + 3 (for lsmod) = 8

Meanwhile, it looks file-nr is growing two times faster than refcnt:

sysctl case

file-nr growth equals 5*2 = 10

lsmod case

file-nr growth equals 3x2 (for lsmod) + 5x2 (for sysctl) = 16

Maybe this is just a coincidence, but I hope this could help in understanding the issue.

Regards, Artem

sfjro commented 1 year ago

ArtemOnanko:

I've increased file-max on my test server to 9223372036854775807 and wait for the result.

It seems that you failed increasing coz the last number of fs.file-nr is 812999. (But it is not a big problem now)

BTW, sysctl, when called manually without sleep, results in the increase of file-nr by 10: ::: The increase by 20 was because both sysctl and sleep are the busybox calls:

I see.

Meanwhile, it looks file-nr is growing two times faster than refcnt:

If you open one aufs (virtual) file, aufs opens one real file on branch fs internally. So it becomes two.

When any process exits, the kernel frees all resources which are acquired and used by the process. In other words, all files the process has opened are closed and any resources to the file are freeed, unless the file is shared by other process(es). It is a very basic behaviour but your system doesn't behave such like that. I don't know what is happening.

J. R. Okajima

sfjro commented 1 year ago

"J. R. Okajima":

When any process exits, the kernel frees all resources which are acquired and used by the process. In other words, all files the process has opened are closed and any resources to the file are freeed, unless the file is shared by other process(es). It is a very basic behaviour but your system doesn't behave such like that. I don't know what is happening.

If you can, please apply this patch and see whether a kernel warning is produced or not.

J. R. Okajima

diff --git a/fs/aufs/file.c b/fs/aufs/file.c index 283c3413e09b..3cc171151966 100644 --- a/fs/aufs/file.c +++ b/fs/aufs/file.c @@ -86,6 +86,7 @@ struct file au_h_open(struct dentry dentry, aufs_bindex_t bindex, int flags, au_lcnt_inc(&br->br_nfiles); h_path.dentry = h_dentry; h_path.mnt = au_br_mnt(br); +WARN_ON_ONCE(!!(flags & FMODE_NOACCOUNT)); h_file = vfsub_dentry_open(&h_path, flags); if (IS_ERR(h_file)) goto out_br; diff --git a/fs/file_table.c b/fs/file_table.c index 8bc937d1aea7..5167f741af17 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -236,6 +236,7 @@ struct file alloc_empty_file(int flags, const struct cred cred) }

f = __alloc_file(flags, cred);

+WARN_ON_ONCE(!!(flags & FMODE_NOACCOUNT)); if (!IS_ERR(f)) percpu_counter_inc(&nr_files);

ArtemOnanko commented 1 year ago

Hi @sfjro ,

I've recompiled both Kernel and aufs.ko with these patches - no warnings are observed.

Regards, Artem

sfjro commented 1 year ago

ArtemOnanko:

I've recompiled both Kernel and aufs.ko with these patches - no warnings are observed.

Thanx. It is a progress anyway.

May I continue sending a patch and ask you to test? This patch prints an opened file's reference counter of a process named "sleep". The number should be 1 in the end, and the file is freed. such like this.

$ sleep 1 fput:435:sleep[1103]: bin/sleep, ext4, 4 fput:435:sleep[1103]: x86_64-linux-gnu/ld-2.28.so, ext4, 3 fput:435:sleep[1103]: x86_64-linux-gnu/ld-2.28.so, ext4, 5 fput:435:sleep[1103]: bin/sleep, ext4, 6 fput:435:sleep[1103]: etc/ld.so.cache, ext4, 3 fput:435:sleep[1103]: etc/ld.so.cache, ext4, 2 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 3 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 6 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 6 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 7 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 7 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 8 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 8 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 7 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 6 fput:435:sleep[1103]: etc/ld.so.cache, ext4, 1 fput:435:sleep[1103]: bin/sleep, ext4, 6 fput:435:sleep[1103]: bin/sleep, ext4, 5 fput:435:sleep[1103]: bin/sleep, ext4, 4 fput:435:sleep[1103]: bin/sleep, ext4, 3 fput:435:sleep[1103]: bin/sleep, ext4, 2 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 6 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 5 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 4 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 3 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 2 fput:435:sleep[1103]: x86_64-linux-gnu/libc-2.28.so, ext4, 1 fput:435:sleep[1103]: x86_64-linux-gnu/ld-2.28.so, ext4, 5 fput:435:sleep[1103]: x86_64-linux-gnu/ld-2.28.so, ext4, 4 fput:435:sleep[1103]: x86_64-linux-gnu/ld-2.28.so, ext4, 3 fput:435:sleep[1103]: x86_64-linux-gnu/ld-2.28.so, ext4, 2 fput:435:sleep[1103]: x86_64-linux-gnu/ld-2.28.so, ext4, 1 fput:435:sleep[1103]: bin/sleep, ext4, 1 put_files_struct:450:sleep[1103]: files->count 1

On your system, the number may not reach to 1, I am afraid. Please try.

J. R. Okajima

diff --git a/fs/file.c b/fs/file.c index 3bcc1ecc314a..55b2b8e7c9a2 100644 --- a/fs/file.c +++ b/fs/file.c @@ -443,6 +443,11 @@ static struct fdtable close_files(struct files_struct files)

void put_files_struct(struct files_struct *files) { +if (!strcmp(current->comm, "sleep"))

pr_info("%s:%d:%.*s[%d]: files->count %d\n",
func, LINE,
(int)sizeof(current->comm), current->comm, current->pid,
atomic_read(&files->count)); if (atomic_dec_and_test(&files->count)) { struct fdtable *fdt = close_files(files);

diff --git a/fs/file_table.c b/fs/file_table.c index 5167f741af17..e2183499134f 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -425,6 +425,14 @@ static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);

void fput(struct file *file) { +if (!strcmp(current->comm, "sleep")

&& file_inode(file)
&& S_ISREG(file_inode(file)->i_mode))
pr_info("%s:%d:%.*s[%d]: %pD2, %s, %ld\n",
func, LINE,
(int)sizeof(current->comm), current->comm, current->pid,
file, file_inode(file)->i_sb->s_type->name,
atomic_long_read(&file->f_count)); if (atomic_long_dec_and_test(&file->f_count)) { struct task_struct *task = current;

sfjro commented 1 year ago

"J. R. Okajima":

This patch prints an opened file's reference counter of a process named "sleep". The number should be 1 in the end, and the file is freed. such

I forgot that your "sleep" is "busybox" actually. Please replace "sleep" in the patch by "busybox" when you apply.

On your system, the output will be larger than mine. So running "sleep 1" twice will help identifying the target output, I hope.

J. R. Okajima

ArtemOnanko commented 1 year ago

Hello, @sfjro

May I continue sending a patch and ask you to test?

Yes, sure.

Should I also keep all previous patches applied? (sysctl fs.file-nr, percpu_counter_sum_positive)

Thanks, Artem

sfjro commented 1 year ago

ArtemOnanko:

Should I also keep all previous patches applied? (sysctl fs.file-nr, percpu_counter_sum_positive)

In the meantime, please keep them applied.

Thanks, J. R. Okajima

ArtemOnanko commented 1 year ago

@sfjro ,

I've called "sleep 1" twice. Pls see the log attached: dmesg_sleep.txt

Artem

sfjro commented 1 year ago

ArtemOnanko:

I've called "sleep 1" twice. Pls see the log attached: dmesg_sleep.txt

Thank you very much. The result is rather interesting. It leads me to the aufs patch files, and I noticed you didn't apply aufs5-kbuild.patch. Try applying aufs5-kbuild.patch and enable CONFIG_AUFS_FS.

J. R. Okajima

sfjro commented 1 year ago

"J. R. Okajima":

Try applying aufs5-kbuild.patch and enable CONFIG_AUFS_FS.

Ah, I forgot one thing. After enabling CONFIG_AUFS_FS, you need to rebuild your kernel.

J. R. Okajima

ArtemOnanko commented 1 year ago

Hi @sfjro,

Thank you very much for your investigation! I was able to fix the undesired behavior. Since I've been following the second approach to build stand-alone aufs (no need to apply aufs5-kbuild.patch, nor copying source files to your kernel source tree) the CONFIG_AUFS_FS was not provided to the Kernel thus leading to misbehavior of the aufs5-mmap.patch which depends on this config option: #if IS_ENABLED(CONFIG_AUFS_FS)

Regards, Artem

sfjro commented 1 year ago

ArtemOnanko:

Thank you very much for your investigation! I was able to fix the undesired behavior. Since I've been following the second approach to build stand-alone aufs (no need to apply aufs5-kbuild.patch, nor copying source files to your kernel source tree) the CONFIG_AUFS_FS was not provided to the Kernel thus leading to misbehavior of the aufs5-mmap.patch which depends on this config option: #if IS_ENABLED(CONFIG_AUFS_FS)

Exactly. The document is wrong, and I am going to fix it. Thank you for your patience and the repeated tests.

J. R. Okajima

sfjro commented 1 year ago

------- Blind-Carbon-Copy

From: "J. R. Okajima" @.> To: @. Subject: aufs5 and aufs6 GIT release (v6.2-r1) MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-ID: @.> Date: Mon, 27 Feb 2023 02:09:28 +0900 Message-ID: @.>

o bugfix

- replace CONFIG_AUFS_FS by 1 (always enabled) in aufs[56]-mmap.patch, reported by ArtemOnanko (on github.com)

o news

- new aufs6.2 branch follows linux-v6.2.

J. R. Okajima

- aufs5-linux.git nothing new
- aufs5-standalone.git nothing new except aufs5-mmap.patch
- aufs6-linux.git nothing new except aufs6.2
- aufs6-standalone.git nothing new except aufs6.2 and aufs6-mmap.patch
- aufs-util.git nothing

------- End of Blind-Carbon-Copy

sfjro / aufs-standalone

Possible bug in aufs5.19: /sys/module/aufs/refcnt overflow #24

Now we need to switch to a new root. Originally the problem was noticed a= fter the busybox switch_root command:=0D =0D exec switch_root -c /dev/console /newroot /sbin/init $@=0D =0D however, the problematic behavior could be seen after chroot command as w= ell:=0D =0D chroot /newroot=0D

some commands if you want

for i in /sys /proc /dev /anything_you_want

chroot /another_newroot

some commands if you want

sysctl fs.file-nr

exit

for i in /sys /proc /dev /anything_you_want

sysctl fs.file-nr

include "internal.h"