koverstreet / bcachefs

Other
680 stars 70 forks source link

Calling readdir always gives EOVERFLOW on 32-bit systems #650

Open Architector4 opened 8 months ago

Architector4 commented 8 months ago

Been enjoying bcachefs (genuinely, thank you for the good work!), and then pulling hair for the past day due to this preventing DXVK on Wine from getting Vulkan drivers when running 32bit Windows programs.

It appears that using <dirent.h> functions, in particular opendir and readdir, errors out when run by an x86 program on x86_64. For all I know this may be an issue for x86 in general, but I didn't test that.

This seems to happen only on mounted bcachefs, both on latest updated Arch Linux and Fedora 39, both on stock Linux v6.7.3, even with a simple fresh fallocate -l 512M test.img; bcachefs format test.img; mount -o loop test.img /mnt; touch /mnt/{1..10} filesystem.

Doesn't seem to happen with tmpfs or ext4.

test.c ```C #include #include #include #include int main(int argc, char **argv) { if(argc != 2) { printf("usage: %s \n", argv[0]); return 2; } DIR* thedir = opendir(argv[1]); if(thedir == NULL) { printf("could not opendir\n"); return 1; } while (1) { errno = 0; struct dirent* entry = readdir(thedir); if(entry == NULL) { if(errno != 0) printf("error %d on readdir\n", errno); return 0; } printf("name: %s\n", entry->d_name); } return 0; } ```
test.sh ```shell #!/bin/sh gcc test.c -m32 -ot32 gcc test.c -ot64 echo "32bit:" ./t32 / echo "64bit:" ./t64 / ```

Returns an errno of EOVERFLOW 75 Value too large for defined data type.

Output when running tesh.sh ``` $ ./do.sh 32bit: name: . error 75 on readdir 64bit: name: . name: .. name: backup name: tmp name: sys name: var name: opt name: proc name: usr name: home name: srv name: etc name: lost+found name: boot name: sbin name: nobackup name: media name: root name: bin name: lib name: lib64 name: mnt name: run name: dev ```
Strace log for the errorring binary ```strace $ strace ./t32 / execve("./t32", ["./t32", "/"], 0x7ffed07184b8 /* 62 vars */) = 0 [ Process PID=57021 runs in 32 bit mode. ] strace: WARNING: Proper structure decoding for this personality is not supported, please consider building strace with mpers support enabled. brk(NULL) = 0x66cfd000 mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe9e17000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3 statx(3, "", AT_STATX_SYNC_AS_STAT|AT_NO_AUTOMOUNT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=304575, ...}) = 0 mmap2(NULL, 304575, PROT_READ, MAP_PRIVATE, 3, 0) = 0xe9dcc000 close(3) = 0 openat(AT_FDCWD, "/usr/lib32/libc.so.6", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3 read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0000\r\2\0004\0\0\0"..., 512) = 512 statx(3, "", AT_STATX_SYNC_AS_STAT|AT_NO_AUTOMOUNT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0755, stx_size=2239240, ...}) = 0 mmap2(NULL, 2250556, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xe9ba6000 mmap2(0xe9bc5000, 1585152, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1f000) = 0xe9bc5000 mmap2(0xe9d48000, 507904, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a2000) = 0xe9d48000 mmap2(0xe9dc4000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x21d000) = 0xe9dc4000 mmap2(0xe9dc7000, 18236, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xe9dc7000 close(3) = 0 set_thread_area({entry_number=-1, base_addr=0xe9e18440, limit=0x0fffff, seg_32bit=1, contents=0, read_exec_only=0, limit_in_pages=1, seg_not_present=0, useable=1}) = 0 (entry_number=12) set_tid_address(0xe9e184a8) = 57021 set_robust_list(0xe9e184ac, 12) = 0 rseq(0xe9e188e0, 0x20, 0, 0x53053053) = 0 mprotect(0xe9dc4000, 8192, PROT_READ) = 0 mprotect(0x65f9b000, 4096, PROT_READ) = 0 mprotect(0xe9e51000, 8192, PROT_READ) = 0 ugetrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0 munmap(0xe9dcc000, 304575) = 0 openat(AT_FDCWD, "/", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 3 statx(3, "", AT_STATX_SYNC_AS_STAT|AT_NO_AUTOMOUNT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=STATX_ATTR_MOUNT_ROOT, stx_mode=S_IFDIR|0755, stx_size=0, ...}) = 0 getrandom("\x7c\x74\xae\x85", 4, GRND_NONBLOCK) = 4 brk(NULL) = 0x66cfd000 brk(0x66d1e000) = 0x66d1e000 brk(0x66d1f000) = 0x66d1f000 getdents64(3, 0x66cfd1bc /* 24 entries */, 32768) = 616 _llseek(3, 1, [1], SEEK_SET) = 0 statx(1, "", AT_STATX_SYNC_AS_STAT|AT_NO_AUTOMOUNT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFCHR|0620, stx_size=0, ...}) = 0 write(1, "name: .\n", 8name: . ) = 8 getdents64(3, 0x66cfd1bc /* 23 entries */, 32768) = 592 write(1, "error 75 on readdir\n", 20error 75 on readdir ) = 20 exit_group(0) = ? +++ exited with 0 +++ ```
DarkKirb commented 8 months ago

This is very likely a problem with the libc which would require an ABI break to fix

koverstreet commented 8 months ago

Try the 32 bit inodes option

DarkKirb commented 8 months ago

thought this was the default, but yeah that’s the only thing that can reasonably overflow in the dirent struct

Architector4 commented 8 months ago

Try the 32 bit inodes option

Doesn't help, same issue even directly after bcachefs format --inodes_32bit test.img and mounting

(also, how do you unset an option? i did it on my / too and remounted and I'm unsure if I want that because it didn't help lol)

(edit: nevermind, it appears to be default anyway)

DarkKirb commented 8 months ago

We are working together on debugging the issue and it appears like it’s the d_off field that overflows:

image

Architector4 commented 8 months ago

(mistyped d_off as d_d_off here, whoops)

koverstreet commented 8 months ago

Try changing opts/str_hash to crc32c, then create a new directory and see if it works there

Architector4 commented 8 months ago

Try changing opts/str_hash to crc32c, then create a new directory and see if it works there

Doesn't solve the issue unless you're lucky, it seems.

I did this setup:

bcachefs format --str_hash=crc32c test.img
sudo mount -o loop test.img /mnt
sudo touch /mnt/{1..65535}

Trying with the test program on 32bit I get a lot of files showing up, but the issue still appears:

...

name: 23966
d_ino: 1610618570
d_off: 2147421063

name: 19679
d_ino: 1610614539
d_off: 2147477862

error 75 on readdir

On 64bit it continues instead, and the next entry that fails to show up is:

name: 5132
d_ino: 1342178803
d_off: 2147734201

That d_off is 250553 more than 2**31. d_off is defined as type t_off which is a signed integer; I guess that causes this here lol

koverstreet commented 8 months ago

I can add a new str_hash function that limits the result to 2**31, but I'd really like to track down exactly where the -EOVERFLOW comes from to confirm - have you come across that yet?

In fs/readdir.c I only see -EOVERFLOW being returned when d_ino doesn't fit, curious.

Architector4 commented 8 months ago

Jumped around with gdb, and it appears to happen in glibc (2.39 on Arch here), in sysdeps/unix/sysv/linux/getdents.c, in function __getdents.

First time it reaches that line it returns normally. Second time, within the same readdir call from my program, it branches at the if statement at line 89, but last_offset is 1, and so returns with the . entry, and that's what I get. Then after another readdir call it both branches there and last_offset is -1, and so it goes to the EOVERFLOW.

I don't know what's the definitive place to link glibc code, but I wandered to here from GNU's site: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/getdents.c;h=9ac6a4d90db01ccc58484af2a3584bc6583ab1f7;hb=HEAD#l89

  87       DIRENT_SET_DP_INO (&outp->u, d_ino);
  88       outp->u.d_off = d_off;
  89       if ((sizeof (outp->u.d_ino) != sizeof (inp->k.d_ino)
  90            && outp->u.d_ino != d_ino)
  91           || (sizeof (outp->u.d_off) != sizeof (inp->k.d_off)
  92               && outp->u.d_off != d_off))
  93         {
  94           /* Overflow.  If there was at least one entry before this one,
  95              return them without error, otherwise signal overflow.  */
  96           if (last_offset != -1)
  97             {
  98               __lseek64 (fd, last_offset, SEEK_SET);
  99               return outp->b - buf;
 100             }
 101           return INLINE_SYSCALL_ERROR_RETURN_VALUE (EOVERFLOW);
 102         }

Both outp and inp are unions of dirent u and dirent64 k. The code appears to store the data in 64bit values, then check for 32bit overflow here, and then copy them to the dirent that's being output if the check passes.

Architector4 commented 8 months ago

I can add a new str_hash function that limits the result to 2**31,

Also, I want to mention that I'd prefer there to be some kind of a solution to this that would work out of the box with default settings.

In my case it very unintuitively prevented 32-bit Windows games from Wine launching with DXVK installed, which is a fairly likely setup that many Linux users playing videogames would run into, especially considering Steam (a popular game distribution platform) uses Wine with DXVK by default for all Windows games, and many games are still 32-bit.

I feel like it would be hard to communicate to every person who would run into this issue in this angle to select the 31-bit hash function for bcachefs before installing/reinstalling the 32bit Vulkan drivers, especially if bcachefs starts being used by non tech savvy users, maybe from distros adopting it.

...Or maybe it would be possible to get the developers of libvulkan.so to use different functions for this one place, but I don't know how possible that would be, and the issue would likely still trip people up in the future from the most random 32bit programs/libraries, leaving them a good weekend of debugging to figure out like this lol

koverstreet commented 8 months ago

yeah I sympathize, this is a tricky issue. I can't promise that we'll be able to make it work out of box; putting in arbitrary limits on directory size is not a great outcome either. But we can definitely look at what the other filesystems have done and see if anyone's found an acceptable compromise/hack

DarkKirb commented 8 months ago

In fs/readdir.c I only see -EOVERFLOW being returned when d_ino doesn't fit, curious.

The kernel doesn’t return -EOVERFLOW, according to the strace above, some place in glibc it sets errno to EOVERFLOW

DarkKirb commented 8 months ago

one solution (idk how sound it would be) would be to mask d_off to fit in 31 bits. This would break with more than 2 billion directory entries in a directory, but i presume you’d have worse problems at that point.

Damenly commented 8 months ago

yeah I sympathize, this is a tricky issue. I can't promise that we'll be able to make it work out of box; putting in arbitrary limits on directory size is not a great outcome either. But we can definitely look at what the other filesystems have done and see if anyone's found an acceptable compromise/hack

For btrfs side, there are two items: DIR_ITEM and DIR_INDEX. The DIR_ITEM offset is hash of file name like bcachefs. DIR_INDEX offset starts from 2 which is assigned to d_off and the index can be reused if it's free. So it won't be a problem for btrfs.

However, IIUC bcachefs uses 64-bit hash in dirent key offset so It can happen the first dirent offset overflows u32 max.

leahneukirchen commented 8 months ago

I think -D_FILE_OFFSET_BITS=64 should fix this for 32-bit userland, and should always be used these days.

epowlison commented 7 months ago

I can add a new str_hash function that limits the result to 2**31 [...]

If you do end up going the route of a new hash function, I'd suggest a 64-bit sign-extented CRC32C! That'd get a full 32-bits through the check by glibc.

koverstreet commented 7 months ago

Could you elaborate?

epowlison commented 7 months ago

The comparison that 'glibc' is doing to check for truncation in d_off, is being done as a signed comparison with a int64_t and an off_t - meaning the 32-bit off_t gets sign-extended to 64-bits and compared with the original value. So, we match that same operation, the overflow check passes, and we don't have to drop any bits. Not a huge help, but if a new hash function's needed, that's the way to go.

On the opposite side of the problem, if glibc was doing an unsigned check, it'd be working now with any 32-bit hash function.

epowlison commented 7 months ago

Having peaked, prodded, replicated, and tested some patches against it, I think the code in glibc's __getdents(), readdir(), telldir(), and seekdir() is all absolutely correct for 32-bit code accessing a 64-bit kernel -- even though the behavior wouldn't exist for 32-bit code on a 32-bit kernel.

It all comes down to where d_off actually gets used, in seekdir() -- it translates directly to an lseek() on the directory, which means the 32-bit d_off cookie will get sign-extended when it gets handed back to the kernel. Both in the case of glibc's large file support, and in the case of the kernel's own 32-bit syscall ABI.

This means that, while it's possible 'glibc' is one of the few libraries that will detect the truncation and throw error, the issue will effect all 32-bit programs and libraries that don't make use of 64-bit directory offsets, no matter the language.

Here's the testing program I've written, avoiding 'glibc's own directory functions and using the kernel ABI directly. It implements a directory listing, and a seek test for the first file with a d_off over 31 bits. I've compiled it to both 64 bits and 32 bits, and tested it on a fresh and plain bcachefs volume with --str_hash=crc32c, containing dummy files 'a-z'.

On 64 bits, it's fully successful, in listing and in seeking:

$ ./dirent64 /mnt
listing '/mnt'
           d_ino            d_off d_name
            1000                1 .
            1000          8C2A0C0 ..
        4000000D         1B925334 m
        4000000E         2D6D73B3 n
        40000018         2E634728 x
[...]
seeking '/mnt'
           d_ino            d_off d_name
        40000016         821BF80F v
        40000005         89972D8C e
        40000000         914B0BFB lost+found
        40000006         92453F60 f
[...]

On 32 bits, it can correctly produce a listing, even for files with an offset above 31-bits, but fails to seek:

$ ./dirent32 /mnt
listing '/mnt'
   d_ino    d_off d_name
    1000        1 .
    1000  8C2A0C0 ..
4000000D 1B925334 m
4000000E 2D6D73B3 n
[...]
40000016 821BF80F v
40000005 89972D8C e
40000000 914B0BFB lost+found
40000006 92453F60 f
[...]
seeking '/mnt'
seek failed

That's with zero type conversion -- d_off is read in as an unsigned long as specified by the ABI, the rewind token it gets copied off to is an unsigned long, and the varargs on the syscall should mean that it gets passed off as an unsigned long too. More or less, we should be handing the kernel back the exact value it gave us, and it's making a failed seek.

I'll also note that this is the same behavior we get from glibc's own functions, if we remove the truncation check. seekdir()'s lseek() call gets translated to lseek64() wherever the interface is available, with sign-extension. Somebody should definitely check if seekdir() works on a 32-bit kernel -- it might not work either.

...

I'm no filesystem dev, but my bet is the easiest solution would be to make the CRC32C hash hand-off sign-extended 64-bit cookies for dirents, and make the directory seeks truncate 64-bit cookies back down to 32-bit for directories with that hash. If that works, then no data would need to change on any existing disks, but 32-bit programs would work on CRC32C 'str_hash' filesystems with 64-bit kernels (including glibc's functions). It's still not an 'out-of-the-box' solution, but it'd certainly help I think!

EDIT: I'm realizing it's possible that seeks to absolute negative offsets are filtered out early in the syscall for SYS_lseek (which the wording on the manual page implicitly suggests), which might nullify that solution. Definitely a little out of my domain!

EDIT: That might be the case judging by the 63-bit directory offsets I'm seeing elsewhere lol. A 31-bit hash might be the way to go. Apologies!

Architector4 commented 7 months ago

I want to clarify that I never tested any 32bit Linux kernel with bcachefs for this issue. The title is about experiencing the issue only on 32bit applications, not 64bit, but I have no idea if the issue exists on a fully 32bit system.

epowlison commented 7 months ago

Understood! It's just a predicted new issue that might come from the same source of trouble. I don't know yet either whether or not it'll be the case, but it's certainly something to keep in mind.

epowlison commented 7 months ago

Issue upgrade -- effects native x86 as well. Any hash above 31 bits on x86 will give readdir(3) an EOVERFLOW, lseek(2) EINVAL, and break seekdir(3). -D_FILE_OFFSET_BITS=64 cannot fix seekdir(3), since POSIX specifies its type to be long rather than off_t. Whether or not readdir(3) errors out will be based on whether or not Linux is able to provide 64-bit directory entries on an architecture, since that's what glibc always prefers if it's available. Likely, this effects all 32-bit architectures.

epowlison commented 7 months ago

As for finding productive ways to fix this issue, I've noticed that the ext4 implementation breaks apart its hashes into a 'major:minor' of 32-bit hashes. Both are handed out at once to programs calling from a 64-bit API -- they can seek to any file exactly -- and just the major hash is sent to 32-bit programs, and are left-shifted back up to 64 bits and made to be inexact seeks. There doesn't seem to be any requirement that d_off needs to be unique and exact, so by producing duplicate d_off's it looks like it might very simply still be able to access more than 2 billion directory entries on 32-bit.

If that's a correct reading of the code, then this might honestly be a relatively easily adapted and adequate solution.

epowlison commented 7 months ago

Apologies for channel spam, but it's confirmed! -- While ext4 uses 63-bit hashes internally, it detects 32-bit programs, and translates in and out just the top 31 bits of the hash. Seeks are simply inexact, and duplicate d_off's are allowed and handed out.

# touch {1..100000}
# dirent32 . | awk '{print $2}' | uniq -D
61D22D3
61D22D3
9D1ABFF
9D1ABFF
D452E9B
D452E9B
2261440B
2261440B
5538EE49
5538EE49
# dirent64 . | awk '{print $2}' | uniq -D
#

This means we have a very simple model to apply in order to resolve this issue and get 32-bit working out of the box, without placing any limits on the content of directories.