tytso / e2fsprogs

Ext2/3/4 file system utilities
http://ext4.wiki.kernel.org

mke2fs can spin forever when creating very large (e.g. 256PB) file systems #95

Open DemiMarie opened 2 years ago

DemiMarie commented 2 years ago

mke2fs can’t handle a 7EiB image (see https://rwmj.wordpress.com/2020/11/04/ridiculously-big-files/ for how to make such a thing). mkfs.btrfs finishes promptly.

DemiMarie commented 2 years ago

To be clear, this is absolutely a pathological case and I would be fine with you closing this as “won’t fix”, but it would be nice for mke2fs to state how much of the (obviously thin-provisioned) storage it will actually use.

tytso commented 2 years ago

Hmm, I'm not able to reproduce it using a loop device:

  # mkfs.xfs -f /dev/cwcc/scratch
  # mount /dev/cwcc/scratch /mnt
  # truncate -s 7E /mnt/foo.img
  # losetup /dev/loop0 /mnt/foo.img
  # blockdev --getsz /dev/loop0
  15762598695796736  # 7 EiB in units of 512-byte sectors
  # mkfs.ext4 /dev/loop0
  mke2fs 1.46.4 (18-Aug-2021)
  mkfs.ext4: Size of device (0x7000000000000 blocks) /dev/loop0 too big to create
        a filesystem using a blocksize of 4096.

I also can't reproduce this by running mke2fs on the bare file:

  # mkfs.ext4 /mnt/foo.img 
  mke2fs 1.46.4 (18-Aug-2021)
  mkfs.ext4: Size of device (0x7000000000000 blocks) /mnt/foo.img too big to create
          a filesystem using a blocksize of 4096.
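As a sanity check on those numbers, here is a small arithmetic sketch (plain Python, nothing ext4-specific) showing that a 7 EiB image yields exactly the sector and block counts seen above:

```python
size_bytes = 7 * 2**60        # 7 EiB
sectors = size_bytes // 512   # size in 512-byte sectors
blocks = size_bytes // 4096   # size in 4k file system blocks

print(sectors)                # 15762598695796736
print(hex(blocks))            # 0x7000000000000, the count in the error message
```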

I am seeing an interesting bug when I try to create a file system somewhere between 64PB and 1EB in size, caused by the fact that we have a 2**32 limit on the number of inodes, and e2fsprogs doesn't handle the case where there are fewer than 8 inodes per block group. Until we fix this particular shortcoming, it means that using 4k blocks, the current implementation limit (imposed by e2fsprogs) is 64PB. In practice, though, that's far larger than most people will find it useful to use ext4. And quite frankly, at that size, it's likely that XFS is a better choice --- although Red Hat doesn't support XFS file systems larger than 1PB, even in RHEL 8. (For RHEL 7 the maximum supported XFS file system size is 500TB.)
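A back-of-the-envelope sketch of where that 64PB limit comes from (plain Python, assuming the defaults of 4k blocks, 32768 blocks per block group, and the 2**32 inode cap; this is the arithmetic, not mke2fs's actual code path):

```python
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 32768   # one 4k bitmap block tracks 8 * 4096 blocks
MAX_INODES = 2**32         # 32-bit inode count limit

def inodes_per_group(fs_bytes):
    """Inodes available per block group once the 2**32 cap kicks in."""
    groups = fs_bytes // BLOCK_SIZE // BLOCKS_PER_GROUP
    return MAX_INODES // groups

print(inodes_per_group(64 * 2**50))    # 8  -- still a multiple of 8 at 64PB
print(inodes_per_group(128 * 2**50))   # 4  -- below 8, which e2fsprogs can't handle
```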

tytso commented 2 years ago

Actually, if you want to create a file system that is, say, 256 PB using ext4, you can do so by using the bigalloc file system option. This creates a file system using a default cluster size of 64k, which reduces the number of block groups, reduces the block allocation bitmap overhead by a factor of 16, and reduces the number of block groups by a factor of 256. But again, file systems of these sizes are more of a curiosity than something that I would recommend for production use unless you have a very specialized use case.
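For anyone who wants to try this, a sketch of what that invocation might look like (the image path is an assumption; `-O bigalloc` enables the feature and `-C 65536` selects the 64k cluster size):

```shell
# Sketch only: create a sparse 256PB image and format it with bigalloc.
# /mnt/big.img is a made-up path; use a file system that supports
# sparse files of this size.
truncate -s 256P /mnt/big.img
mkfs.ext4 -O bigalloc -C 65536 /mnt/big.img
```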

Indeed, using a file system of this size on a single node server is probably not a good idea with any file system type --- if you lose a single power supply or server, a very large amount of storage goes off-line, and the time to recover a file system of that size if it gets corrupted is substantial. So certainly by the time you get up to petabyte file systems, and probably before that point, you should have switched over to using a cluster file system, such as Colossus [1].

[1] http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Keynote.pdf

rwmjones commented 2 years ago

I wonder if this has been fixed since the blog post last year, because I can't reproduce it with mkfs.ext4 from 28-Feb-2021. It works -- or rather, fails nicely -- for me now:

$ rm /var/tmp/disk.img
rm: cannot remove '/var/tmp/disk.img': No such file or directory
$ touch /var/tmp/disk.img
$ nbdfuse /var/tmp/disk.img --command nbdkit -s memory 7E &
[1] 1535874
$ ls -lh /var/tmp/disk.img 
-rw-rw-rw-. 1 rjones rjones 7.0E Dec 28 20:12 /var/tmp/disk.img
$ mkfs.ext4 /var/tmp/disk.img
mke2fs 1.46.2 (28-Feb-2021)
mkfs.ext4: Size of device (0x7000000000000 blocks) /var/tmp/disk.img too big to create
    a filesystem using a blocksize of 4096.

rwmjones commented 2 years ago

I cannot reproduce it with mke2fs 1.45.6 (20-Mar-2020) either.

I wonder if I meant mkfs.xfs, because that tool does "spin" -- but it is doing real work, and it causes nbdkit (which stores the backing data in memory) to use more and more memory.

Anyway, this is likely not a bug in e2fsprogs, so please close it -- sorry for the noise!

tytso commented 2 years ago

Well, it was actually useful for me to take a look, because I was able to make mkfs.ext4 "spin" forever while creating a 256 PB image without the bigalloc option. So:

  # truncate -s 256P /mnt/foo.img
  # mkfs.ext4 -t ext4 /mnt/foo.img

Will "spin" forever. That's due to the fact that e2fsprogs was trying to create a file system with 2 inodes per block group, and e2fsprogs doesn't support an inodes_per_group that is not a multiple of 8, because of how it handles bit arrays. So it tries to "round up" the inodes per group to 8, but then the inode count becomes larger than 2**32, and Hilarity Ensues. So that's a real mkfs.ext4 bug, although in practice no one has noticed it until now, since in real life it's not practical to try to create ext4 file systems that large --- although that hasn't stopped a CentOS 7 user from trying to resize an ext4 file system to 1PB, and asking me for help, which is how I noticed that Red Hat doesn't officially support ext4 file systems larger than 50TB, and XFS file systems are supported only up to 500TB for RHEL 7. :-)
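The failure mode described above can be sketched numerically (plain Python; the constants are the mke2fs defaults, and the final 32-bit truncation models storing the result in a 32-bit superblock field -- this is the arithmetic, not mke2fs's actual code path):

```python
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 32768
MAX_INODES = 2**32

size = 256 * 2**50                     # a 256 PiB image
groups = size // BLOCK_SIZE // BLOCKS_PER_GROUP
ipg = MAX_INODES // groups             # 2 inodes per block group
ipg_rounded = (ipg + 7) & ~7           # "round up" to a multiple of 8
total_inodes = ipg_rounded * groups    # 2**34 -- exceeds the 2**32 limit

print(ipg, ipg_rounded, total_inodes)
print(total_inodes & 0xFFFFFFFF)       # 0 once truncated to 32 bits
```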