
Excessive file fragmentation on parallel file creation #496

bolausson commented 1 month ago

Hi there,

It looks like parallel file creation with IOR (4.0.0) causes unnecessary file fragmentation.

Here is an example and a comparison with FIO (serialised and parallelised file creation). Even with parallel file creation, FIO does a very good job of keeping fragmentation to a minimum.

Is there any chance of improving this? (A minimal stand-alone reproduction of the write pattern is sketched after the examples below.)

  1. DD single file (just for reference)

    bo@x440-01 $ dd if=/dev/random of=test-dd bs=1M count=1024 oflag=sync
    bo@x440-01 $ filefrag test-dd 
    test-dd: 1 extent found
  2. IOR single process

    bo@x440-01 $ mpirun -np 1 ior -a POSIX -w -F -e -g -t 1m -b 1g -k -o /scratch/bolausson/hdd/ior/test-plain.ior
    bo@x440-01 $ filefrag test-plain.ior.00000000 
    test-plain.ior.00000000: 1 extent found
  3. IOR 10 processes

    bo@x440-01 $ mpirun -np 10 ior -a POSIX -w -F -e -g -t 1m -b 1g -k -o /scratch/bolausson/hdd/ior/test-plain.ior
    bo@x440-01 $ for i in test-plain.ior.0000000* ; do filefrag ${i} ; done
    test-plain.ior.00000000: 116 extents found
    test-plain.ior.00000001: 115 extents found
    test-plain.ior.00000002: 90 extents found
    test-plain.ior.00000003: 95 extents found
    test-plain.ior.00000004: 91 extents found
    test-plain.ior.00000005: 116 extents found
    test-plain.ior.00000006: 97 extents found
    test-plain.ior.00000007: 118 extents found
    test-plain.ior.00000008: 107 extents found
    test-plain.ior.00000009: 111 extents found
  4. FIO single process

    bo@x440-01 $ fio --name fio-serial --numjobs=1 --create_serialize=1 --ioengine=sync --size=1g --blocksize=1M --group_reporting=1 --rw=write --directory=/scratch/bolausson/hdd/ior
    bo@x440-01 $ filefrag fio-serial.0.0
    fio-serial.0.0: 1 extent found
  5. FIO 10 processes, serialized create (default behaviour)

    bo@x440-01 $ fio --name fio-serial-multi --numjobs=10 --create_serialize=1 --ioengine=sync --size=1g --blocksize=1M --group_reporting=1 --rw=write --directory=/scratch/bolausson/hdd/ior
    bo@x440-01 $ for i in fio-serial-multi.* ; do filefrag ${i} ; done
    fio-serial-multi.0.0: 1 extent found
    fio-serial-multi.1.0: 1 extent found
    fio-serial-multi.2.0: 1 extent found
    fio-serial-multi.3.0: 1 extent found
    fio-serial-multi.4.0: 1 extent found
    fio-serial-multi.5.0: 1 extent found
    fio-serial-multi.6.0: 1 extent found
    fio-serial-multi.7.0: 1 extent found
    fio-serial-multi.8.0: 1 extent found
    fio-serial-multi.9.0: 1 extent found
  6. FIO 10 processes, parallel create

    bo@x440-01 $ fio --name fio-parallel-multi --numjobs=10 --create_serialize=0 --ioengine=sync --size=1g --blocksize=1M --group_reporting=1 --rw=write --directory=/scratch/bolausson/hdd/ior
    bo@x440-01 $ for i in fio-parallel-multi.* ; do filefrag ${i} ; done
    fio-parallel-multi.0.0: 1 extent found
    fio-parallel-multi.1.0: 1 extent found
    fio-parallel-multi.2.0: 1 extent found
    fio-parallel-multi.3.0: 2 extents found
    fio-parallel-multi.4.0: 1 extent found
    fio-parallel-multi.5.0: 1 extent found
    fio-parallel-multi.6.0: 1 extent found
    fio-parallel-multi.7.0: 1 extent found
    fio-parallel-multi.8.0: 1 extent found
    fio-parallel-multi.9.0: 2 extents found
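
To take MPI out of the picture, the pattern can be reproduced with a minimal stand-alone sketch (my own illustration, not IOR code): each process creates its own file and appends 1 MiB blocks with no preallocation, mirroring -F -t 1m -b 1g -e. The file name frag-repro.c and the output paths are placeholders.

    /* frag-repro.c - minimal sketch of one IOR rank's write pattern:
     * create a file and append 1 MiB blocks with no preallocation.
     * Run several copies concurrently to provoke the interleaved
     * allocation, then compare extent counts with filefrag. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK   (1024 * 1024)  /* 1 MiB transfer size, like -t 1m */
    #define NBLOCKS 1024           /* 1 GiB per file, like -b 1g */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <output-file>\n", argv[0]);
            return 1;
        }
        static char buf[BLOCK];
        memset(buf, 0xab, sizeof(buf));

        int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        for (int i = 0; i < NBLOCKS; i++)
            if (write(fd, buf, BLOCK) != BLOCK) { perror("write"); return 1; }

        if (fsync(fd) != 0) { perror("fsync"); return 1; }  /* like -e */
        close(fd);
        return 0;
    }

Running ten copies concurrently should show the same fragmentation as the mpirun -np 10 case:

    bo@x440-01 $ for i in $(seq 0 9); do ./frag-repro frag-test.${i} & done; wait
    bo@x440-01 $ filefrag frag-test.*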
glennklockwood commented 4 weeks ago

File fragmentation is not a concept known to POSIX, so what you’re seeing is caused by your specific file system. On which one are you running this test?

bolausson commented 3 weeks ago

Oh yes, sorry, I thought I had mentioned the filesystem. It is Lustre. Here is some more information a colleague gathered:

The fio benchmark preallocates its files, as shown in the following snippet of an fio strace:

1599909 13:46:38.047201 openat(AT_FDCWD, "/hdd/ior-16m_dne2-nostriping/fio.blktracesingle/fiojob.0.0", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 6

1599909 13:46:38.047689 fallocate(6, 0, 0, 34359738368) = 0
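
In other words, fio reserves the full file size with a single fallocate call immediately after opening the file, before writing any data. A minimal sketch of that pattern (my own illustration, not fio's source; the path and the 32 GiB size are taken from the trace above):

    #define _GNU_SOURCE     /* for fallocate(2) */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("fiojob.0.0", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* mode 0: allocate (and extend to) 32 GiB up front, as in the
         * trace, so the allocator can hand out large contiguous extents. */
        if (fallocate(fd, 0, 0, 34359738368LL) != 0) {
            perror("fallocate");
            return 1;
        }
        /* ...the actual writes then land in already-allocated space... */
        close(fd);
        return 0;
    }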

In the IOR case there is no preallocation; the files are written sequentially, although each write is preceded by an lseek to the very offset at which the data would have been appended anyway. Here is a snippet of an strace of one IOR process writing one file:

1599661 12:17:18.744826 lseek(18, 4714397696, SEEK_SET) = 4714397696

1599661 12:17:18.744877 write(18, "\272\263\262e\0\0\0\0\10\0\0\0\0\0\0\0\272\263\262e\0\0\0\0\30\0\0\0\0\0\0\0\272\263\262e\0\0\0\0(\0\0\0\0\0\0\0\272\263\262e\0\0\0\08\0\0\0\0\0\0\0\272\263\262e\0\0\0\0H\0\0\0\0\0\0\0\272\263\262e\0\0\0\0X\0\0\0\0\0\0\0\272\263\262e\0\0\0\0h\0\0\0\0\0\0\0\272\263\262e\0\0\0\0x\0\0\0\0\0\0\0"..., 16777216) = 16777216

1599661 12:17:18.761087 lseek(18, 4731174912, SEEK_SET) = 4731174912

1599661 12:17:18.761133 write(18, "\272\263\262e\0\0\0\0\10\0\0\0\0\0\0\0\272\263\262e\0\0\0\0\30\0\0\0\0\0\0\0\272\263\262e\0\0\0\0(\0\0\0\0\0\0\0\272\263\262e\0\0\0\08\0\0\0\0\0\0\0\272\263\262e\0\0\0\0H\0\0\0\0\0\0\0\272\263\262e\0\0\0\0X\0\0\0\0\0\0\0\272\263\262e\0\0\0\0h\0\0\0\0\0\0\0\272\263\262e\0\0\0\0x\0\0\0\0\0\0\0"..., 16777216) = 16777216

1599661 12:17:18.777467 lseek(18, 4747952128, SEEK_SET) = 4747952128

1599661 12:17:18.777513 write(18, "\272\263\262e\0\0\0\0\10\0\0\0\0\0\0\0\272\263\262e\0\0\0\0\30\0\0\0\0\0\0\0\272\263\262e\0\0\0\0(\0\0\0\0\0\0\0\272\263\262e\0\0\0\08\0\0\0\0\0\0\0\272\263\262e\0\0\0\0H\0\0\0\0\0\0\0\272\263\262e\0\0\0\0X\0\0\0\0\0\0\0\272\263\262e\0\0\0\0h\0\0\0\0\0\0\0\272\263\262e\0\0\0\0x\0\0\0\0\0\0\0"..., 16777216) = 16777216

1599661 12:17:18.793840 lseek(18, 4764729344, SEEK_SET) = 4764729344

1599661 12:17:18.793887 write(18, "\272\263\262e\0\0\0\0\10\0\0\0\0\0\0\0\272\263\262e\0\0\0\0\30\0\0\0\0\0\0\0\272\263\262e\0\0\0\0(\0\0\0\0\0\0\0\272\263\262e\0\0\0\08\0\0\0\0\0\0\0\272\263\262e\0\0\0\0H\0\0\0\0\0\0\0\272\263\262e\0\0\0\0X\0\0\0\0\0\0\0\272\263\262e\0\0\0\0h\0\0\0\0\0\0\0\272\263\262e\0\0\0\0x\0\0\0\0\0\0\0"..., 16777216) = 16777216

The offsets and lengths are sequential with no gaps: each write's offset plus its length is exactly the offset of the next write.
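
So one possible improvement, assuming each rank knows its final file size up front (it does, via -b), would be to issue a single preallocation before the write loop, as fio does. A sketch of the idea (not a patch against IOR; posix_fallocate is the portable spelling of the fallocate call above, and the file name prealloc-write.c, the size argument, and the buffer fill are placeholders):

    /* prealloc-write.c - sketch of the suggested change (not IOR code):
     * reserve the known final size once, then keep the existing
     * lseek + write loop unchanged. Assumes the size is a multiple of
     * the 16 MiB transfer size; compile with -D_FILE_OFFSET_BITS=64
     * on 32-bit systems. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define XFER (16 * 1024 * 1024)   /* 16 MiB, as in the strace above */

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <file> <size-bytes>\n", argv[0]);
            return 1;
        }
        off_t size = atoll(argv[2]);

        int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* The one extra call: reserve all extents before any data lands. */
        int err = posix_fallocate(fd, 0, size);
        if (err != 0) {
            fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
            return 1;
        }

        static char buf[XFER];
        memset(buf, 0xab, sizeof(buf));
        for (off_t off = 0; off < size; off += XFER) {
            if (lseek(fd, off, SEEK_SET) < 0) { perror("lseek"); return 1; }
            if (write(fd, buf, XFER) != XFER) { perror("write"); return 1; }
        }
        close(fd);
        return 0;
    }

One caveat: if the filesystem does not implement fallocate, glibc's posix_fallocate emulates it by writing into every block, which would inflate the benchmark's write traffic, so a real change would presumably need to be optional.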