
Excessive file fragmentation on parallel file creation #496

bolausson commented 1 month ago

Hi there,

It looks like parallel file creation with IOR (4.0.0) causes unnecessary file fragmentation.

Here is an example and a comparison with FIO (serialised and parallelised file creation). Even with parallel file creation, FIO does a very good job of keeping fragmentation to a minimum.

Is there any chance of improving this? (A minimal stand-alone reproduction of the write pattern is sketched after the examples below.)

  1. DD single file (just for reference)

    bo@x440-01 $ dd if=/dev/random of=test-dd bs=1M count=1024 oflag=sync
    bo@x440-01 $ filefrag test-dd 
    test-dd: 1 extent found
  2. IOR single process

    bo@x440-01 $ mpirun -np 1 ior -a POSIX -w -F -e -g -t 1m -b 1g -k -o /scratch/bolausson/hdd/ior/test-plain.ior
    bo@x440-01 $ filefrag test-plain.ior.00000000 
    test-plain.ior.00000000: 1 extent found
  3. IOR 10 processes

    bo@x440-01 $ mpirun -np 10 ior -a POSIX -w -F -e -g -t 1m -b 1g -k -o /scratch/bolausson/hdd/ior/test-plain.ior
    bo@x440-01 $ for i in test-plain.ior.0000000* ; do filefrag ${i} ; done
    test-plain.ior.00000000: 116 extents found
    test-plain.ior.00000001: 115 extents found
    test-plain.ior.00000002: 90 extents found
    test-plain.ior.00000003: 95 extents found
    test-plain.ior.00000004: 91 extents found
    test-plain.ior.00000005: 116 extents found
    test-plain.ior.00000006: 97 extents found
    test-plain.ior.00000007: 118 extents found
    test-plain.ior.00000008: 107 extents found
    test-plain.ior.00000009: 111 extents found
  4. FIO single process

    bo@x440-01 $ fio --name fio-serial --numjobs=1 --create_serialize=1 --ioengine=sync --size=1g --blocksize=1M --group_reporting=1 --rw=write --directory=/scratch/bolausson/hdd/ior
    bo@x440-01 $ filefrag fio-serial.0.0
    fio-serial.0.0: 1 extent found
  5. FIO 10 processes, serialized create (default behaviour)

    bo@x440-01 $ fio --name fio-serial-multi --numjobs=10 --create_serialize=1 --ioengine=sync --size=1g --blocksize=1M --group_reporting=1 --rw=write --directory=/scratch/bolausson/hdd/ior
    bo@x440-01 $ for i in fio-serial-multi.* ; do filefrag ${i} ; done
    fio-serial-multi.0.0: 1 extent found
    fio-serial-multi.1.0: 1 extent found
    fio-serial-multi.2.0: 1 extent found
    fio-serial-multi.3.0: 1 extent found
    fio-serial-multi.4.0: 1 extent found
    fio-serial-multi.5.0: 1 extent found
    fio-serial-multi.6.0: 1 extent found
    fio-serial-multi.7.0: 1 extent found
    fio-serial-multi.8.0: 1 extent found
    fio-serial-multi.9.0: 1 extent found
  6. FIO 10 processes, parallel create

    bo@x440-01 $ fio --name fio-parallel-multi --numjobs=10 --create_serialize=0 --ioengine=sync --size=1g --blocksize=1M --group_reporting=1 --rw=write --directory=/scratch/bolausson/hdd/ior
    bo@x440-01 $ for i in fio-parallel-multi.* ; do filefrag ${i} ; done
    fio-parallel-multi.0.0: 1 extent found
    fio-parallel-multi.1.0: 1 extent found
    fio-parallel-multi.2.0: 1 extent found
    fio-parallel-multi.3.0: 2 extents found
    fio-parallel-multi.4.0: 1 extent found
    fio-parallel-multi.5.0: 1 extent found
    fio-parallel-multi.6.0: 1 extent found
    fio-parallel-multi.7.0: 1 extent found
    fio-parallel-multi.8.0: 1 extent found
    fio-parallel-multi.9.0: 2 extents found
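
To take MPI out of the picture, the pattern can be reproduced with a minimal stand-alone sketch (my own illustration, not IOR code): each process creates its own file and appends 1 MiB blocks with no preallocation, mirroring -F -t 1m -b 1g -e. The file name frag-repro.c and the output paths are placeholders.

    /* frag-repro.c - minimal sketch of one IOR rank's write pattern:
     * create a file and append 1 MiB blocks with no preallocation.
     * Run several copies concurrently to provoke the interleaved
     * allocation, then compare extent counts with filefrag. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK   (1024 * 1024)  /* 1 MiB transfer size, like -t 1m */
    #define NBLOCKS 1024           /* 1 GiB per file, like -b 1g */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <output-file>\n", argv[0]);
            return 1;
        }
        static char buf[BLOCK];
        memset(buf, 0xab, sizeof(buf));

        int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        for (int i = 0; i < NBLOCKS; i++)
            if (write(fd, buf, BLOCK) != BLOCK) { perror("write"); return 1; }

        if (fsync(fd) != 0) { perror("fsync"); return 1; }  /* like -e */
        close(fd);
        return 0;
    }

Running ten copies concurrently should show the same fragmentation as the mpirun -np 10 case:

    bo@x440-01 $ for i in $(seq 0 9); do ./frag-repro frag-test.${i} & done; wait
    bo@x440-01 $ filefrag frag-test.*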
glennklockwood commented 4 weeks ago

File fragmentation is not a concept known to POSIX, so what you’re seeing is caused by your specific file system. On which one are you running this test?

bolausson commented 3 weeks ago

Oh yes, sorry, I thought I had mentioned the filesystem. It is Lustre. Here is some more information a colleague gathered:

The fio benchmark preallocates its files, as shown in the following snippet of an fio strace:

1599909 13:46:38.047201 openat(AT_FDCWD, "/hdd/ior-16m_dne2-nostriping/fio.blktracesingle/fiojob.0.0", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 6

1599909 13:46:38.047689 fallocate(6, 0, 0, 34359738368) = 0
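
In other words, fio reserves the full file size with a single fallocate call immediately after opening the file, before writing any data. A minimal sketch of that pattern (my own illustration, not fio's source; the path and the 32 GiB size are taken from the trace above):

    #define _GNU_SOURCE     /* for fallocate(2) */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("fiojob.0.0", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* mode 0: allocate (and extend to) 32 GiB up front, as in the
         * trace, so the allocator can hand out large contiguous extents. */
        if (fallocate(fd, 0, 0, 34359738368LL) != 0) {
            perror("fallocate");
            return 1;
        }
        /* ...the actual writes then land in already-allocated space... */
        close(fd);
        return 0;
    }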

In the IOR case there is no preallocation; the files are written sequentially, although each write is preceded by an lseek to the very offset at which the data would have been appended anyway. Here is a snippet of an strace of one IOR process writing one file:

1599661 12:17:18.744826 lseek(18, 4714397696, SEEK_SET) = 4714397696

1599661 12:17:18.744877 write(18, "\272\263\262e\0\0\0\0\10\0\0\0\0\0\0\0\272\263\262e\0\0\0\0\30\0\0\0\0\0\0\0\272\263\262e\0\0\0\0(\0\0\0\0\0\0\0\272\263\262e\0\0\0\08\0\0\0\0\0\0\0\272\263\262e\0\0\0\0H\0\0\0\0\0\0\0\272\263\262e\0\0\0\0X\0\0\0\0\0\0\0\272\263\262e\0\0\0\0h\0\0\0\0\0\0\0\272\263\262e\0\0\0\0x\0\0\0\0\0\0\0"..., 16777216) = 16777216

1599661 12:17:18.761087 lseek(18, 4731174912, SEEK_SET) = 4731174912

1599661 12:17:18.761133 write(18, "\272\263\262e\0\0\0\0\10\0\0\0\0\0\0\0\272\263\262e\0\0\0\0\30\0\0\0\0\0\0\0\272\263\262e\0\0\0\0(\0\0\0\0\0\0\0\272\263\262e\0\0\0\08\0\0\0\0\0\0\0\272\263\262e\0\0\0\0H\0\0\0\0\0\0\0\272\263\262e\0\0\0\0X\0\0\0\0\0\0\0\272\263\262e\0\0\0\0h\0\0\0\0\0\0\0\272\263\262e\0\0\0\0x\0\0\0\0\0\0\0"..., 16777216) = 16777216

1599661 12:17:18.777467 lseek(18, 4747952128, SEEK_SET) = 4747952128

1599661 12:17:18.777513 write(18, "\272\263\262e\0\0\0\0\10\0\0\0\0\0\0\0\272\263\262e\0\0\0\0\30\0\0\0\0\0\0\0\272\263\262e\0\0\0\0(\0\0\0\0\0\0\0\272\263\262e\0\0\0\08\0\0\0\0\0\0\0\272\263\262e\0\0\0\0H\0\0\0\0\0\0\0\272\263\262e\0\0\0\0X\0\0\0\0\0\0\0\272\263\262e\0\0\0\0h\0\0\0\0\0\0\0\272\263\262e\0\0\0\0x\0\0\0\0\0\0\0"..., 16777216) = 16777216

1599661 12:17:18.793840 lseek(18, 4764729344, SEEK_SET) = 4764729344

1599661 12:17:18.793887 write(18, "\272\263\262e\0\0\0\0\10\0\0\0\0\0\0\0\272\263\262e\0\0\0\0\30\0\0\0\0\0\0\0\272\263\262e\0\0\0\0(\0\0\0\0\0\0\0\272\263\262e\0\0\0\08\0\0\0\0\0\0\0\272\263\262e\0\0\0\0H\0\0\0\0\0\0\0\272\263\262e\0\0\0\0X\0\0\0\0\0\0\0\272\263\262e\0\0\0\0h\0\0\0\0\0\0\0\272\263\262e\0\0\0\0x\0\0\0\0\0\0\0"..., 16777216) = 16777216

The offsets and lengths are sequential with no gaps: each write's offset plus its length is exactly the offset of the next write.
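
So one possible improvement, assuming each rank knows its final file size up front (it does, via -b), would be to issue a single preallocation before the write loop, as fio does. A sketch of the idea (not a patch against IOR; posix_fallocate is the portable spelling of the fallocate call above, and the file name prealloc-write.c, the size argument, and the buffer fill are placeholders):

    /* prealloc-write.c - sketch of the suggested change (not IOR code):
     * reserve the known final size once, then keep the existing
     * lseek + write loop unchanged. Assumes the size is a multiple of
     * the 16 MiB transfer size; compile with -D_FILE_OFFSET_BITS=64
     * on 32-bit systems. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define XFER (16 * 1024 * 1024)   /* 16 MiB, as in the strace above */

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <file> <size-bytes>\n", argv[0]);
            return 1;
        }
        off_t size = atoll(argv[2]);

        int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* The one extra call: reserve all extents before any data lands. */
        int err = posix_fallocate(fd, 0, size);
        if (err != 0) {
            fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
            return 1;
        }

        static char buf[XFER];
        memset(buf, 0xab, sizeof(buf));
        for (off_t off = 0; off < size; off += XFER) {
            if (lseek(fd, off, SEEK_SET) < 0) { perror("lseek"); return 1; }
            if (write(fd, buf, XFER) != XFER) { perror("write"); return 1; }
        }
        close(fd);
        return 0;
    }

One caveat: if the filesystem does not implement fallocate, glibc's posix_fallocate emulates it by writing into every block, which would inflate the benchmark's write traffic, so a real change would presumably need to be optional.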