Closed: ImanolBarba closed this issue 3 months ago
Have you considered adding a SLOG device? Try adding one and see if it improves things.
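For reference, attaching a log vdev is a one-liner; the device path below is a placeholder, and the pool name is taken from the `zpool iostat` commands later in this issue:

```sh
# Attach a fast SSD/NVMe partition as a dedicated SLOG (device path is hypothetical)
sudo zpool add bckpool log /dev/nvme0n1p1

# Confirm the log vdev is attached
zpool status bckpool
```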
What tool/setup are you using to copy the 2.2T file? Any network services involved on either side?
Do you see high usage/pinned CPUs? If so, does `sudo perf top` give you any clues?
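In case it helps, a typical invocation (standard perf flags) would be:

```sh
# Sample CPU hotspots with call graphs; watch for zfs/spl kernel symbols at the top
sudo perf top -g --sort symbol
```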
Does this happen during or after the copy? If during, at what point does performance degrade? Is it all at once or little by little?
It looks a bit like something in sync context is doing computation work proportional to the file size, judging by the long `stime` but modest write bandwidth in the `txgs` output. Maybe as the file grows larger, more work is done, slowing things down?
> What tool/setup are you using to copy the 2.2T file? Any network services involved on either side?
No, just `cp -av`.
> Do you see high usage/pinned CPUs? If so, does `sudo perf top` give you any clues?
No more than 30% sys CPU occasionally, likely due to compression.
> Does this happen during or after the copy? If during, at what point does performance degrade? Is it all at once or little by little?
During, after copying about 4.5 TB
I did add a SLOG, but since the writes are async it did nothing.
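One way to check whether the SLOG is in the write path at all is to force synchronous semantics temporarily; this is a generic sketch (pool name assumed), not something from this thread:

```sh
# Force all writes through the ZIL/SLOG for the duration of a test
sudo zfs set sync=always bckpool

# ... rerun the copy or a benchmark, then revert ...
sudo zfs set sync=standard bckpool
```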
What if you test it with the `fio` tool? Also make sure you aren't using up all the memory when the issue happens.
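A representative sequential-write test could look like the following; the mountpoint, file size, and block size are assumptions:

```sh
# Sequential 1M buffered writes into the pool's mountpoint (placeholders throughout)
fio --name=seqwrite --directory=/bckpool --rw=write --bs=1M \
    --size=50G --ioengine=libaio --iodepth=16 --group_reporting
```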
`fio` shows about the same throughput as `zpool iostat`.
(Total) memory usage never goes beyond 70G, which is expected since the ARC is supposed to take half of the 128G of RAM.
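For the record, the current and maximum ARC sizes can be read straight from the kstats on Linux:

```sh
# Print the current ARC size and its configured ceiling, in bytes
awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
```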
So I forgot to update this issue, but it turned out that I had bad RAM (I started getting errors reported on files at random and confirmed the bad RAM with memtest). After replacing the RAM, I don't have the issue anymore. Why did it happen? I have no clue, but with the same setup I have no performance issues at all.
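For anyone landing here after a similar bad-RAM episode: a scrub is the usual way to confirm the pool is clean again (pool name assumed from the commands below):

```sh
# Re-verify every block's checksum after the hardware fix
sudo zpool scrub bckpool

# Watch progress and list any files flagged as damaged
zpool status -v bckpool
```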
System information
Describe the problem you're observing
I have a RAIDZ2 pool that is showing a strong write performance degradation when a very large file (2.2TB) is being copied into it. Usually, copying fairly large files (GBs in magnitude) yields about 1GB/s to 1.3GB/s with occasional dips to 600MB/s; a while after the big file starts copying, it drops to about 20-30 MB/s, as shown by `zpool iostat -yv 2`. Compared to normal operation, the decrease in throughput is actually a decrease in IOPS.
Nothing interesting in dmesg besides complaints that the `txg_sync` task is taking more than 120s, which is expected given the issue.

Describe how to reproduce the problem
I usually copy files from another pool (which is showing no performance issues) to this pool. The files being copied are backups of differing sizes, some in the hundreds of GB, a few around 2TB. Files are copied at a good rate until it hits the very large 2.2TB file; a while into copying it, write performance drops significantly. Read performance stays unaffected.

The issue persists until the host is rebooted and I start copying smaller files again; exporting and importing the pool does not resolve the issue.
Include any warning/errors/backtraces from the system logs
Pool parameters:
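A dump like this can be gathered with (pool name from the commands below):

```sh
# Pool-level properties (ashift, autotrim, feature flags, ...)
zpool get all bckpool

# Dataset-level properties (recordsize, compression, sync, ...)
zfs get all bckpool
```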
Some other metrics of interest I gathered while the issue is present (a sketch for reading the raw kstats follows this list):

`txgs`

`dmu_tx_assign`

`zpool iostat bckpool -r 2`

This one is interesting because it shows the request size as 16K, when the pool has a 1M recordsize (I also had the same issue with the default recordsize of 128K).

During regular operation (same pool):

`zpool iostat bckpool -q 2`

`zpool iostat bckpool -w 2`

`zpool iostat bckpool -l 2`
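The `txgs` and `dmu_tx_assign` histograms above come from per-pool kstats; a minimal way to snapshot them (paths assume ZFS on Linux and its procfs interface):

```sh
# Per-txg history: open/quiesce/sync times and bytes written for recent txgs
cat /proc/spl/kstat/zfs/bckpool/txgs

# Histogram of how long transaction assignment took
cat /proc/spl/kstat/zfs/bckpool/dmu_tx_assign
```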