jcbdev opened 5 years ago
I just tried with all six disks. Here is the output; notice the ~19 MB/s per drive. Are all the SATA ports multiplexed together on one lane or something?
root@gnubee-n1:~# dd if=/dev/zero of=/data/brick1/test bs=1M count=1000 & dd if=/dev/zero of=/data/brick2/test bs=1M count=1000 & dd if=/dev/zero of=/data/brick3/test bs=1M count=1000 & dd if=/dev/zero of=/data/brick4/test bs=1M count=1000 & dd if=/dev/zero of=/data/brick5/test bs=1M count=1000 & dd if=/dev/zero of=/data/brick6/test bs=1M count=1000
[1] 3451
[2] 3452
[3] 3453
[4] 3454
[5] 3455
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 53.403 s, 19.6 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 53.3504 s, 19.7 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 54.4687 s, 19.3 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 54.5439 s, 19.2 MB/s
[2] Done dd if=/dev/zero of=/data/brick2/test bs=1M count=1000
[3] Done dd if=/dev/zero of=/data/brick3/test bs=1M count=1000
[5]+ Done dd if=/dev/zero of=/data/brick5/test bs=1M count=1000
root@gnubee-n1:~# 1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 55.0374 s, 19.1 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 54.3202 s, 19.3 MB/s
root@gnubee-n1:~#
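(Aside: the same test with direct I/O bypasses the page cache, which can give cleaner per-disk numbers; a variant sketch, assuming the same brick mount points:)

dd if=/dev/zero of=/data/brick1/test bs=1M count=1000 oflag=direct &
dd if=/dev/zero of=/data/brick2/test bs=1M count=1000 oflag=direct &
# ...repeat for bricks 3 through 6...
wait
# 'wait' blocks until every background dd finishes, so all the
# reported speeds cover the same contended interval.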
According to the MediaTek specs the chip has 3 PCIe lanes, and looking at the specs for the ASM1061 (and the commodity 1061 cards you can buy on eBay) they claim you can run 2 full SATA III 6 Gbps ports on one PCIe lane. Looking at the bottom of the board (GnuBee PC2) I can see 3 ASM1061 chips, all seemingly connected on a different lane directly to the MediaTek chip.
So from a theoretical hardware perspective that all seems to add up to the prospect of 6 full-speed SATA ports! But as you can see above, the maximum throughput I can push through all the buses at the same time is less than one saturated SATA III port. Far, far less! I'm only getting around 100 MB/s combined, when "theoretically" even one SATA II port should be able to hit 300 MB/s. Even the original PCIe 1.0a has a theoretical throughput of 250 MB/s per lane, so even if these were 3 PCIe 1.0a lanes we should see higher than 100 MB/s across all devices.
This suggests there's a kernel or driver issue somewhere that needs to be addressed, unless I'm misinterpreting the specs (or maybe there's a hardware bottleneck elsewhere?). I'd love to help work on this if you think it's an issue. (Sorry for the spam; this is my first time hacking at a kernel for an SBC device and I'm loving it.) I just need some direction on where to focus my efforts!
http://www.asmedia.com.tw/eng/e_show_products.php?item=118 https://wikidevi.com/wiki/MediaTek_MT7621
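One way to sanity-check the negotiated link speed and width against the datasheet numbers is lspci, if pciutils is installed on the image:

lspci -vv | grep -E 'LnkCap|LnkSta'
# LnkCap is what the device advertises; LnkSta is what was actually
# negotiated (e.g. "Speed 2.5GT/s, Width x1" would mean PCIe 1.x, one lane).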
I own a GB-PC2, running the latest kernel provided by @neilbrown (thanks a LOT, by the way). I tested a parallel dd on 3 disks, without RAID. I have similar results: around 37 MB/s on each of the 3, 110 MB/s overall, and the same overall figure with 2 parallel dd's. The CPU is quite busy with these 3 dd's: a load average of more than 4, and 2.5 with 2 dd's. And I have no FS, LVM or RAID set up. Maybe it's a matter of limited computing power.
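A way to tell a CPU bottleneck from a disk bottleneck is to watch per-device utilization and CPU time while the dd's run; a sketch, assuming the sysstat package is installed:

iostat -xz 1
# %util near 100% on the disks suggests the disks are the limit;
# high %sys with mostly idle disks points at the CPU / interrupt path.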
I think that CPU0 handles all the interrupts. I wonder if distributing them would help (or hurt).
These three files:
/sys/devices/pci0000:00/0000:00:00.0/pci_bus/0000:01/cpuaffinity
/sys/devices/pci0000:00/0000:00:01.0/pci_bus/0000:02/cpuaffinity
/sys/devices/pci0000:00/0000:00:02.0/pci_bus/0000:03/cpuaffinity
all contain 'f'. If you set them to 1, 2 and 4, then the interrupts might all go to different CPUs. "cat /proc/interrupts" will show you how many land on which CPU.
It might be an interesting experiment
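The experiment would look something like this (a sketch, with the caveat that these sysfs files may not be writable on every kernel):

echo 1 > /sys/devices/pci0000:00/0000:00:00.0/pci_bus/0000:01/cpuaffinity
echo 2 > /sys/devices/pci0000:00/0000:00:01.0/pci_bus/0000:02/cpuaffinity
echo 4 > /sys/devices/pci0000:00/0000:00:02.0/pci_bus/0000:03/cpuaffinity
# 1, 2 and 4 are CPU bitmasks: CPU0, CPU1 and CPU2 respectively.
cat /proc/interrupts   # check which CPU each interrupt now lands on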
I don't know how to change this. These "files" are read-only. I tried the irqbalance daemon instead; the interrupts were then "distributed", one PCI channel's interrupts to one CPU core. The results are the same, more or less.
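For anyone else trying this: the standard way to pin an individual interrupt by hand is the writable smp_affinity file under /proc/irq, rather than the read-only pci_bus files. A sketch (the IRQ numbers below are placeholders; read the real ones from /proc/interrupts first):

grep -iE 'sata|ahci|pcie' /proc/interrupts   # find the controllers' IRQ numbers
echo 1 > /proc/irq/23/smp_affinity            # placeholder IRQ numbers:
echo 2 > /proc/irq/24/smp_affinity            # substitute the ones found above
echo 4 > /proc/irq/25/smp_affinity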
I'm convinced the MediaTek chip is using a PCIe "switch" internally to provide the three lanes rather than genuinely having 3 independent lanes. I did find a datasheet somewhere which made me suspect this, but I can't remember where (I posted a link on the GnuBee Google group, I think): https://groups.google.com/forum/#!topic/gnubee/5_nKjgmKSoY
Here are some tests I did that can hopefully give people things to compare to. I'm running a RAID-5 array with bcache and btrfs. Bcache doesn't do anything for performance here; in fact it should do the opposite. But the main reason I'm using a GnuBee is that I'm working on a solar-powered off-grid solution, and I had hoped to be able to minimize power usage with it.
# Copying /dev/zero into a ramdisk
traverseda@storage:~$ dd if=/dev/zero of=/tmp/ramdisk/test.img bs=10k count=10k
10240+0 records in
10240+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.705581 s, 149 MB/s
# Copying /dev/zero into /dev/zero, this might be a no-op?
# (writes to /dev/zero are discarded, so this mostly measures syscall overhead)
traverseda@storage:~$ dd if=/dev/zero of=/dev/zero bs=10k count=10k
[ lines omitted for brevity from now on ]
104857600 bytes (105 MB, 100 MiB) copied, 0.0727784 s, 1.4 GB/s
# Copying /dev/zero onto the ext4 root partition, which is not part of
# any RAID array or cache and is directly on an SSD
traverseda@storage:~$ dd if=/dev/zero of=~/test.img bs=10k count=10k
104857600 bytes (105 MB, 100 MiB) copied, 1.72939 s, 60.6 MB/s
# Copying /dev/zero onto a RAID array with 5 drives, plus bcache.
traverseda@storage:~$ dd if=/dev/zero of=/mnt/array/traverseda/test.img bs=10k count=10k
104857600 bytes (105 MB, 100 MiB) copied, 4.33028 s, 24.2 MB/s
# Real on-disk size of the file, since it's /dev/zero
traverseda@storage:~$ sudo compsize /mnt/array/traverseda/test.img
Type Perc Disk Usage Uncompressed Referenced
TOTAL 3% 3.1M 100M 100M
zstd 3% 3.1M 100M 100M
# Copying a randomly-generated file to /dev/zero, to ensure we get good performance
traverseda@storage:~$ dd if=/tmp/ramdisk/random.img of=/dev/zero bs=10k count=10k
104857600 bytes (105 MB, 100 MiB) copied, 0.662688 s, 158 MB/s
# Writing random data to btrfs (from /dev/urandom this time, not the ramdisk file)
dd if=/dev/urandom of=/mnt/array/traverseda/test.img bs=10k count=10k
104857600 bytes (105 MB, 100 MiB) copied, 8.40688 s, 12.5 MB/s
# Comparing compressed sizes...
traverseda@storage:~$ sudo compsize /mnt/array/traverseda/random.img
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 98M 98M 98M
none 100% 98M 98M 98M
So you can see a few things here. One is that more CPU-intensive operations like compression can actually speed up the transfer, since highly compressible data means fewer bytes hit the disks, but not by as much as you'd think. I'm pushing the CPU pretty hard with multiple layers of indirection, and I still get better performance on more compressible files. If I completely remove bcache I get exactly the same results, as near as I can tell.
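For context, the compression here is btrfs's transparent zstd support (visible in the compsize output above), enabled with a mount option along these lines (the device and mountpoint below are assumptions, not my exact setup):

mount -o compress=zstd /dev/bcache0 /mnt/array
# btrfs compresses extents with zstd before they reach the disks,
# which is why highly compressible /dev/zero data writes faster.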
I get this on a Crucial 120 GB BX500 SATA SSD plugged into the first slot, on an XFS filesystem, with the latest image:
root@gb1:/# uname -a
Linux gnubee 5.19.13+ #3 SMP Wed Oct 12 13:41:24 AEDT 2022 mips GNU/Linux
root@gb1:/# time dd if=/dev/zero of=/mnt/zeroed.bin bs=16k count=50000 conv=fdatasync
50000+0 records in
50000+0 records out
819200000 bytes (819 MB, 781 MiB) copied, 6.5946 s, 124 MB/s
real 0m6.942s
user 0m0.081s
sys 0m6.629s
I did a rough sweep of various block sizes; 16k looks like the sweet spot...
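For anyone who wants to reproduce the sweep, a minimal sketch that keeps the total write at the same ~819 MB while varying the block size (the path is assumed to be the same XFS mount as above):

for bs_count in "4k 200000" "16k 50000" "64k 12500" "256k 3125"; do
  set -- $bs_count
  echo "bs=$1:"
  # dd prints its summary on stderr; keep just the final throughput line
  dd if=/dev/zero of=/mnt/zeroed.bin bs=$1 count=$2 conv=fdatasync 2>&1 | tail -n 1
done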
This is probably more of a question, as you seem to have more experience with hardware than I do (I am a programmer). I have been messing around with all the various RAID combinations on the device (RAID 0, 1 and 10), and no matter which combination I try I get the exact same write speed (about 72 MB/s), which I believe is the max write speed of the disks I am using (when used singly).
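For reference, the kind of setup being compared would look roughly like this (a sketch only; the device names, RAID level and filesystem are assumptions, not the exact commands I used):

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[a-d]
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt/array
dd if=/dev/zero of=/mnt/array/test bs=1M count=1000 conv=fdatasync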
So I tried dd if=/dev/zero of=test bs=1M count=1000 in two parallel processes against two of the drives and got the same combined write speed of 72 MB/s (when you add the two speeds together).
This got me scratching my head a bit. Is there only one 6 Gbps lane that all the ports are multiplexed through, or is there something strange going on? I should at least have gotten above the write speed of a single drive.
I have tried old kernels too (like the original v3 kernel) and different Debian releases (jessie, stretch and buster), and I think I am getting the same results.
Is there a hardware limitation I am missing or something?
On a side note, thanks for all your hard work! It has been inspiring playing with your tools, building the kernels, etc.