Closed by pietrushnic 9 years ago
I did some simple tests here: https://github.com/raspberrypi/linux/pull/652#issuecomment-52702988 How do you get 11.1MB/s write speed without DMA? Sounds impossible to me.
@notro by +DMA I mean bcm2835-dma.c with slave_sg. I will post configs to show which exact options I used.
Two tests that I performed with logs:
pi@raspberrypi ~ $ zgrep ARCH_BCM2 /proc/config.gz
CONFIG_ARCH_BCM2835=y
pi@raspberrypi ~ $ uname -a
Linux raspberrypi 3.17.1+ #59 Sat Oct 25 01:54:16 CEST 2014 armv6l GNU/Linux
pi@raspberrypi ~ $ zcat /proc/config.gz |grep -E "DMA|MMC"|grep -v "not set"
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_ARM_DMA_MEM_BUFFERABLE=y
CONFIG_ZONE_DMA_FLAG=0
CONFIG_SCSI_DMA=y
CONFIG_MMC=y
# MMC/SD/SDIO Card Drivers
CONFIG_MMC_BLOCK=y
CONFIG_MMC_BLOCK_MINORS=8
CONFIG_MMC_BLOCK_BOUNCE=y
# MMC/SD/SDIO Host Controller Drivers
CONFIG_MMC_SDHCI=y
CONFIG_MMC_SDHCI_PLTFM=y
CONFIG_MMC_BCM2835=y
CONFIG_HAS_DMA=y
pi@raspberrypi ~ $ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 47.5086 s, 11.0 MB/s
real 0m47.730s
user 0m0.030s
sys 0m25.980s
real 0m8.991s
user 0m0.000s
sys 0m0.130s
pi@raspberrypi ~ $ dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 33.2664 s, 15.8 MB/s
and DMA with slave_sg transfer mode
pi@raspberrypi ~ $ zgrep ARCH_BCM2 /proc/config.gz
CONFIG_ARCH_BCM2835=y
pi@raspberrypi ~ $ uname -a
Linux raspberrypi 3.17.1+ #61 Sat Oct 25 02:08:29 CEST 2014 armv6l GNU/Linux
pi@raspberrypi ~ $ zcat /proc/config.gz |grep -E "DMA|MMC"|grep -v "not set"
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_ARM_DMA_MEM_BUFFERABLE=y
CONFIG_ZONE_DMA_FLAG=0
CONFIG_SCSI_DMA=y
CONFIG_MMC=y
# MMC/SD/SDIO Card Drivers
CONFIG_MMC_BLOCK=y
CONFIG_MMC_BLOCK_MINORS=8
CONFIG_MMC_BLOCK_BOUNCE=y
# MMC/SD/SDIO Host Controller Drivers
CONFIG_MMC_SDHCI=y
CONFIG_MMC_SDHCI_PLTFM=y
CONFIG_MMC_BCM2835=y
CONFIG_MMC_BCM2835_DMA=y
CONFIG_MMC_BCM2835_PIO_DMA_BARRIER=2
CONFIG_DMADEVICES=y
# DMA Devices
CONFIG_DMA_BCM2835=y
CONFIG_DMA_ENGINE=y
CONFIG_DMA_VIRTUAL_CHANNELS=y
CONFIG_DMA_OF=y
# DMA Clients
CONFIG_HAS_DMA=y
pi@raspberrypi ~ $ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 49.1707 s, 10.7 MB/s
real 0m49.196s
user 0m0.030s
sys 0m15.270s
real 0m13.054s
user 0m0.000s
sys 0m0.070s
pi@raspberrypi ~ $ dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 31.1292 s, 16.8 MB/s
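The rate printed by dd stops counting when dd exits, while the trailing `time sync` shows how long the remaining dirty pages took to reach the card. A minimal Python sketch that folds the sync time into the write rate, using only the figures from the two logs above. The helper name `effective_mb_per_s` is made up for illustration, and MB means 10^6 bytes, as in dd's own output.

```python
# Figures copied from the two logs above; nothing here is measured anew.
BYTES_WRITTEN = 524288000  # dd: 1024 records of 500K


def effective_mb_per_s(dd_seconds, sync_seconds, nbytes=BYTES_WRITTEN):
    """Write rate in MB/s (10^6 bytes), counting dd time plus the final sync."""
    return nbytes / (dd_seconds + sync_seconds) / 1e6


pio = effective_mb_per_s(47.5086, 8.991)    # kernel #59, PIO
dma = effective_mb_per_s(49.1707, 13.054)   # kernel #61, slave_sg DMA

print(f"PIO incl. sync: {pio:.2f} MB/s")
print(f"DMA incl. sync: {dma:.2f} MB/s")
```

This is one run per kernel, so the difference may well be within run-to-run noise.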
I didn't think it possible to do 10MB/s using PIO. I can do a test if you update rpi-dt-firmware or rpi-bcm2835. What numbers do you get on the vanilla Pi kernel (which uses the same driver) ? Probably best to ask Gellert about this.
I added logging, and it uses bcm2835_mmc_transfer_pio and gives 11.3MB/s. This is v3.17.1 with:
on top. I tried clean v3.17.1 and got 700KB/s.
MMC_BCM2835_DMA can't be compiled without DMADEVICES. Without DMA_BCM2835 I get write: 9.4MB/s and read: 15.7MB/s. If I enable DMA_BCM2835, the system hangs with the above Oops; this is because of the missing slave_sg transfer mode. If I merge slave_sg, then I get write: 10.3MB/s and read: 16.1MB/s. As a sanity check I verified whether it uses bcm2835_mmc_transfer_dma, and it doesn't; it still uses bcm2835_mmc_transfer_pio. What is the procedure for enabling bcm2835-mmc with DMA?
OK, I found the reason:
mmc-bcm2835: Unable to initialise DMA channels. Falling back to PIO
The problem above was caused by the lack of DMA_VIRTUAL_CHANNELS, on which DMA_BCM2835 depends. With it enabled I get write: 11.2MB/s and read: 17MB/s, and a sanity check shows that it uses bcm2835_mmc_transfer_dma. The question remains: why is bcm2835-mmc with slave_sg DMA slower than PIO when writing?
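Since a single missing option silently dropped the driver back to PIO, a quick check of the config dump can save a rebuild. A rough Python sketch follows. The option list is copied from the working DMA config dump earlier in this thread, and treating it as the required set is an assumption, not something read from Kconfig. The function name and the sample text are made up for illustration.

```python
# Options taken from the working DMA config dump earlier in this thread.
# Treating them as "required" is an assumption, not something read from Kconfig.
ASSUMED_DMA_OPTIONS = (
    "CONFIG_DMADEVICES",
    "CONFIG_DMA_ENGINE",
    "CONFIG_DMA_VIRTUAL_CHANNELS",
    "CONFIG_DMA_BCM2835",
    "CONFIG_MMC_BCM2835_DMA",
)


def missing_options(config_text, wanted=ASSUMED_DMA_OPTIONS):
    """Return the wanted options that are not set to 'y' in a .config dump."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skips comments such as '# CONFIG_FOO is not set'
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [opt for opt in wanted if opt not in enabled]


# Made-up sample that mimics the failing setup described above.
sample = """CONFIG_MMC_BCM2835=y
CONFIG_MMC_BCM2835_DMA=y
CONFIG_DMADEVICES=y
CONFIG_DMA_BCM2835=y
CONFIG_DMA_ENGINE=y
# CONFIG_DMA_VIRTUAL_CHANNELS is not set
"""
print(missing_options(sample))
```

On a running Pi the text could come from `gzip.open("/proc/config.gz", "rt").read()` instead of the sample string.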
If you want to test this kernel, please use the mmc-dma branch of rpi-dt-firmware:
https://github.com/pietrushnic/rpi-dt-firmware/tree/mmc-dma
These are my numbers on a Model B+ with a Sandisk Ultra 8GB
Your kernel
$ sudo REPO_URI=https://github.com/pietrushnic/rpi-dt-firmware BRANCH=mmc-dma rpi-update && sudo reboot
$ uname -a
Linux raspberrypi 3.17.1+ #81 Sat Oct 25 23:08:18 CEST 2014 armv6l GNU/Linux
$ dmesg
[ 2.435020] DMA channels allocated for the MMC driver
[ 2.472724] Load BCM2835 MMC driver
$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 35.7482 s, 14.7 MB/s
real 0m35.768s
user 0m0.000s
sys 0m16.360s
real 0m6.996s
user 0m0.000s
sys 0m0.080s
$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 36.4759 s, 14.4 MB/s
real 0m36.503s
user 0m0.030s
sys 0m16.350s
real 0m5.783s
user 0m0.010s
sys 0m0.060s
$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 37.5479 s, 14.0 MB/s
real 0m38.117s
user 0m0.040s
sys 0m14.910s
real 0m5.932s
user 0m0.000s
sys 0m0.060s
$ dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 27.6864 s, 18.9 MB/s
$ dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 27.6941 s, 18.9 MB/s
The official Pi kernel using the same driver
$ sudo rpi-update && sudo reboot
$ uname -a
Linux raspberrypi 3.12.30+ #717 PREEMPT Fri Oct 17 18:46:31 BST 2014 armv6l GNU/Linux
$ dmesg
[ 2.056872] DMA channels allocated for the MMC driver
[ 2.100763] Load BCM2835 MMC driver
$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 40.9618 s, 12.8 MB/s
real 0m40.987s
user 0m0.050s
sys 0m10.170s
real 0m1.908s
user 0m0.000s
sys 0m0.010s
$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 38.1315 s, 13.7 MB/s
real 0m38.605s
user 0m0.020s
sys 0m10.530s
real 0m4.845s
user 0m0.000s
sys 0m0.080s
$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 40.5635 s, 12.9 MB/s
real 0m41.034s
user 0m0.010s
sys 0m10.240s
real 0m3.121s
user 0m0.000s
sys 0m0.020s
$ dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 27.7841 s, 18.9 MB/s
$ dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 27.7765 s, 18.9 MB/s
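To compare the two kernels at a glance, the three write runs of each can be averaged. A small Python sketch using only the dd rates printed above (variable names are illustrative):

```python
from statistics import mean

# dd write rates in MB/s, copied from the runs above.
mmc_dma_writes = [14.7, 14.4, 14.0]    # 3.17.1+ #81 (mmc-dma branch)
official_writes = [12.8, 13.7, 12.9]   # 3.12.30+ #717 (official Pi kernel)

mmc_dma_avg = mean(mmc_dma_writes)
official_avg = mean(official_writes)
gain_pct = (mmc_dma_avg / official_avg - 1) * 100

print(f"mmc-dma branch: {mmc_dma_avg:.2f} MB/s")
print(f"official:       {official_avg:.2f} MB/s")
print(f"difference:     {gain_pct:.1f} %")
```

Reads came out at 18.9 MB/s on both kernels, so only the write side differs here. Three runs each is a small sample, and the dd rates do not include the trailing sync.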
Is the second result (the official Pi kernel using the same driver) with DMA enabled?
So we have a slight improvement with slave_sg mode, but what about your concerns about PIO?
You said that it is impossible to get 11MB/s with PIO, so what is wrong with my configuration, and why am I getting this?
My .config: http://paste.ubuntu.com/8694441/
Is this second result with DMA enabled ?
Yes, as the kernel logs says:
[ 2.056872] DMA channels allocated for the MMC driver
[ 2.100763] Load BCM2835 MMC driver
It would say PIO something if DMA was not used.
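The same check can be scripted. A minimal Python sketch that guesses the transfer mode from kernel log text, matching only the message fragments quoted in this thread. Other kernel versions may word these messages differently, and the function name is made up for illustration.

```python
# Message fragments quoted in this thread; other kernels may word them differently.
DMA_MARKERS = ("DMA channels allocated for the MMC driver",)
PIO_MARKERS = ("Forcing PIO mode", "Falling back to PIO")


def mmc_transfer_mode(dmesg_text):
    """Guess 'dma', 'pio', or 'unknown' from kernel log text (last match wins)."""
    mode = "unknown"
    for line in dmesg_text.splitlines():
        if any(marker in line for marker in PIO_MARKERS):
            mode = "pio"
        elif any(marker in line for marker in DMA_MARKERS):
            mode = "dma"
    return mode


log = """[ 2.056872] DMA channels allocated for the MMC driver
[ 2.100763] Load BCM2835 MMC driver
"""
print(mmc_transfer_mode(log))
```

The input could be the captured output of `dmesg`. A result of "unknown" only means none of the listed fragments appeared.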
So we have slight improvement with slave_sg mode, but what about your concerns about PIO ? You said that it is impossible to get 11MB/s with PIO, so what is wrong with my configuration and why I'm getting this ?
Well, I must be wrong here. But, I still don't understand how it's possible to get that kind of speed using PIO. But if DMA is faster than PIO, we don't need to find out, do we :-)
I found this stuff in BCM2835 doc:
The EMMC module restricts the maximum block size to the size of the internal data FIFO
which is 1k bytes. In order to get maximum performance for data transfers it is necessary to
use multiple block data transfers. In this case the EMMC module uses two FIFOs in ping-
pong mode, i.e. one is used to transfer data to/from the card while the other is simultaneously
accessed by DMA via the AXI bus. If the EMMC module is configured for single block
transfers only one FIFO is used, so no DMA access is possible while data is transferred
to/from the card and vice versa resulting in long dead times.
Could this be our case? How can we check whether we use one or two FIFOs?
Maybe @weiszg can answer?
I talked on IRC with @rossoldfield, and we concluded that we may be reaching the CPU limit for my SD card, which would explain why there is no throughput improvement between PIO and DMA. But @notro, look at the sys CPU time: with PIO I have 25s, but with DMA only 15s. In your case I see better sys CPU time but worse throughput with the RPi kernel, and worse sys CPU time but better throughput with my driver. I assume that either better throughput or lower sys CPU time can be an argument for MMC with DMA.
Vanilla RPi 3.12.28 write 10.9 MB/s:
real 0m48.003s
user 0m0.010s
sys 0m10.260s
Vanilla RPi 3.12.28 read 17.0 MB/s:
real 0m30.907s
user 0m0.030s
sys 0m4.050s
3.17.1 with bcm2835-mmc + slave_sg DMA write 11.3MB/s:
real 0m46.469s
user 0m0.000s
sys 0m13.710s
3.17.1 with bcm2835-mmc + slave_sg DMA read 17MB/s:
real 0m30.859s
user 0m0.010s
sys 0m5.640s
My biggest problem with those numbers is that DMA + slave_sg loads the processor more than vanilla RPi. What's better in DMA for bcm2708?
3.17.1 with bcm2835-mmc + PIO write 11.6MB/s:
real 0m45.295s
user 0m0.040s
sys 0m26.130s
3.17.1 with bcm2835-mmc + PIO read 16.2MB/s:
real 0m32.364s
user 0m0.020s
sys 0m23.730s
It looks like there is no free lunch. I can reach over 11.6MB/s, but with much higher CPU load and a drop in read speed.
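To put the CPU cost side by side, the sys time of each run can be expressed as a share of its wall-clock time. A small Python sketch with the real/sys values copied from the six runs above (the helper name `sys_share` and the dictionary layout are illustrative):

```python
# (real seconds, sys seconds) copied from the `time` output of the runs above.
runs = {
    "vanilla 3.12.28 write": (48.003, 10.260),
    "vanilla 3.12.28 read": (30.907, 4.050),
    "3.17.1 slave_sg DMA write": (46.469, 13.710),
    "3.17.1 slave_sg DMA read": (30.859, 5.640),
    "3.17.1 PIO write": (45.295, 26.130),
    "3.17.1 PIO read": (32.364, 23.730),
}


def sys_share(real_s, sys_s):
    """Fraction of wall-clock time spent in kernel mode."""
    return sys_s / real_s


shares = {name: sys_share(*times) for name, times in runs.items()}
for name, share in shares.items():
    print(f"{name:28s} {share * 100:5.1f} % sys")
```

The sys figure reported by `time` covers only the dd process itself, so this is a rough indicator rather than total system load.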
@notro, it would be great if you could run the same tests. I will try tomorrow with a no-name SD card. My main question is: taking these numbers into consideration, is it worth upstreaming MMC + DMA with slave_sg?
To get a usable system we need DMA for this. Unless someone from the Foundation chimes in with some help here, I suggest just leaving the performance as it is. We can nitpick later. Further down the road when this kernel has the same feature set as the vanilla one, I expect this issue will get priority.
$ sudo REPO_URI=https://github.com/pietrushnic/rpi-dt-firmware rpi-update && sudo reboot
$ uname -a
Linux raspberrypi 3.16.6+ #22 Tue Oct 21 22:24:10 CEST 2014 armv6l GNU/Linux
$ zgrep BCM2835_DMA /proc/config.gz
# CONFIG_MMC_BCM2835_DMA is not set
$ dmesg
[ 2.526729] Forcing PIO mode
[ 2.561669] Load BCM2835 MMC driver
# top shows that the cpu runs at 100%
$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024;
524288000 bytes (524 MB) copied, 37.2794 s, 14.1 MB/s
real 0m37.310s
user 0m0.080s
sys 0m30.570s
524288000 bytes (524 MB) copied, 37.4099 s, 14.0 MB/s
real 0m37.975s
user 0m0.010s
sys 0m30.920s
524288000 bytes (524 MB) copied, 40.4051 s, 13.0 MB/s
real 0m40.960s
user 0m0.020s
sys 0m30.370s
524288000 bytes (524 MB) copied, 37.1347 s, 14.1 MB/s
real 0m37.700s
user 0m0.020s
sys 0m29.690s
# top shows that the cpu runs at 100%
$ sync; time dd if=~/test.tmp of=/dev/null bs=500K count=1024
524288000 bytes (524 MB) copied, 32.3565 s, 16.2 MB/s
real 0m32.376s
user 0m0.180s
sys 0m29.090s
524288000 bytes (524 MB) copied, 32.3613 s, 16.2 MB/s
real 0m32.382s
user 0m0.100s
sys 0m29.280s
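Copying these figures by hand gets tedious, so here is a rough Python sketch that pulls them out of pasted output. It assumes the dd summary and `time` line formats exactly as they appear in this thread; newer dd versions print a different summary line. The function name and regexes are illustrative.

```python
import re

# Formats as they appear in this thread, e.g. "real 0m37.310s" and
# "524288000 bytes (524 MB) copied, 37.2794 s, 14.1 MB/s".
TIME_RE = re.compile(r"^(real|user|sys)\s+(\d+)m([\d.]+)s$")
DD_RE = re.compile(r"^(\d+) bytes \(.*?\) copied, ([\d.]+) s, ([\d.]+) MB/s$")


def parse_run(lines):
    """Collect byte count, dd seconds, dd rate and real/user/sys seconds."""
    run = {}
    for line in lines:
        line = line.strip()
        m = TIME_RE.match(line)
        if m:
            run[m.group(1)] = int(m.group(2)) * 60 + float(m.group(3))
            continue
        m = DD_RE.match(line)
        if m:
            run["bytes"] = int(m.group(1))
            run["dd_seconds"] = float(m.group(2))
            run["mb_per_s"] = float(m.group(3))
    return run


# First PIO write run from the log above.
run = parse_run([
    "524288000 bytes (524 MB) copied, 37.2794 s, 14.1 MB/s",
    "real 0m37.310s",
    "user 0m0.080s",
    "sys 0m30.570s",
])
cpu_share = (run["user"] + run["sys"]) / run["real"]
print(run)
print(f"CPU share of wall time: {cpu_share * 100:.0f} %")
```

Lines that match neither pattern are ignored, so prompts and comments can be pasted along with the output.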
I did performance testing with this driver.
sdhci-bcm2835: write: 702 kB/s read: 14.3 MB/s
bcm2835-mmc: write: 11.1 MB/s read: 15.9 MB/s
bcm2835-mmc + DMA: write: 10.4 MB/s read: 17.0 MB/s
Any ideas why DMA causes a drop in write speed?
I will try to repeat the tests and prepare more data.