pietrushnic / rpi-dt-linux

This repository aims to handle all patches required for Raspberry Pi support in upstream Linux kernel.

slave_sg and sdhci-bcm2835 replacement with bcm2835-mmc fixes #4 #5

Closed pietrushnic closed 9 years ago

pietrushnic commented 9 years ago

I did performance testing with this driver.

sdhci-bcm2835: write: 702 kB/s read: 14.3 MB/s
bcm2835-mmc: write: 11.1 MB/s read: 15.9 MB/s
bcm2835-mmc + DMA: write: 10.4 MB/s read: 17.0 MB/s

Any ideas why DMA causes a drop in write speed?

I will try to repeat the tests and prepare more data.
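For reference, the relative changes implied by the three results above can be worked out with a short script. This is only a sketch: the figures are copied from this comment, 702 kB/s is taken as 0.702 MB/s, and the helper name `percent_change` is made up for illustration.

```python
# Throughput figures quoted above, in MB/s (702 kB/s taken as 0.702 MB/s).
results = {
    "sdhci-bcm2835": {"write": 0.702, "read": 14.3},
    "bcm2835-mmc": {"write": 11.1, "read": 15.9},
    "bcm2835-mmc + DMA": {"write": 10.4, "read": 17.0},
}


def percent_change(old, new):
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


pio = results["bcm2835-mmc"]
dma = results["bcm2835-mmc + DMA"]
for op in ("write", "read"):
    print(f"{op}: DMA vs PIO {percent_change(pio[op], dma[op]):+.1f}%")
```

With these numbers DMA writes come out about 6% slower and reads about 7% faster than PIO, so the write drop is real but small.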

notro commented 9 years ago

I did some simple tests here: https://github.com/raspberrypi/linux/pull/652#issuecomment-52702988 How do you get 11.1MB/s write speed without DMA? Sounds impossible to me.

pietrushnic commented 9 years ago

@notro by +DMA I mean bcm2835-dma.c with slave_sg. I will post configs to show which exact options I used.

pietrushnic commented 9 years ago

Two tests that I performed with logs:

pi@raspberrypi ~ $ zgrep ARCH_BCM2 /proc/config.gz
CONFIG_ARCH_BCM2835=y
pi@raspberrypi ~ $ uname -a
Linux raspberrypi 3.17.1+ #59 Sat Oct 25 01:54:16 CEST 2014 armv6l GNU/Linux
pi@raspberrypi ~ $ zcat /proc/config.gz |grep -E "DMA|MMC"|grep -v "not set"
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_ARM_DMA_MEM_BUFFERABLE=y
CONFIG_ZONE_DMA_FLAG=0
CONFIG_SCSI_DMA=y
CONFIG_MMC=y
# MMC/SD/SDIO Card Drivers
CONFIG_MMC_BLOCK=y
CONFIG_MMC_BLOCK_MINORS=8
CONFIG_MMC_BLOCK_BOUNCE=y
# MMC/SD/SDIO Host Controller Drivers
CONFIG_MMC_SDHCI=y
CONFIG_MMC_SDHCI_PLTFM=y
CONFIG_MMC_BCM2835=y
CONFIG_HAS_DMA=y
pi@raspberrypi ~ $ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 47.5086 s, 11.0 MB/s

real    0m47.730s
user    0m0.030s
sys     0m25.980s

real    0m8.991s
user    0m0.000s
sys     0m0.130s
pi@raspberrypi ~ $ dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 33.2664 s, 15.8 MB/s

and DMA with slave_sg transfer mode

pi@raspberrypi ~ $ zgrep ARCH_BCM2 /proc/config.gz
CONFIG_ARCH_BCM2835=y
pi@raspberrypi ~ $ uname -a
Linux raspberrypi 3.17.1+ #61 Sat Oct 25 02:08:29 CEST 2014 armv6l GNU/Linux
pi@raspberrypi ~ $ zcat /proc/config.gz |grep -E "DMA|MMC"|grep -v "not set"
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_ARM_DMA_MEM_BUFFERABLE=y
CONFIG_ZONE_DMA_FLAG=0
CONFIG_SCSI_DMA=y
CONFIG_MMC=y
# MMC/SD/SDIO Card Drivers
CONFIG_MMC_BLOCK=y
CONFIG_MMC_BLOCK_MINORS=8
CONFIG_MMC_BLOCK_BOUNCE=y
# MMC/SD/SDIO Host Controller Drivers
CONFIG_MMC_SDHCI=y
CONFIG_MMC_SDHCI_PLTFM=y
CONFIG_MMC_BCM2835=y
CONFIG_MMC_BCM2835_DMA=y
CONFIG_MMC_BCM2835_PIO_DMA_BARRIER=2
CONFIG_DMADEVICES=y
# DMA Devices
CONFIG_DMA_BCM2835=y
CONFIG_DMA_ENGINE=y
CONFIG_DMA_VIRTUAL_CHANNELS=y
CONFIG_DMA_OF=y
# DMA Clients
CONFIG_HAS_DMA=y
pi@raspberrypi ~ $ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 49.1707 s, 10.7 MB/s

real    0m49.196s
user    0m0.030s
sys     0m15.270s

real    0m13.054s
user    0m0.000s
sys     0m0.070s
pi@raspberrypi ~ $ dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 31.1292 s, 16.8 MB/s
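
One thing the dd figure leaves out is the trailing `time sync`: dd prints its rate before the page cache has been fully flushed. A rough sketch of the effective write rate, counting both `real` times from the two logs above, could look like this. The helper name `effective_mbps` is hypothetical, and MB means 10^6 bytes as in dd's own output.

```python
def effective_mbps(nbytes, dd_real_s, sync_real_s):
    """Write rate in MB/s (10**6 bytes) over dd time plus the final sync."""
    return nbytes / (dd_real_s + sync_real_s) / 1e6


NBYTES = 524288000  # bs=500K count=1024

# 'real' times copied from the two runs above.
pio = effective_mbps(NBYTES, 47.730, 8.991)
dma = effective_mbps(NBYTES, 49.196, 13.054)
print(f"PIO: {pio:.1f} MB/s, DMA: {dma:.1f} MB/s")
```

Counted this way both runs land below the rate dd printed, and most of the gap between them comes from the longer sync in the DMA run.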

notro commented 9 years ago

I didn't think it possible to do 10MB/s using PIO. I can do a test if you update rpi-dt-firmware or rpi-bcm2835. What numbers do you get on the vanilla Pi kernel (which uses the same driver) ? Probably best to ask Gellert about this.

pietrushnic commented 9 years ago

I added logging: it uses bcm2835_mmc_transfer_pio and gives 11.3MB/s. This is v3.17.1 with:

on top. I tried clean v3.17.1 and got 700KB/s.

MMC_BCM2835_DMA can't be compiled without DMADEVICES. Without DMA_BCM2835 I get write: 9.4MB/s and read: 15.7MB/s. If I enable DMA_BCM2835, the system hangs with the above Oops; this is because the slave_sg transfer mode is missing. If I merge slave_sg, I get write: 10.3MB/s and read: 16.1MB/s. As a sanity check I verified whether it uses bcm2835_mmc_transfer_dma, and it doesn't: it still uses bcm2835_mmc_transfer_pio. What is the procedure for enabling bcm2835-mmc with DMA?

pietrushnic commented 9 years ago

OK, I found the reason:

mmc-bcm2835: Unable to initialise DMA channels. Falling back to PIO

pietrushnic commented 9 years ago

The above problem was caused by the lack of DMA_VIRTUAL_CHANNELS, on which DMA_BCM2835 depends. With it enabled I get write: 11.2MB/s and read: 17MB/s. A sanity check shows that it uses bcm2835_mmc_transfer_dma. Still, the question is why bcm2835-mmc with slave_sg DMA is slower than PIO when writing.

If you want to test this kernel please use mmc-dma branch of rpi-dt-firmware: https://github.com/pietrushnic/rpi-dt-firmware/tree/mmc-dma
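
Since a single missing option silently drops the driver back to PIO, it may help to check the config up front. The following is a minimal sketch, assuming the set of options named in this thread is the complete list; the function name `missing_options` and the sample fragment are made up for illustration.

```python
# Options this thread found necessary for the DMA path (taken from the
# comments above; assumed, not verified, to be the complete list).
REQUIRED = (
    "CONFIG_MMC_BCM2835",
    "CONFIG_MMC_BCM2835_DMA",
    "CONFIG_DMADEVICES",
    "CONFIG_DMA_BCM2835",
    "CONFIG_DMA_VIRTUAL_CHANNELS",
)


def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to y or m in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


# Made-up config fragment, for illustration only.
sample = """\
CONFIG_MMC_BCM2835=y
CONFIG_MMC_BCM2835_DMA=y
CONFIG_DMADEVICES=y
CONFIG_DMA_BCM2835=y
# CONFIG_DMA_VIRTUAL_CHANNELS is not set
"""
print(missing_options(sample))
```

On a running system the text could come from `gzip.open("/proc/config.gz", "rt").read()` instead of the sample string.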

notro commented 9 years ago

These are my numbers on a Model B+ with a Sandisk Ultra 8GB

Your kernel

$ sudo REPO_URI=https://github.com/pietrushnic/rpi-dt-firmware BRANCH=mmc-dma rpi-update && sudo reboot

$ uname -a
Linux raspberrypi 3.17.1+ #81 Sat Oct 25 23:08:18 CEST 2014 armv6l GNU/Linux

$ dmesg
[    2.435020] DMA channels allocated for the MMC driver
[    2.472724] Load BCM2835 MMC driver

$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 35.7482 s, 14.7 MB/s

real    0m35.768s
user    0m0.000s
sys     0m16.360s

real    0m6.996s
user    0m0.000s
sys     0m0.080s

$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 36.4759 s, 14.4 MB/s

real    0m36.503s
user    0m0.030s
sys     0m16.350s

real    0m5.783s
user    0m0.010s
sys     0m0.060s

$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 37.5479 s, 14.0 MB/s

real    0m38.117s
user    0m0.040s
sys     0m14.910s

real    0m5.932s
user    0m0.000s
sys     0m0.060s

$ dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 27.6864 s, 18.9 MB/s

$ dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 27.6941 s, 18.9 MB/s

The official Pi kernel using the same driver

$ sudo rpi-update && sudo reboot

$ uname -a
Linux raspberrypi 3.12.30+ #717 PREEMPT Fri Oct 17 18:46:31 BST 2014 armv6l GNU/Linux

$ dmesg
[    2.056872] DMA channels allocated for the MMC driver
[    2.100763] Load BCM2835 MMC driver

$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 40.9618 s, 12.8 MB/s

real    0m40.987s
user    0m0.050s
sys     0m10.170s

real    0m1.908s
user    0m0.000s
sys     0m0.010s

$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 38.1315 s, 13.7 MB/s

real    0m38.605s
user    0m0.020s
sys     0m10.530s

real    0m4.845s
user    0m0.000s
sys     0m0.080s

$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024; time sync
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 40.5635 s, 12.9 MB/s

real    0m41.034s
user    0m0.010s
sys     0m10.240s

real    0m3.121s
user    0m0.000s
sys     0m0.020s

$ dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 27.7841 s, 18.9 MB/s

$ dd if=~/test.tmp of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 27.7765 s, 18.9 MB/s
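
To compare the two kernels without eyeballing individual runs, the write rates above can be averaged. This is a small sketch using only the figures from this comment; the dictionary labels are informal names, not anything the kernels report.

```python
from statistics import mean

# Write rates in MB/s from the three runs of each kernel above.
writes = {
    "3.17.1+ (mmc-dma branch)": [14.7, 14.4, 14.0],
    "3.12.30+ (official)": [12.8, 13.7, 12.9],
}

for kernel, rates in writes.items():
    spread = max(rates) - min(rates)
    print(f"{kernel}: mean {mean(rates):.1f} MB/s, spread {spread:.1f} MB/s")
```

Three runs per kernel is a small sample, so the difference of roughly 1.2 MB/s between the means should be read as indicative only.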

pietrushnic commented 9 years ago

Is the second result ("The official Pi kernel using the same driver") with DMA enabled?

So we have a slight improvement with slave_sg mode, but what about your concerns about PIO? You said that it is impossible to get 11MB/s with PIO, so what is wrong with my configuration and why am I getting this?

My .config: http://paste.ubuntu.com/8694441/

notro commented 9 years ago

Is this second result with DMA enabled ?

Yes, as the kernel log says:

[    2.056872] DMA channels allocated for the MMC driver
[    2.100763] Load BCM2835 MMC driver

It would say something about PIO if DMA were not used.
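
The same check can be scripted. Here is a minimal sketch that only looks for the three log messages quoted in this thread; the function name `mmc_transfer_mode` is hypothetical, and other kernel versions may word these messages differently.

```python
def mmc_transfer_mode(dmesg_text):
    """Guess whether bcm2835-mmc uses DMA or PIO from kernel log text."""
    # Message fragments as they appear in the logs quoted in this thread.
    pio_markers = (
        "Forcing PIO mode",
        "Falling back to PIO",
    )
    if any(marker in dmesg_text for marker in pio_markers):
        return "pio"
    if "DMA channels allocated for the MMC driver" in dmesg_text:
        return "dma"
    return "unknown"


log = """\
[    2.056872] DMA channels allocated for the MMC driver
[    2.100763] Load BCM2835 MMC driver
"""
print(mmc_transfer_mode(log))
```

The input could be the output of `dmesg`, for example captured with `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`.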

So we have a slight improvement with slave_sg mode, but what about your concerns about PIO? You said that it is impossible to get 11MB/s with PIO, so what is wrong with my configuration and why am I getting this?

Well, I must be wrong here. But, I still don't understand how it's possible to get that kind of speed using PIO. But if DMA is faster than PIO, we don't need to find out, do we :-)

pietrushnic commented 9 years ago

I found this in the BCM2835 documentation:

The EMMC module restricts the maximum block size to the size of the internal data FIFO
which is 1k bytes. In order to get maximum performance for data transfers it is necessary to
use multiple block data transfers. In this case the EMMC module uses two FIFOs in ping-
pong mode, i.e. one is used to transfer data to/from the card while the other is simultaneously
accessed by DMA via the AXI bus. If the EMMC module is configured for single block
transfers only one FIFO is used, so no DMA access is possible while data is transferred
to/from the card and vice versa resulting in long dead times.

Is it possible that this is our case? How can we check whether we use one or two FIFOs?
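
To make the datasheet's point concrete, here is a toy timing model, not a measurement. It assumes each block costs a fixed time on the card side and a fixed time on the DMA side; both durations and both function names are invented for illustration.

```python
def single_fifo_time(blocks, card_us, dma_us):
    """One FIFO: card transfer and DMA access alternate and never overlap."""
    return blocks * (card_us + dma_us)


def ping_pong_time(blocks, card_us, dma_us):
    """Two FIFOs: after the first block, the slower stage sets the pace."""
    if blocks == 0:
        return 0
    return card_us + dma_us + (blocks - 1) * max(card_us, dma_us)


# Hypothetical per-block costs in microseconds, for illustration only.
blocks, card_us, dma_us = 1000, 50, 30
print(single_fifo_time(blocks, card_us, dma_us))
print(ping_pong_time(blocks, card_us, dma_us))
```

This does not tell us which mode the driver is actually in; it only shows why losing the overlap would cost throughput.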

notro commented 9 years ago

Maybe @weiszg can answer?

pietrushnic commented 9 years ago

I talked on IRC with @rossoldfield and we came to the conclusion that we may be hitting the CPU limit with my SD card, which would explain why there is no improvement in transfer speed between PIO and DMA. But @notro, look at the sys CPU time: with PIO I have 25s, but with DMA only 15s. In your case I see better sys CPU time and worse throughput with the RPi kernel, and worse sys CPU time and better throughput with my driver. I assume that either better throughput or lower sys CPU time can be an argument for MMC with DMA.

pietrushnic commented 9 years ago

Vanilla RPi 3.12.28 write 10.9 MB/s:

real    0m48.003s
user    0m0.010s
sys     0m10.260s

Vanilla RPi 3.12.28 read 17.0 MB/s:

real    0m30.907s
user    0m0.030s
sys     0m4.050s

3.17.1 with bcm2835-mmc + slave_sg DMA write 11.3MB/s:

real    0m46.469s
user    0m0.000s
sys     0m13.710s

3.17.1 with bcm2835-mmc + slave_sg DMA read 17MB/s:

real    0m30.859s
user    0m0.010s
sys     0m5.640s

My biggest problem with those numbers is that DMA + slave_sg loads the processor more than vanilla RPi. What is better in the DMA for bcm2708?

3.17.1 with bcm2835-mmc + PIO write 11.6MB/s:

real    0m45.295s
user    0m0.040s
sys     0m26.130s

3.17.1 with bcm2835-mmc + PIO read 16.2MB/s:


real    0m32.364s
user    0m0.020s
sys     0m23.730s

It looks like there is no free lunch. I can reach over 11.6MB/s, but with much higher CPU load and a drop in read speed.
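
The CPU cost is easier to compare as a share of wall-clock time. Below is a rough sketch using the `real` and `sys` values from the six results above; the helper name `sys_share` and the short labels are made up here.

```python
# (label, real seconds, sys seconds) copied from the results above.
runs = [
    ("vanilla 3.12.28 write", 48.003, 10.260),
    ("vanilla 3.12.28 read", 30.907, 4.050),
    ("3.17.1 DMA write", 46.469, 13.710),
    ("3.17.1 DMA read", 30.859, 5.640),
    ("3.17.1 PIO write", 45.295, 26.130),
    ("3.17.1 PIO read", 32.364, 23.730),
]


def sys_share(real_s, sys_s):
    """Kernel CPU time as a percentage of wall-clock time."""
    return sys_s / real_s * 100.0


for label, real_s, sys_s in runs:
    print(f"{label}: {sys_share(real_s, sys_s):.0f}% sys")
```

By this measure PIO spends well over half of the wall-clock time in the kernel, while both DMA configurations stay below a third.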

@notro it would be great if you could run the same tests. I will try tomorrow with a no-name SD card. My main question here is: taking those numbers into consideration, is it worth upstreaming MMC + DMA with slave_sg?

notro commented 9 years ago

To get a usable system we need DMA for this. Unless someone from the Foundation chimes in with some help here, I suggest just leaving the performance as it is. We can nitpick later. Further down the road when this kernel has the same feature set as the vanilla one, I expect this issue will get priority.

$ sudo REPO_URI=https://github.com/pietrushnic/rpi-dt-firmware rpi-update && sudo reboot

$ uname -a
Linux raspberrypi 3.16.6+ #22 Tue Oct 21 22:24:10 CEST 2014 armv6l GNU/Linux

$ zgrep BCM2835_DMA /proc/config.gz
# CONFIG_MMC_BCM2835_DMA is not set

$ dmesg
[    2.526729] Forcing PIO mode
[    2.561669] Load BCM2835 MMC driver

# top shows that the cpu runs at 100%
$ sync; time dd if=/dev/zero of=~/test.tmp bs=500K count=1024;

524288000 bytes (524 MB) copied, 37.2794 s, 14.1 MB/s
real    0m37.310s
user    0m0.080s
sys     0m30.570s

524288000 bytes (524 MB) copied, 37.4099 s, 14.0 MB/s
real    0m37.975s
user    0m0.010s
sys     0m30.920s

524288000 bytes (524 MB) copied, 40.4051 s, 13.0 MB/s
real    0m40.960s
user    0m0.020s
sys     0m30.370s

524288000 bytes (524 MB) copied, 37.1347 s, 14.1 MB/s
real    0m37.700s
user    0m0.020s
sys     0m29.690s

# top shows that the cpu runs at 100%
$ sync; time dd if=~/test.tmp of=/dev/null bs=500K count=1024

524288000 bytes (524 MB) copied, 32.3565 s, 16.2 MB/s
real    0m32.376s
user    0m0.180s
sys     0m29.090s

524288000 bytes (524 MB) copied, 32.3613 s, 16.2 MB/s
real    0m32.382s
user    0m0.100s
sys     0m29.280s
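
For anyone collecting more runs like these, the dd summary and the `time` output can be parsed rather than copied by hand. This is a sketch that assumes the exact output format shown in this thread (other dd versions or locales may print it differently); `parse_run` is a hypothetical helper name.

```python
import re

DD_RE = re.compile(r"(\d+) bytes .* copied, ([\d.]+) s, ([\d.]+) MB/s")
TIME_RE = re.compile(r"^(real|user|sys)\s+(\d+)m([\d.]+)s$", re.MULTILINE)


def parse_run(text):
    """Extract bytes, dd seconds, dd rate and real/user/sys seconds.

    Assumes text contains one dd summary line and one block of time output.
    """
    dd = DD_RE.search(text)
    run = {
        "bytes": int(dd.group(1)),
        "seconds": float(dd.group(2)),
        "mbps": float(dd.group(3)),
    }
    for name, minutes, seconds in TIME_RE.findall(text):
        run[name] = int(minutes) * 60 + float(seconds)
    return run


# First PIO write run from the log above.
sample = """\
524288000 bytes (524 MB) copied, 37.2794 s, 14.1 MB/s
real    0m37.310s
user    0m0.080s
sys     0m30.570s
"""
run = parse_run(sample)
print(run["mbps"], run["real"], run["sys"])
```

For this run `sys` divided by `real` is roughly 82%, which fits with top showing the CPU fully busy in PIO mode.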