raspberrypi / usbboot

Raspberry Pi USB booting code, moved from tools repository
Apache License 2.0
905 stars 229 forks source link

the new msd/start4.elf performance seems slower then expected #68

Closed cleverca22 closed 3 years ago

cleverca22 commented 3 years ago

Describe the bug the new msd/start4.elf performance seems slower then expected

To Reproduce

# boot msd/start4.elf on a rpi4b
[root@amd-nixos:~]# time dd if=/dev/sdf of=/dev/null bs=4096
1940480+0 records in
1940480+0 records out
7948206080 bytes (7.9 GB, 7.4 GiB) copied, 718.837 s, 11.1 MB/s

real    11m58.853s

Expected behaviour the exact same uSD card, now in a laptop with a built-in mmc reader

[root@system76:~]# time dd if=/dev/mmcblk0 bs=4096 of=/dev/null
1940480+0 records in
1940480+0 records out
7948206080 bytes (7.9 GB, 7.4 GiB) copied, 185.129 s, 42.9 MB/s

real    3m5.138s

Bootloader version and configuration

[clever@amd-nixos:usbboot]$ git rev-parse HEAD
ff86c6dedebdc34382406915c44ef7933b20a177

SD card boot (please complete the following information):

[root@system76:/sys/class/block/mmcblk0/device]# cat cid
035344534c30384780956a0dae00fb01

[root@system76:/sys/class/block/mmcblk0/device]# cat csd
400e00325b5900003b377f800a404001

[root@system76:/sys/class/block/mmcblk0/device]# cat manfid
0x000003

[root@system76:/sys/class/block/mmcblk0/device]# cat name
SL08G

[root@system76:/sys/class/block/mmcblk0/device]# cat serial
0x956a0dae

a relatively recent 8gig samsung uSD card

Additional context and one minor nit-pick: Bus 006 Device 087: ID 0a5c:0001 Broadcom Corp. Compute Module the msd/start4.elf always claims to be a compute module, even if booting on an rpi4b

lurch commented 3 years ago

Might be worth booting a Pi4B from a USB drive, and then using that to benchmark the Pi's SD card slot? That'll be a fairer test than comparing against a laptop's card-reader, which may support SD modes that the Pi4B's SD card slot doesn't support.

cleverca22 commented 3 years ago
root@raspberrypi:~# time dd if=/dev/mmcblk0 of=/dev/null bs=4096
1940480+0 records in
1940480+0 records out
7948206080 bytes (7.9 GB, 7.4 GiB) copied, 181.309 s, 43.8 MB/s

real    3m1.316s
user    0m1.574s
sys     0m29.028s

rpios running on the rpi4 directly, managed to go slightly faster then the x86 laptop! so its not the pi hardware, but the code within msd/start4.elf and/or the usb link

cleverca22 commented 3 years ago
root@raspberrypi:~# cat /sys/kernel/debug/mmc0/ios
clock:          50,000,000 Hz
actual clock:   50,000,000 Hz
vdd:            21 (3.3 ~ 3.4 V)
bus mode:       2 (push-pull)
chip select:    0 (don't care)
power mode:     2 (on)
bus width:      2 (4 bits)
timing spec:    7 (sd uhs DDR50)
signal voltage: 1 (1.80 V)
driver type:    0 (driver type B)

vs the laptop:

[root@system76:~]# cat /sys/kernel/debug/mmc0/ios 
clock:          100,000,000 Hz
vdd:            21 (3.3 ~ 3.4 V)
bus mode:       2 (push-pull)
chip select:    0 (don't care)
power mode:     2 (on)
bus width:      2 (4 bits)
timing spec:    5 (sd uhs SDR50)
signal voltage: 1 (1.80 V)
driver type:    0 (driver type B)

so the pi4 with raspios, used DDR50, but the laptop used SDR100?, giving the same effective data-rate and msd/start4.elf doesnt respect uart_2ndstage=1 so i have no clue what its doing internally

and i know its asking for a bit much, but what about msd/start4.elf at least blinking the activity LED or showing logs to hdmi?

lurch commented 3 years ago

If getting maximum flashing speed is important to you, you could try updating https://github.com/balena-os/amber-etcher-kernel to work on the CM4 (IIRC that uses usbboot to load a linux-kernel-with-embedded-initrd onto the Pi, with the initrd containing scripts to expose the SD / eMMC as a USB MSD using Linux's usb-gadget mode). I think the speed-difference might be down to the Linux kernel running on the CPU, but msd/start4.elf running on the VPU (but that's just a guess).

cleverca22 commented 3 years ago

ah yeah, i can see how the 500mhz dual-core vs >1ghz quad-core, would make a fairly drastic difference

about the only way i can see to diagnose that more, would be for msd/start4.elf to print cpu usage stats to the serial if you uart_2ndstage=1, but thats pretty low priority, since you can just boot a linux based MSD instead

that only leaves 2 minor bugs:

gpio condionals in bootconf.txt might work, but ive not tried changing the BOOT_ORDER dynamically yet

update: yep, conditionals work! took https://github.com/raspberrypi/firmware/issues/1076 as an example, and put the following into bootconf.txt:

[all]
gpio=5=ip,pu

[gpio5=0]
BOOT_ORDER=0x41

[gpio5=1]
BOOT_ORDER=0x3

now it goes into usb-dev mode by default, and the self-reflashing can update without any fuss and if i jumper gpio5<->gnd, it will instead boot from SD or usb-host!

cleverca22 commented 3 years ago
[107888.744234] usb-storage 6-2:1.0: USB Mass Storage device detected
[107888.744359] usb-storage 6-2:1.0: Quirks match for vid 0525 pid a4a5: 10000
[107888.744417] scsi host10: usb-storage 6-2:1.0
[107889.766303] scsi 10:0:0:0: Direct-Access     Linux    File-Stor Gadget 0504 PQ: 0 ANSI: 2
[root@amd-nixos:~]# time dd if=/dev/sde of=/dev/null bs=4096
1940480+0 records in
1940480+0 records out
7948206080 bytes (7.9 GB, 7.4 GiB) copied, 242.689 s, 32.8 MB/s

real    4m2.697s

https://github.com/librerpi/rpi-tools/blob/master/nixos-msd/default.nix#L35

running the official kernel/modules, with a custom initrd/busybox, i was able to get ~74% the speed of a direct connection which is nearing the limits of usb2.0, but i dont know the exact implementation (10:8 encoding? overhead bytes per packet? packet size?) so i cant compute the exact limits

definitely a big improvement, and something to consider for when i need it in the future linux is also able to drive the ACT led normally, so it behaves more like a normal MSD as well

edit:

(over serial console)
/sys/devices/platform/soc/fe980000.usb/gadget/lun0 # truncate -s $((1024*1024*1024)) /disk.img
/sys/devices/platform/soc/fe980000.usb/gadget/lun0 # echo /disk.img > file
(back on host)
[root@amd-nixos:~]# time dd if=/dev/sde of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 32.7626 s, 32.8 MB/s

real    0m32.768s

and if i remove the SD card from the loop and just read a dummy disk in tmpfs, i get the same speed, definitely a bottleneck in the usb area now