skylar-byte closed this issue 3 years ago
Could you provide an example workflow / benchmark that would trigger it? What devices are connected over PCIe (SATA? NVMe? Anything on x1 or on x16 PCIe slot?)
Pretty much anything that requires heavy lifting (e.g. multi-threaded compiling). Using one core heavily, such as during an archive extraction, also triggers issues with DMA (https://bugzilla.kernel.org/show_bug.cgi?id=207095).
The issue is present with NVMe and SATA. I also have the GPU on x16 and USB on x1.
Could you try a couple more workloads to see if they cause the issue?
# Disk IO mixed 4K reads/writes
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
# Graphics (vsync disabled, so it will reach a high FPS)
vblank_mode=1 glxgears
I also get hangs, but I get the feeling this is due to "high load" on PCI. Just idling (e.g. using Terminal on the desktop image) works fine. The kernel doesn't report any issues in my case.
I don't have a desktop environment installed on my image, therefore I can't currently test the 3D graphics. I don't have libaio but was able to run your test with posixaio. I'm running both of these tests on a SATA drive.
5.2.0:
fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=64
fio-3.19
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=10.4MiB/s,w=3678KiB/s][r=2652,w=919 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=7498: Tue Apr 21 08:58:16 2020
read: IOPS=2868, BW=11.2MiB/s (11.8MB/s)(3070MiB/273948msec)
bw ( KiB/s): min= 4289, max=12667, per=100.00%, avg=11487.02, stdev=1431.47, samples=543
iops : min= 1072, max= 3166, avg=2871.40, stdev=357.87, samples=543
write: IOPS=958, BW=3835KiB/s (3927kB/s)(1026MiB/273948msec); 0 zone resets
bw ( KiB/s): min= 1373, max= 4571, per=100.00%, avg=3838.72, stdev=508.06, samples=543
iops : min= 343, max= 1142, avg=959.28, stdev=126.99, samples=543
cpu : usr=4.95%, sys=0.91%, ctx=144731, majf=0, minf=20
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.4%, 32=86.7%, >=64=12.8%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=93.6%, 8=2.4%, 16=2.5%, 32=1.4%, 64=0.1%, >=64=0.0%
issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=11.2MiB/s (11.8MB/s), 11.2MiB/s-11.2MiB/s (11.8MB/s-11.8MB/s), io=3070MiB (3219MB), run=273948-273948msec
WRITE: bw=3835KiB/s (3927kB/s), 3835KiB/s-3835KiB/s (3927kB/s-3927kB/s), io=1026MiB (1076MB), run=273948-273948msec
5.6.5:
fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=64
fio-3.19
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][57.5%][r=10.5MiB/s,w=3632KiB/s][r=2694,w=908 IOPS][eta 02m:02s]
It ran for about 3 minutes, then locked up. I've also noticed that the new kernel sometimes hangs during boot at the following step:
* Populating /dev with existing devices through uevents ...
Could you try on the same disk images as me? https://github.com/sifive/meta-sifive/releases/tag/2020.04.00
I looked at the 1st partition and noticed a fitImage. Can I build OpenSBI with this kernel? I used your image's 4th partition and had the following results.
5.2.0:
glxgears ran fine, around 400FPS.
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.17-dirty
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
test: (groupid=0, jobs=1): err= 0: pid=616: Tue Apr 21 16:53:58 2020
read: IOPS=5825, BW=22.8MiB/s (23.9MB/s)(3070MiB/134901msec)
bw ( KiB/s): min= 3656, max=26914, per=99.75%, avg=23244.81, stdev=3510.92, samples=269
iops : min= 914, max= 6728, avg=5810.94, stdev=877.70, samples=269
write: IOPS=1947, BW=7788KiB/s (7975kB/s)(1026MiB/134901msec); 0 zone resets
bw ( KiB/s): min= 1045, max= 9189, per=99.75%, avg=7768.88, stdev=1174.41, samples=269
iops : min= 261, max= 2297, avg=1941.66, stdev=293.61, samples=269
cpu : usr=8.78%, sys=87.34%, ctx=51282, majf=0, minf=13
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=22.8MiB/s (23.9MB/s), 22.8MiB/s-22.8MiB/s (23.9MB/s-23.9MB/s), io=3070MiB (3219MB), run=134901-134901msec
WRITE: bw=7788KiB/s (7975kB/s), 7788KiB/s-7788KiB/s (7975kB/s-7975kB/s), io=1026MiB (1076MB), run=134901-134901msec
5.6.5:
Froze at different times during boot; I wasn't able to reach the desktop.
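For anyone comparing the two 5.2.0 fio runs posted earlier in this thread, here is a minimal Python sketch that pulls the per-direction IOPS out of fio's human-readable summary and prints the ratio. The function name `parse_fio_iops` is made up for this example, and the regular expression only targets the `read:` / `write:` summary lines as they appear in the pasted output. Keep in mind the two runs differ in both I/O engine (posixaio vs. libaio) and disk image, so the ratio is only a rough indication.

```python
import re

# Matches fio per-direction summary lines such as
# "read: IOPS=2868, BW=11.2MiB/s (11.8MB/s)(3070MiB/273948msec)".
# The optional "k" covers fio's abbreviated form (e.g. "IOPS=12.3k").
SUMMARY_RE = re.compile(
    r"^\s*(read|write)\s*:\s*IOPS=(\d+(?:\.\d+)?)(k?)", re.MULTILINE
)

def parse_fio_iops(text):
    """Return {'read': iops, 'write': iops} from fio's text output."""
    result = {}
    for direction, value, kilo in SUMMARY_RE.findall(text):
        result[direction] = float(value) * (1000 if kilo else 1)
    return result

# Summary lines copied from the two 5.2.0 runs in this thread.
posixaio_run = """\
read: IOPS=2868, BW=11.2MiB/s (11.8MB/s)(3070MiB/273948msec)
write: IOPS=958, BW=3835KiB/s (3927kB/s)(1026MiB/273948msec); 0 zone resets
"""
libaio_run = """\
read: IOPS=5825, BW=22.8MiB/s (23.9MB/s)(3070MiB/134901msec)
write: IOPS=1947, BW=7788KiB/s (7975kB/s)(1026MiB/134901msec); 0 zone resets
"""

a = parse_fio_iops(posixaio_run)
b = parse_fio_iops(libaio_run)
for direction in ("read", "write"):
    ratio = b[direction] / a[direction]
    print(f"{direction}: {a[direction]:.0f} -> {b[direction]:.0f} IOPS ({ratio:.2f}x)")
```

Saving each run's full output to a file and passing the file contents to `parse_fio_iops` should work the same way, since the lowercase `iops :` statistics lines and the uppercase `READ:` / `WRITE:` group lines are not matched.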
Hi @skylar-byte - we're investigating this internally. It might take a couple of weeks.
Hi @skylar-byte, we're coordinating this with one of our partners. Unfortunately, no firm timeline as to when something will be posted, but we'll keep this issue updated as we receive more information.
Hi @skylar-byte, there is a newer Microsemi PolarFire FPGA bitfile you can try that has helped improve the performance and reliability of the HiFive Unleashed Microsemi expansion board system. This newer bitfile has improvements in clocking and reset performance. No single bitstream is proven to work on all systems, but this newer bitfile seems to work better for some users.
The bitfile can be found in the "Firmware Versions" table, in the "Second Release" row, on this page: https://github.com/polarfire-soc/polarfire-soc-documentation/blob/master/boards/mpfs-dev-kit/MPFS-DEV-KIT_user_guide.md
Hi @JimSughrue, thanks for the tip. I loaded the bitstream on FP Express as I'm using the FP5 programmer. I ran a device info before trying to erase and program my board and was met with the following message:
Software Version: 12.200.35.9
creating folder: C:\Microsemi\Program_Debug_PolarFire_v2.3\Program_Debug_Tool\bin\HiFive Unleashed Expansion Board v2\VeraShell
programmer 'S2011JOFBA' : FlashPro5
Created new project 'C:\Microsemi\Program_Debug_PolarFire_v2.3\Program_Debug_Tool\bin\HiFive Unleashed Expansion Board v2\VeraShell\VeraShell.pro'
STAPL file 'C:\Microsemi\Program_Debug_PolarFire_v2.3\Program_Debug_Tool\bin\HiFive Unleashed Expansion Board v2\VeraShell\VeraShell.stp' has been loaded successfully.
DESIGN : VeraShell; CHECKSUM : 88E9; ALG_VERSION : 1
creating folder: C:\Microsemi\Program_Debug_PolarFire_v2.3\Program_Debug_Tool\bin\HiFive Unleashed Expansion Board v2\VeraShell\projectData
Software Version: 12.200.35.9
STAPL file 'C:\Microsemi\Program_Debug_PolarFire_v2.3\Program_Debug_Tool\bin\HiFive Unleashed Expansion Board v2\VeraShell\VeraShell.stp' has been loaded successfully.
DESIGN : VeraShell; CHECKSUM : 88E9; ALG_VERSION : 1
programmer 'S2011JOFBA' : FlashPro5
Created FlashPro Express Job Project.
programmer 'S2011JOFBA' : Scan Chain...
programmer 'S2011JOFBA' : Check Chain...
Error: programmer 'S2011JOFBA' : Device 1: Found: MPF300XT, Expected: MPF300TS_ES
Error: programmer 'S2011JOFBA' : Scan and Check Chain FAILED.
Error: Failed to run Action.
Looking around a bit, I found another user with a similar issue but no resolution. I assume it must be some sort of bug? My FPGA does have MPF300XT engraved on the top. Hopefully someone more senior like @paul-walmsley-sifive can shed some light on this.
I've looked at the metadata between my original bitstream and the two available on the guide. The device and package notes match, only the dates don't.
HFU540_EXP_Bitstream_r10101.stp: NOTE "DATE" "2018/04/25";
HFU540_EXP_Bitstream_r20102_stpfile.stp: NOTE "DATE" "2018/09/11";
VeraShell.stp: NOTE "DATE" "2019/05/06";
HFU540_EXP_Bitstream_r10101.stp: NOTE "DEVICE" "MPF300TS_ES";
HFU540_EXP_Bitstream_r20102_stpfile.stp: NOTE "DEVICE" "MPF300TS_ES";
VeraShell.stp: NOTE "DEVICE" "MPF300TS_ES";
HFU540_EXP_Bitstream_r10101.stp: NOTE "PACKAGE" "MPF300TS_ES-fcg1152";
HFU540_EXP_Bitstream_r20102_stpfile.stp: NOTE "PACKAGE" "MPF300TS_ES-fcg1152";
VeraShell.stp: NOTE "PACKAGE" "MPF300TS_ES-fcg1152";
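In case it helps others check their own bitstreams, here is a small Python sketch of the same metadata comparison. It assumes the `.stp` files are plain-text STAPL with `NOTE "KEY" "VALUE";` statements, as in the lines above; the helper names `stapl_notes` and `notes_for_files` are made up for this example.

```python
import re
from pathlib import Path

# NOTE statements as they appear in the .stp files quoted in this thread,
# e.g.  NOTE "DEVICE" "MPF300TS_ES";
NOTE_RE = re.compile(r'NOTE\s+"(DATE|DEVICE|PACKAGE)"\s+"([^"]*)"\s*;')

def stapl_notes(text):
    """Return the DATE/DEVICE/PACKAGE notes found in STAPL text as a dict."""
    return dict(NOTE_RE.findall(text))

def notes_for_files(paths):
    """Map each file path to its notes (files are read as text)."""
    return {str(p): stapl_notes(Path(p).read_text(errors="replace")) for p in paths}

# Header notes for VeraShell.stp, copied from the listing above.
sample = (
    'NOTE "DATE" "2019/05/06";\n'
    'NOTE "DEVICE" "MPF300TS_ES";\n'
    'NOTE "PACKAGE" "MPF300TS_ES-fcg1152";\n'
)
notes = stapl_notes(sample)
print(notes)
```

Calling `notes_for_files` on the three `.stp` files would give one dict per file, which makes it easy to spot that only the dates differ while the device stays `MPF300TS_ES`.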
Looking around online, I found the issue mentioned in the Libero release notes:
FlashPro Express - MPF300T_ES or MPF300TS_ES Programming File Fails to Program a MPF300XT Device
In FlashPro Express PolarFire v2.2, the MPF300T_ES or MPF300TS_ES programming file cannot program a MPF300XT device, and vice versa.
Workarounds:
1. Export a STAPL file for the target device.
2. Export a STAPL file from Libero and use standalone FlashPro on Windows in single mode to program.
I was able to successfully program it using FP non-express and selecting the target device type. I'm currently in the process of testing the new bitstream and will report back on my findings.
@skylar-byte Could you share a picture of your FPGA package (the text on it should be readable)?
@davidlt Here is the picture requested.
I've been able to successfully compile multiple packages for a few weeks now with no issue. I believe this issue has been resolved.
I'm having an issue where I experience a hang on high CPU load such as compiling.
I've tried Linux 5.6.2 and 5.6.5 with OpenSBI 0.4 - 0.6 with the defconfig provided. I'm using the expansion board and this happens with NVMe or SATA drives. I didn't experience the issue on my previous setup (Linux 5.2.0, OpenSBI 0.4).
I've compared the device trees between the old and new versions, as they might have something to do with it.
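For reference, one way to do such a device tree comparison is to decompile each blob to source (for example with `dtc -I dtb -O dts -s`, where `-s` sorts nodes and properties) and diff the results. The Python sketch below only covers the diff step. The helper name `diff_dts` and the file names `old.dts` / `new.dts` are made up for this example, and the two fragments are invented purely to show the output shape; they are not taken from either kernel.

```python
import difflib

def diff_dts(old_dts, new_dts, old_name="old.dts", new_name="new.dts"):
    """Return a unified diff (list of lines) between two DTS source texts."""
    return list(difflib.unified_diff(
        old_dts.splitlines(), new_dts.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm=""))

# Made-up fragments for illustration only (NOT from the 5.2.0 or 5.6.5 trees).
old_dts = '/ {\n\tnode@0 {\n\t\tstatus = "okay";\n\t};\n};\n'
new_dts = '/ {\n\tnode@0 {\n\t\tstatus = "disabled";\n\t};\n};\n'

changes = diff_dts(old_dts, new_dts)
print("\n".join(changes))
```

With real data, `old_dts` and `new_dts` would be the contents of the two decompiled files; sorting during decompilation keeps reordered nodes from showing up as spurious differences.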