Closed wevieee closed 1 year ago
OK, I did a little memory testing with my DDR4 settings. The ones you show should be faster. If you get Linux booted on your module, can you run this sysbench command and see how fast you are?
$ sysbench --test=memory run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing memory operations speed test
Memory block size: 1K
Memory transfer size: 102400M
Memory operations type: write
Memory scope type: global
Threads started!
Done.
Operations performed: 104857600 (737237.98 ops/sec)
102400.00 MB transferred (719.96 MB/sec)
Test execution summary:
total time: 142.2303s
total number of events: 104857600
total time taken by event execution: 114.7632
per-request statistics:
min: 0.00ms
avg: 0.00ms
max: 9.90ms
approx. 95 percentile: 0.00ms
Threads fairness:
events (avg/stddev): 104857600.0000/0.00
execution time (avg/stddev): 114.7632/0.00
Sure:
linaro@linaro-developer:~/sysbench/src$ uname -a
Linux linaro-developer 5.15.36-xilinx-v2022.2 #1 SMP Mon Oct 3 07:50:07 UTC 2022 aarch64 GNU/Linux
linaro@linaro-developer:~/sysbench/src$ ./sysbench memory run
sysbench 1.1.0-df89d34 (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 11530969 (1153084.24 per second)
11260.71 MiB transferred (1126.06 MiB/sec)
Throughput:
events/s (eps): 1153084.2368
time elapsed: 10.0001s
total number of events: 11530969
Latency (ms):
min: 0.00
avg: 0.00
max: 0.19
95th percentile: 0.00
sum: 3870.70
Threads fairness:
events (avg/stddev): 11530969.0000/0.00
execution time (avg/stddev): 3.8707/0.00
If you can explain the steps to generate the *.tcl files, I can create a pull request.
Hey, thanks for running that. There is a big increase in performance, about 56%. I compared the ops/second between our runs.
N1=737237.98; N2=1153084.24; N2/N1 ans = 1.5641
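The same ratio can be reproduced from a shell prompt. This is only a minimal sketch: the helper name `speedup` is made up for illustration, and the two inputs are the ops/sec figures from the runs above. Keep in mind that the two runs used different sysbench versions (0.4.12 and 1.1.0), so the ratio is indicative rather than exact.

```shell
# Ratio of two sysbench throughput figures, printed to 4 decimals.
# "speedup" is a hypothetical helper name, not part of sysbench.
speedup() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.4f\n", after / before }'
}

# ops/sec from the two runs quoted in this thread
speedup 737237.98 1153084.24
```

A ratio of about 1.56 corresponds to an increase of roughly 56%.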
These settings are contained in the system.tcl file in the source folder. I generate that file from within the Vivado GUI. With the block diagram editor open and the design verified, I run the command "write_bd_tcl -force ../source/system.tcl". If you could generate a new system.tcl file, I'd like to try it on my Alinx setup. This is cool stuff.
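For anyone who prefers not to open the GUI, the export could probably be scripted as well. The sketch below is an assumption, not the workflow used in this repository: it writes a small Tcl script built from standard Vivado commands (`open_project`, `open_bd_design`, `validate_bd_design`, `write_bd_tcl`). The function name `make_export_script` and all paths in the usage comment are hypothetical placeholders.

```shell
# Write a Tcl script that opens a project, validates the block design
# and exports it with write_bd_tcl.
# Arguments: project file, block design file, output Tcl path, script path.
make_export_script() {
    project="$1"; bd_file="$2"; out_tcl="$3"; script="$4"
    cat > "$script" <<EOF
open_project {$project}
open_bd_design {$bd_file}
validate_bd_design
write_bd_tcl -force {$out_tcl}
EOF
}

# Usage (paths are placeholders, adjust to the actual project layout):
#   make_export_script ./proj.xpr ./system.bd ../source/system.tcl export_bd.tcl
#   vivado -mode batch -source export_bd.tcl
```

The relative output path is resolved against the directory Vivado is started from, so the script should be run from the same directory the GUI command would be run from.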
I have opened a pull-request with my changes (https://github.com/hdlguy/alinx/pull/2)
Hey Wevieee, I saw your pull request. I want to get your changes into my repository. Thank you.
I just cloned your fork and I am compiling the FPGA and rebuilding Petalinux to make sure it works as I expect it will. I'm not really too familiar with such things on GitHub but I will figure it out.
Uh oh, I ran the same memory test with your DDR4 settings and I actually get less MB/s than before. It is hard to explain this. Maybe the kernel version changed because I compiled with Petalinux 2022.2. Anyway, my kernel matches yours.
Any ideas?
$ uname -a
Linux linaro-developer 5.15.36-xilinx-v2022.2 #1 SMP Mon Oct 3 07:50:07 UTC 2022 aarch64 GNU/Linux
~$ sysbench --test=memory run --max-time=40
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing memory operations speed test
Memory block size: 1K
Memory transfer size: 102400M
Memory operations type: write
Memory scope type: global
Threads started!
Time limit exceeded, exiting...
Done.
Operations performed: 24301748 (607540.21 ops/sec)
23732.18 MB transferred (593.30 MB/sec)
Test execution summary:
total time: 40.0002s
total number of events: 24301748
total time taken by event execution: 29.5463
per-request statistics:
min: 0.00ms
avg: 0.00ms
max: 0.10ms
approx. 95 percentile: 0.00ms
Threads fairness:
events (avg/stddev): 24301748.0000/0.00
execution time (avg/stddev): 29.5463/0.00
This is repeatable. I switched back to the previous BOOT.BIN and got these results.
Are you running on a module with a xczu2cg-sfvc784 chip?
Out of curiosity, I ran that sysbench command on my Linux desktop and got 8223 MB/s, more than 10 times faster.
$ uname -a
Linux linaro-developer 5.15.19-xilinx-v2022.1 #1 SMP Thu May 12 09:05:30 UTC 2022 aarch64 GNU/Linux
$ sysbench --test=memory run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing memory operations speed test
Memory block size: 1K
Memory transfer size: 102400M
Memory operations type: write
Memory scope type: global
Threads started!
Done.
Operations performed: 104857600 (744678.80 ops/sec)
102400.00 MB transferred (727.23 MB/sec)
Test execution summary:
total time: 140.8092s
total number of events: 104857600
total time taken by event execution: 113.3516
per-request statistics:
min: 0.00ms
avg: 0.00ms
max: 9.34ms
approx. 95 percentile: 0.00ms
Threads fairness:
events (avg/stddev): 104857600.0000/0.00
execution time (avg/stddev): 113.3516/0.00
Oh wow.
I assume you generated a completely new design, with updated PLL parameters etc.? Yes, I also have an XCZU2CG-1SFVC784E.
The fact that you had to supply a time limit doesn't look too good ("Time limit exceeded, exiting...").
Can you put your "slow" BOOT.BIN on a branch? So I can test that one?
OK, GitHub complained about the large files, but it looks like it accepted them. The files I got from generating Petalinux in your fork are committed here (back in my repo).
https://github.com/hdlguy/alinx/tree/main/petalinux/slowboot
These are the three files that I copy onto the SD Card.
$ ls -ltrah
total 72M
-rw-rw-r-- 1 pedro pedro 63M Jan 9 19:16 BOOT.BIN
-rw-rw-r-- 1 pedro pedro 2.8K Jan 9 19:16 boot.scr
drwxrwxr-x 2 pedro pedro 4.0K Jan 9 19:16 .
-rw-rw-r-- 1 pedro pedro 9.0M Jan 9 19:16 image.ub
drwxrwxr-x 3 pedro pedro 4.0K Jan 9 19:16 ..
To be clear, I cloned your fork. Then I recompiled the fpga using the setup.tcl and compile.tcl scripts. I opened the Vivado project and verified that the Zynq DDR4 was running DDR2400. Then I followed the instructions in the petalinux/readme.md file to create new boot files. I copied those files to the BOOT partition on the SD card, then booted with that card.
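The last step, copying the boot files to the SD card, could be wrapped in a small helper so that a missing file is caught before booting. This is a sketch only: the function name `copy_boot_files` and the directories in the usage comment are assumptions, while the three file names are the ones listed above.

```shell
# Copy the three boot files to the BOOT partition, refusing to copy
# anything if one of them is missing from the source directory.
copy_boot_files() {
    src="$1"; dst="$2"
    for f in BOOT.BIN boot.scr image.ub; do
        [ -f "$src/$f" ] || { echo "missing: $src/$f" >&2; return 1; }
    done
    # sync flushes the writes before the card is removed
    cp "$src/BOOT.BIN" "$src/boot.scr" "$src/image.ub" "$dst"/ && sync
}

# Usage (both paths are placeholders for the build output directory
# and the mounted BOOT partition):
#   copy_boot_files ./slowboot /media/sdcard/BOOT
```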
Interesting, I also notice a difference, but not anywhere near the poor performance on your board.
linaro@linaro-developer:~/sysbench/src$ ./sysbench memory run
sysbench 1.1.0-df89d34 (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 10232647 (1023256.51 per second)
9992.82 MiB transferred (999.27 MiB/sec)
Throughput:
events/s (eps): 1023256.5102
time elapsed: 10.0001s
total number of events: 10232647
Latency (ms):
min: 0.00
avg: 0.00
max: 0.27
95th percentile: 0.00
sum: 3904.83
Threads fairness:
events (avg/stddev): 10232647.0000/0.00
execution time (avg/stddev): 3.9048/0.00
Where did you get your sysbench binary? Was it included in the repo you are using? I compiled mine from source. Maybe we need to double-check the DDR4 chip numbers?
Maybe my numbers are lower than yours because I have some stuff running in the background, like an Apache2 web server. I could be running other daemons as well. This is probably not an apples-to-apples comparison.
Still, I will merge in your fork. DDR2400 has to be the correct memory speed.
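To make the comparison a bit more like-for-like, the throughput figure can be pulled out of the output of either sysbench version with one filter. This is a minimal sketch; the function name `sysbench_throughput` is made up for illustration. It relies only on the "transferred (... MB/sec)" and "transferred (... MiB/sec)" lines shown in the runs above.

```shell
# Print the throughput figure (the number inside the parentheses) from
# sysbench memory-test output read on stdin. Handles both the
# "MB/sec" (0.4.12) and "MiB/sec" (1.1.0) spellings.
sysbench_throughput() {
    sed -n 's/.*transferred (\([0-9.]*\) Mi\{0,1\}B\/sec).*/\1/p'
}

# Lines taken from the two kinds of runs quoted in this thread
printf '102400.00 MB transferred (719.96 MB/sec)\n' | sysbench_throughput
printf '11260.71 MiB transferred (1126.06 MiB/sec)\n' | sysbench_throughput

# With sysbench installed, the commands from this thread could be piped in:
#   sysbench --test=memory run --max-time=40 | sysbench_throughput
#   ./sysbench memory run | sysbench_throughput
```

Two details from the logs are worth keeping in mind when comparing. In the 0.4.12 output, 104857600 operations of 1K each are reported as 102400.00 MB, so its "MB" is 1024-based, the same unit the 1.1.0 output calls MiB. Also, the 1.1.0 runs report "time elapsed: 10.0001s" and only about 11 GiB of the 102400 MiB total, so they stopped on a time limit as well rather than transferring the full amount.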
From: wevieee Sent: Wednesday, January 11, 2023 1:55 PM To: hdlguy/alinx Cc: HDLGuy; Comment Subject: Re: [hdlguy/alinx] Update DDR4 timing parameters according to reference implementation (Issue #1)
I merged the pull request and got this at the top of the git log.
commit 439f23728d665cd653a8a17d5376cd0c3a0bb7cb
Author: xxxxx xxxxx xxxxxxx@gmail.com
Date: Sun Jan 8 18:26:38 2023 +0100
Update to PLL and DDR4 parameters from Alinx reference implementation
I am hoping to use a module like this for our next project at my day job. It is an instrument that needs to have a web interface. At first I looked at the Xilinx Kria module, but that is mixed up with their so-called AI flow. It is difficult to get control of the boot process. Also, the board-to-board connectors are exotic BGA devices.
There is only one thing I don't like about the ALINX modules. The board-to-board connectors are physically off grid. The mounting holes and board dimensions are round mm dimensions, but the connectors have dimensions like 3.572618945 mm. My guess is that they got nudged off grid during layout.
Anyway, if you decide to use the ALINX module, I would like to collaborate in any way.
Regards,
Pete
Hi,
Very nice project; thank you for your work.
After delivery of my board I was able to get a reference Vivado project for the AXU2CG from Alinx. I noticed differences compared to the DDR4 timing parameters in this repository.
I attached a screenshot with Alinx-configured parameters.