hdlguy / alinx

some files to test the new ALINX AXU2CG-E development board
MIT License
5 stars, 2 forks

Update DDR4 timing parameters according to reference implementation #1

Closed: wevieee closed this issue 1 year ago

wevieee commented 1 year ago

Hi,

Very nice project; thank you for your work.

After my board was delivered, I was able to get a reference Vivado project for the AXU2CG from Alinx. I noticed that its DDR4 timing parameters differ from the ones in this project.

I have attached a screenshot of the Alinx-configured parameters.

[Screenshot from 2023-01-04 22-39-08: Alinx-configured DDR4 parameters]

hdlguy commented 1 year ago

OK, I did a little memory testing with my DDR4 settings. The ones you show should be faster. If you get Linux booted on your module, can you run this sysbench command and see how fast it is?

$ sysbench --test=memory run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 102400M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 104857600 (737237.98 ops/sec)

102400.00 MB transferred (719.96 MB/sec)

Test execution summary:
    total time:                          142.2303s
    total number of events:              104857600
    total time taken by event execution: 114.7632
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  9.90ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           104857600.0000/0.00
    execution time (avg/stddev):   114.7632/0.00
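A quick sanity check on the log above: each operation writes one 1K block (104857600 operations give 102400.00 MB), so sysbench's "MB" here behaves like MiB (1024 KiB). Dividing ops/sec by 1024 should therefore reproduce the MB/sec line, which means the two figures are interchangeable when comparing runs. A minimal Python sketch; the function name is hypothetical:

```python
def ops_to_mib_per_s(ops_per_s: float, block_kib: float = 1.0) -> float:
    """Convert sysbench memory-test ops/sec to MiB/s.

    Assumes each operation moves one block of `block_kib` KiB,
    matching the "Memory block size: 1K" line in the log.
    """
    kib_per_s = ops_per_s * block_kib
    return kib_per_s / 1024.0  # 1 MiB = 1024 KiB


if __name__ == "__main__":
    # ops/sec from the sysbench 0.4.12 run above; the log reports 719.96 MB/sec
    print(f"{ops_to_mib_per_s(737237.98):.2f} MiB/s")
```

The same conversion matches the other runs in this thread, for example 1153084.24 ops/sec gives 1126.06 MiB/s.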
wevieee commented 1 year ago

Sure:

linaro@linaro-developer:~/sysbench/src$ uname -a
Linux linaro-developer 5.15.36-xilinx-v2022.2 #1 SMP Mon Oct 3 07:50:07 UTC 2022 aarch64 GNU/Linux
linaro@linaro-developer:~/sysbench/src$ ./sysbench memory run
sysbench 1.1.0-df89d34 (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 11530969 (1153084.24 per second)

11260.71 MiB transferred (1126.06 MiB/sec)

Throughput:
    events/s (eps):                      1153084.2368
    time elapsed:                        10.0001s
    total number of events:              11530969

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.19
         95th percentile:                        0.00
         sum:                                 3870.70

Threads fairness:
    events (avg/stddev):           11530969.0000/0.00
    execution time (avg/stddev):   3.8707/0.00

If you can explain the steps to generate the *.tcl files, I can create a pull request.

hdlguy commented 1 year ago

Hey, thanks for running that. There is a big increase in performance, more than 50%. I compared the ops/second between our runs.

N1=737237.98; N2=1153084.24; N2/N1
ans = 1.5641
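The same comparison as a small Python sketch (the helper name is hypothetical). Keep in mind that the two runs used different sysbench builds (0.4.12 vs. 1.1.0) and different run lengths, so the ratio mixes the DDR4 change with any difference between the tools:

```python
def speedup(before: float, after: float) -> float:
    """Return after/before for two throughput figures (higher is better)."""
    if before <= 0:
        raise ValueError("baseline throughput must be positive")
    return after / before


n1 = 737237.98   # ops/sec, hdlguy's board, sysbench 0.4.12
n2 = 1153084.24  # ops/sec, wevieee's board, sysbench 1.1.0
ratio = speedup(n1, n2)
print(f"ratio = {ratio:.4f}, change = {100 * (ratio - 1):+.1f}%")
```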

These settings are contained in the system.tcl file in the source folder. I generate that file from within the Vivado GUI: with the block diagram editor open and the design verified, I run the command "write_bd_tcl -force ../source/system.tcl". If you could generate a new system.tcl file, I'd like to try it on my Alinx setup. This is cool stuff.

wevieee commented 1 year ago

I have opened a pull request with my changes (https://github.com/hdlguy/alinx/pull/2).

hdlguy commented 1 year ago

Hey Wevieee, I saw your pull request. I want to get your changes into my repository. Thank you.

I just cloned your fork and I am compiling the FPGA and rebuilding Petalinux to make sure it works as I expect it will. I'm not really too familiar with such things on GitHub but I will figure it out.

hdlguy commented 1 year ago

Uh oh, I ran the same memory test with your DDR4 settings and actually got lower MB/s than before. It is hard to explain this. Maybe the kernel version changed because I compiled with Petalinux 2022.2. Anyway, my kernel matches yours.

Any ideas?

$ uname -a
Linux linaro-developer 5.15.36-xilinx-v2022.2 #1 SMP Mon Oct 3 07:50:07 UTC 2022 aarch64 GNU/Linux

~$ sysbench --test=memory run --max-time=40
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 102400M

Memory operations type: write
Memory scope type: global
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed: 24301748 (607540.21 ops/sec)

23732.18 MB transferred (593.30 MB/sec)

Test execution summary:
    total time:                          40.0002s
    total number of events:              24301748
    total time taken by event execution: 29.5463
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  0.10ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           24301748.0000/0.00
    execution time (avg/stddev):   29.5463/0.00
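sysbench reports throughput as a rate (operations divided by elapsed time), so a run stopped early by --max-time should still give a comparable ops/sec figure; the sysbench 1.1.0 runs in this thread also stop on a time limit (10 s, the 1.x default). Recomputing the rate from the totals above and comparing it with the earlier 737237.98 ops/sec is a quick way to size the slowdown. A sketch; the helper names are hypothetical:

```python
def ops_rate(total_ops: int, elapsed_s: float) -> float:
    """Throughput in operations per second."""
    return total_ops / elapsed_s


def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Totals from the time-limited sysbench 0.4.12 run above (new DDR4 settings)
rate_new = ops_rate(24301748, 40.0002)
rate_old = 737237.98  # ops/sec from the earlier run with the original settings

print(f"recomputed rate: {rate_new:.0f} ops/sec (log reports 607540.21)")
print(f"change vs. original settings: {percent_change(rate_old, rate_new):+.1f}%")
```

The recomputed rate agrees with the logged one to within about one op/sec, and the change works out to roughly -18%.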

hdlguy commented 1 year ago

This is repeatable. I switched back to the previous BOOT.BIN and got these results.

Are you running on a module with a xczu2cg-sfvc784 chip?

Out of curiosity, I ran that sysbench command on my Linux desktop and got 8223 MB/s, more than 10 times faster.

$ uname -a
Linux linaro-developer 5.15.19-xilinx-v2022.1 #1 SMP Thu May 12 09:05:30 UTC 2022 aarch64 GNU/Linux

$ sysbench --test=memory run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1K

Memory transfer size: 102400M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 104857600 (744678.80 ops/sec)

102400.00 MB transferred (727.23 MB/sec)

Test execution summary:
    total time:                          140.8092s
    total number of events:              104857600
    total time taken by event execution: 113.3516
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  9.34ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           104857600.0000/0.00
    execution time (avg/stddev):   113.3516/0.00
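The two runs with the original BOOT.BIN (737237.98 and 744678.80 ops/sec) differ by only about 1%, which suggests the roughly 18% drop with the new BOOT.BIN is well outside run-to-run noise on this board. A minimal sketch of that comparison, using only numbers from the logs and comments in this thread (the helper name is hypothetical):

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# hdlguy's board, original BOOT.BIN, sysbench 0.4.12 (ops/sec)
first_run = 737237.98
second_run = 744678.80
# Same board, BOOT.BIN built with the new DDR4 settings (ops/sec)
new_settings = 607540.21

noise = abs(percent_change(first_run, second_run))
drop = percent_change(first_run, new_settings)
print(f"run-to-run spread: {noise:.1f}%, change with new settings: {drop:+.1f}%")

# Desktop comparison quoted in the comment above (MB/sec)
print(f"desktop / board: {8223 / 727.23:.1f}x")
```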

wevieee commented 1 year ago

Oh wow.

I assume you generated a completely new design, with updated PLL parameters etc.? Yes, I also have an XCZU2CG-1SFVC784E.

The fact that you had to supply a time limit doesn't look too good ("Time limit exceeded, exiting...").

Can you put your "slow" BOOT.BIN on a branch? So I can test that one?

hdlguy commented 1 year ago

OK, GitHub complained about the large files, but it looks like it accepted them. The files I got from building Petalinux in your fork are committed here (back in my repo).

https://github.com/hdlguy/alinx/tree/main/petalinux/slowboot

These are the three files that I copy onto the SD Card.

$ ls -ltrah
total 72M
-rw-rw-r-- 1 pedro pedro  63M Jan  9 19:16 BOOT.BIN
-rw-rw-r-- 1 pedro pedro 2.8K Jan  9 19:16 boot.scr
drwxrwxr-x 2 pedro pedro 4.0K Jan  9 19:16 .
-rw-rw-r-- 1 pedro pedro 9.0M Jan  9 19:16 image.ub
drwxrwxr-x 3 pedro pedro 4.0K Jan  9 19:16 ..

To be clear, I cloned your fork. Then I recompiled the FPGA using the setup.tcl and compile.tcl scripts. I opened the Vivado project and verified that the Zynq DDR4 was running at DDR2400. Then I followed the instructions in the petalinux/readme.md file to create new boot files. I copied those files to the BOOT partition on the SD card, then booted with that card.

wevieee commented 1 year ago

Interesting, I also notice a difference, but not anywhere near the poor performance on your board.

linaro@linaro-developer:~/sysbench/src$ ./sysbench memory run
sysbench 1.1.0-df89d34 (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 10232647 (1023256.51 per second)

9992.82 MiB transferred (999.27 MiB/sec)

Throughput:
    events/s (eps):                      1023256.5102
    time elapsed:                        10.0001s
    total number of events:              10232647

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.27
         95th percentile:                        0.00
         sum:                                 3904.83

Threads fairness:
    events (avg/stddev):           10232647.0000/0.00
    execution time (avg/stddev):   3.9048/0.00

Where did you get your sysbench binary? Was it included in the image you are using? I compiled mine from source. Maybe we need to double-check the DDR4 chip part numbers?

hdlguy commented 1 year ago

Maybe my numbers are lower than yours because I have some stuff running in the background, like an Apache2 web server. I could be running other daemons as well. This is probably not an apples-to-apples comparison.

Still, I will merge in your fork. DDR2400 has to be the correct memory speed.
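Part of the apples-to-apples problem is that the two sysbench versions in this thread print their results differently: 0.4.12 reports `Operations performed: N (R ops/sec)`, while 1.1.0 reports `Total operations: N (R per second)`. A small parser that accepts both lines makes it easier to line up runs side by side. This is a sketch based only on the output shown in this thread; other sysbench versions may format these lines differently, and the function name is hypothetical:

```python
import re
from typing import Optional

# Matches the summary line of both sysbench output formats seen in this thread:
#   0.4.12: "Operations performed: 104857600 (737237.98 ops/sec)"
#   1.1.0:  "Total operations: 11530969 (1153084.24 per second)"
_OPS_LINE = re.compile(
    r"(?:Operations performed|Total operations):\s*(\d+)\s*"
    r"\(([\d.]+)\s*(?:ops/sec|per second)\)"
)


def parse_ops_per_sec(output: str) -> Optional[float]:
    """Return the ops/sec figure from sysbench memory-test output, or None."""
    match = _OPS_LINE.search(output)
    return float(match.group(2)) if match else None


runs = {
    "hdlguy, original BOOT.BIN (0.4.12)": "Operations performed: 104857600 (737237.98 ops/sec)",
    "hdlguy, BOOT.BIN from fork (0.4.12)": "Operations performed: 24301748 (607540.21 ops/sec)",
    "wevieee, own build (1.1.0)": "Total operations: 11530969 (1153084.24 per second)",
    "wevieee, slowboot BOOT.BIN (1.1.0)": "Total operations: 10232647 (1023256.51 per second)",
}
for label, text in runs.items():
    print(f"{label:40s} {parse_ops_per_sec(text):>12.2f} ops/sec")
```

Grouping the results by tool version keeps each comparison within a single sysbench build, which avoids mixing tool differences into the DDR4 comparison.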


From: wevieee. Sent: Wednesday, January 11, 2023 1:55 PM. To: hdlguy/alinx. Subject: Re: [hdlguy/alinx] Update DDR4 timing parameters according to reference implementation (Issue #1)


hdlguy commented 1 year ago

I merged the pull request and got this at the top of the git log.

commit 439f23728d665cd653a8a17d5376cd0c3a0bb7cb
Author: xxxxx xxxxx xxxxxxx@gmail.com
Date:   Sun Jan 8 18:26:38 2023 +0100

    Update to PLL and DDR4 parameters from Alinx reference implementation

I am hoping to use a module like this for our next project at my day job. It is an instrument that needs to have a web interface. At first I looked at the Xilinx Kria module, but that is mixed up with their so-called AI flow, and it is difficult to get control of the boot process. Also, the board-to-board connectors are exotic BGA devices.

There is only one thing I don't like about the ALINX modules: the board-to-board connectors are physically off grid. The mounting holes and board dimensions are round mm dimensions, but the connectors have dimensions like 3.572618945 mm. My guess is that they got nudged off grid during layout.

Anyway, if you decide to use the ALINX module, I would like to collaborate in any way I can.

Regards,

Pete