litex-hub / linux-on-litex-vexriscv

Linux on LiteX-VexRiscv
BSD 2-Clause "Simplified" License

Linux slower as cpu-count increases #301

Open AlanVek opened 1 year ago

AlanVek commented 1 year ago

I built the project on a SmartFusion2 FPGA and was able to successfully run bios+opensbi+Linux+buildroot with cpu-count = 1, 2 and 4. However, as cpu-count increases, Linux seems to be getting slower. With cpu-count = 4, boot-up takes about 6 minutes and then it's almost unusable because of the slow response.

Because of memory limits used by other peripherals, I had to reduce both tlb size and cache size as I added more cpus. I'm wondering if that could be the issue or if there's something I'm doing wrong (maybe some software configuration).

The parameters I used were:

--cpu-count 1 --dcache-width 32 --icache-width 32 --dcache-ways 1 --icache-ways 1 --without-coherent-dma --dtlb-size 4 --itlb-size 4 --dcache-size 4096 --icache-size 4096

--cpu-count 2 --dcache-width 32 --icache-width 32 --dcache-ways 1 --icache-ways 1 --without-coherent-dma --dtlb-size 2 --itlb-size 2 --dcache-size 2048 --icache-size 2048

--cpu-count 4 --dcache-width 32 --icache-width 32 --dcache-ways 1 --icache-ways 1 --without-coherent-dma --dtlb-size 2 --itlb-size 2 --dcache-size 1024 --icache-size 1024

I guess the smaller cache/tlb would have an impact, but I can't help but think there could be another cause given the huge decrease in performance.
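For reference, the three flag sets above can be generated from just three numbers, which also makes the memory trade-off visible. This is only a sketch: the `CpuConfig` class and its method names are hypothetical helpers, not part of the project, and the byte count covers cache data storage only (tag RAM and TLB storage are ignored).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuConfig:
    """Hypothetical helper describing one of the builds listed above."""
    cpu_count: int
    tlb_size: int    # used for both --dtlb-size and --itlb-size
    cache_size: int  # bytes, used for both --dcache-size and --icache-size

    def flags(self) -> str:
        # Reproduces the flag string exactly as written in the post.
        return (
            f"--cpu-count {self.cpu_count} "
            "--dcache-width 32 --icache-width 32 "
            "--dcache-ways 1 --icache-ways 1 "
            "--without-coherent-dma "
            f"--dtlb-size {self.tlb_size} --itlb-size {self.tlb_size} "
            f"--dcache-size {self.cache_size} --icache-size {self.cache_size}"
        )

    def total_cache_bytes(self) -> int:
        # One I-cache plus one D-cache per core; data storage only.
        return self.cpu_count * 2 * self.cache_size


CONFIGS = [
    CpuConfig(cpu_count=1, tlb_size=4, cache_size=4096),
    CpuConfig(cpu_count=2, tlb_size=2, cache_size=2048),
    CpuConfig(cpu_count=4, tlb_size=2, cache_size=1024),
]

for cfg in CONFIGS:
    print(f"{cfg.total_cache_bytes():5d} bytes of cache data: {cfg.flags()}")
```

Under these assumptions, all three builds use the same total amount of cache data storage (8192 bytes); it is simply split across more cores, so each core sees a smaller cache.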

mithro commented 1 year ago

You might be interested in this @Dolu1990 and @kgugala

Dolu1990 commented 1 year ago

At what frequency is the core running? It is quite possible that the system is wasting most of its time servicing Linux timer ticks (250 Hz by default, which is fairly high).

--dtlb-size 2 --itlb-size 2 --dcache-size 1024 --icache-size 1024 is very, very low; the CPUs will spend most of their time waiting for memory reads and doing TLB refills. Is the FPGA running short of memory blocks?

If possible, avoid going below 4 DTLB and 4 ITLB entries.

AlanVek commented 1 year ago

Thanks for the response! The core is running at 100 MHz. I will try to synthesize with cpu-count = 1 and lower tlb/cache to check if I get the same performance.

The FPGA is running short on memory blocks because I have a lot of other peripherals using a lot of memory. I have to check if I can get cpu-count = 4 to run with tlb size = 4.

I'll let you know the results I get with those configurations.
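To put the tick-rate concern in numbers, here is a rough sketch of the cycle budget per timer tick at 100 MHz. The per-tick handler cost used below is a purely hypothetical figure chosen for illustration; it was not measured on this system, and the real cost depends heavily on cache and TLB misses and on memory latency.

```python
def cycles_per_tick(clock_hz: int, tick_hz: int) -> int:
    """Clock cycles available to one core between two timer ticks."""
    return clock_hz // tick_hz


def tick_overhead(clock_hz: int, tick_hz: int, handler_cycles: int) -> float:
    """Fraction of each tick period spent in the tick handler (capped at 1.0)."""
    return min(1.0, handler_cycles / cycles_per_tick(clock_hz, tick_hz))


CLOCK_HZ = 100_000_000                 # core frequency reported in the thread
HYPOTHETICAL_HANDLER_CYCLES = 200_000  # assumption for illustration, NOT measured

for tick_hz in (250, 100):
    budget = cycles_per_tick(CLOCK_HZ, tick_hz)
    share = tick_overhead(CLOCK_HZ, tick_hz, HYPOTHETICAL_HANDLER_CYCLES)
    print(f"{tick_hz} Hz: {budget} cycles per tick, {share:.0%} spent in the handler")
```

The point is only the scaling: the budget per tick is inversely proportional to the tick rate, so any fixed per-tick cost takes a 2.5 times smaller share of CPU time at 100 Hz than at 250 Hz.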

AlanVek commented 1 year ago

Hello, sorry for the delay. I tested the same build with the following arguments:

--cpu-count 1 --dcache-width 32 --icache-width 32 --dcache-ways 1 --icache-ways 1 --without-coherent-dma --dtlb-size 2 --itlb-size 2 --dcache-size 1024 --icache-size 1024

^ This configuration works fine (a little slower than the previous one with tlb_size=4 and cache_size=4096, but that is to be expected I guess). It takes about 45 seconds to boot.

Then I tried this configuration, following your tlb_size recommendation: --cpu-count 4 --dcache-width 32 --icache-width 32 --dcache-ways 1 --icache-ways 1 --without-coherent-dma --dtlb-size 4 --itlb-size 4 --dcache-size 1024 --icache-size 1024

^ This configuration still works as before: it takes a long time to boot (about 5-6 minutes) and then it barely works.

AlanVek commented 1 year ago

I also tried: --cpu-count 4 --dcache-width 32 --icache-width 32 --dcache-ways 1 --icache-ways 1 --without-coherent-dma --dtlb-size 4 --itlb-size 4 --dcache-size 4096 --icache-size 4096

With this configuration, boot-up went down to about 1.5 minutes and general behavior improved considerably, but I'm still getting better overall performance from builds with a lower cpu-count, regardless of tlb_size and cache_size.

Dolu1990 commented 1 year ago

If I remember correctly, the port for the SmartFusion2 has a very slow memory system (slow interconnect to the DDR). If that's the case, reducing the Linux tick rate would very likely help avoid too much overhead: CONFIG_HZ_100=y

AlanVek commented 1 year ago

I set CONFIG_HZ_100=y and now everything seems to be working smoothly with tlb_size = 4 and cache_size = 4096. Boot-up takes about 30 seconds and Linux itself works perfectly. Thanks so much! I'll try to run everything with a different DDR system (Fabric DDR), without having to go through the SmartFusion2 interconnect, and see if that brings further improvements.
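When rebuilding, it can be worth confirming that the tick-rate option actually ended up in the kernel configuration. Below is a minimal sketch that extracts the tick rate from the text of a kernel .config file. The function name `configured_hz` and the sample text are illustrative assumptions; only the CONFIG_HZ_100=y line comes from this thread.

```python
import re
from typing import Optional


def configured_hz(config_text: str) -> Optional[int]:
    """Return the tick rate found in kernel .config text, or None if absent.

    Prefers an explicit numeric CONFIG_HZ=<n> line and falls back to a
    CONFIG_HZ_<n>=y choice line.
    """
    explicit = None
    choice = None
    for raw in config_text.splitlines():
        line = raw.strip()
        match = re.fullmatch(r"CONFIG_HZ=(\d+)", line)
        if match:
            explicit = int(match.group(1))
            continue
        match = re.fullmatch(r"CONFIG_HZ_(\d+)=y", line)
        if match:
            choice = int(match.group(1))
    return explicit if explicit is not None else choice


# Illustrative sample, not copied from a real build.
SAMPLE = """\
# CONFIG_HZ_250 is not set
CONFIG_HZ_100=y
CONFIG_HZ=100
"""

print(configured_hz(SAMPLE))
```

Commented-out entries such as "# CONFIG_HZ_250 is not set" are ignored because only lines that match in full are accepted.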

mithro commented 1 year ago

@AlanVek - It would probably be great to document what you have been exploring here in a blog post or similar; it has been very interesting.