threefoldtech / zos

Autonomous operating system
https://threefold.io/host/
Apache License 2.0

ZOS disastrous performance on PCIE 4 NVME SSD #1467

Open archit3kt opened 2 years ago

archit3kt commented 2 years ago

Hello, following this forum post, which has had no news for a month, I thought it would be better to open this issue here.

Quick summary: ZOS has a terrible PCIe 4 SSD performance issue. Here are some fio test results on the current ZOS:

- Random read, 4k blocks: 12.4 MB/s, 4142 IOPS
- Random write, 4k blocks: 13.3 MB/s, 4489 IOPS
- Sequential read, 2MB blocks: 1316 MB/s, 864 IOPS
- Sequential write, 2MB blocks: 2326 MB/s, 1528 IOPS

I ran the same tests on the same machine with Ubuntu 20 and kernel 5.4: same results.

Fortunately, performance is very good on Ubuntu when switching to 5.10.x kernels :+1:

- Random read, 4k blocks: 1855 MB/s, 488,000 IOPS
- Random write, 4k blocks: 563 MB/s, 144,000 IOPS
- Sequential read, 2MB blocks: 6728 MB/s, 3360 IOPS
- Sequential write, 2MB blocks: 6271 MB/s, 3132 IOPS
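
For reference, commands along these lines cover the same four cases (the libaio engine, queue depths, file size and runtime below are illustrative defaults, not necessarily the exact flags used):

```sh
# Random 4k read/write and sequential 2MB read/write, direct I/O.
# Queue depths, size and runtime are illustrative, not the original settings.
fio --name=randread  --rw=randread  --bs=4k --ioengine=libaio --iodepth=32 \
    --direct=1 --size=4G --runtime=60 --time_based --group_reporting
fio --name=randwrite --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 \
    --direct=1 --size=4G --runtime=60 --time_based --group_reporting
fio --name=seqread   --rw=read      --bs=2M --ioengine=libaio --iodepth=8 \
    --direct=1 --size=4G --runtime=60 --time_based --group_reporting
fio --name=seqwrite  --rw=write     --bs=2M --ioengine=libaio --iodepth=8 \
    --direct=1 --size=4G --runtime=60 --time_based --group_reporting
```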

This answer was given to me:

"It’s not kernel related, if you run fio on your root filesystem of the container, you hit 0-fs , which is not made to be fast, specially for random read/write.

I get it, 0-fs is not meant to be fast, but being this slow would still be a big problem for a machine that only has one container running... I tried to deploy a flist with a data container mounted at /data and ran fio again; the results were strictly identical. I'm pretty sure the issue is kernel related.

Could you have a look, please? I cannot start hosting production workloads with such terrible IO performance...

archit3kt commented 2 years ago

New tests made on zos v3.0.1-rc3: better, but still way below what I should get.

Tests run on the rootfs of an Ubuntu zMachine:

- Random read, 4k blocks: 790 MB/s, 200,000 IOPS
- Random write, 4k blocks: 116 MB/s, 30,000 IOPS
- Sequential read, 2MB blocks: 1850 MB/s, 900 IOPS
- Sequential write, 2MB blocks: 900 MB/s, 450 IOPS

Note the performance regression on sequential write...

Tests on disks added to the zMachine and mounted at /data:

- Random read, 4k blocks: 630 MB/s, 160,000 IOPS
- Random write, 4k blocks: 190 MB/s, 50,000 IOPS
- Sequential read, 2MB blocks: 1200 MB/s, 600 IOPS
- Sequential write, 2MB blocks: 290 MB/s, 140 IOPS

It doesn't make sense! If an added disk should get native NVMe SSD performance, there is clearly a problem somewhere! Could someone please explain how the storage framework in zos v3 works?

xmonader commented 2 years ago

@maxux please take a look at it

muhamadazmy commented 2 years ago

I tried to deploy a flist with a data container mounted at /data and ran fio again; the results were strictly identical. I'm pretty sure the issue is kernel related.

Just to be clear about this part: do you mean that you mounted a volume at /data inside the container and then ran the fio tests on that location (/data)?

muhamadazmy commented 2 years ago

For V3 all container workloads are virtualized, which means all IO actually goes through the virtio driver. This explains the drop in performance.

What happens behind the scenes for V3: the guest's IO goes through the virtio driver to a virtual disk backed by a btrfs layer on the host, which in turn writes to the physical disk.

So IO operations go through this whole chain.

Of course there is a lot of room for improvement, for example using logical volumes on the host so that write operations on the host are sent directly to the physical disk instead of to another btrfs layer.
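
You can see this layering from inside the zmachine with something like the following (a rough check; the vda device name is just an example):

```sh
# Inside the zmachine: list block devices with their transport type.
lsblk -o NAME,SIZE,TYPE,TRAN,MOUNTPOINT

# Virtio disks usually show up as vda, vdb, ... and are bound to the
# virtio_blk driver (device name here is an example, not guaranteed):
readlink -f /sys/block/vda/device/driver
```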

archit3kt commented 2 years ago

Just to be clear about this part: do you mean that you mounted a volume at /data inside the container and then ran the fio tests on that location (/data)?

Yes, you got it

Of course there is a lot of room for improvement, for example using logical volumes on the host so that write operations on the host are sent directly to the physical disk instead of to another btrfs layer.

Thanks for the explanation. Indeed, the architectural choice you made is not the best for IO performance! It would be great to allow logical volume creation and mounting inside the VMs (at least for power users who'd like to get all the performance out of their hardware); see the sketch below. I would be glad to be a tester for this use case!
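
Roughly what I have in mind on the host side, as a hypothetical sketch (device and volume names are made up; this is not how ZOS currently provisions disks):

```sh
# Hypothetical host-side layout: carve a logical volume directly on the NVMe
# drive and hand it to the VM as a raw virtio-blk disk, so guest writes skip
# the extra btrfs layer on the host.
pvcreate /dev/nvme0n1                  # example device name
vgcreate zos-vms /dev/nvme0n1
lvcreate -L 500G -n vm-data zos-vms    # one LV per VM data disk
# /dev/zos-vms/vm-data would then be attached to the VM as its /data disk.
```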

For V3 all container workloads are virtualized, which means all IO actually goes through the virtio driver. This explains the drop in performance.

If I understand correctly, every ZOS deployment will be a VM in v3 (like k3s), and containers should be deployed in the virtualized k3s?

muhamadazmy commented 2 years ago

If I understand correctly, every ZOS deployment will be a VM in v3 (like k3s), and containers should be deployed in the virtualized k3s?

Yes, ZOS has a unified workload type called ZMACHINE which is always (under the hood) a VM. If your flist is a container (let's say an Ubuntu flist) we inject a custom-built kernel+initramfs and still start the "container" as a full VM. This ensures 100% separation from the ZOS host, and control over the amount of CPU and memory allocated to your resource. The user can still perfectly well access and run his processes inside this "container" as usual.
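
You can actually verify this from inside such a "container", for example (assuming a systemd-based image like the Ubuntu flist above; the exact hypervisor string reported may differ):

```sh
# Inside the deployed "container": confirm it really runs as a VM guest.
systemd-detect-virt                   # reports the virtualization type (e.g. "kvm")
grep -m1 hypervisor /proc/cpuinfo     # the hypervisor CPU flag is set inside a VM
```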

When you start a k8s node on zos, it's basically a well-crafted "flist" with k8s properly configured and ready to start. For ZOS it's just another VM that it runs the same way as a container (this makes the code much simpler).

maxux commented 2 years ago

Which image do you run exactly? Default zos runs a 5.4 kernel; there is also a 5.10 available. Can you give me the node id?

archit3kt commented 2 years ago

My first post was done with kernel 5.4 on grid v2. My second post was done with the latest zos for grid v3; I saw kernel 5.12 inside the VMs.

node id is 68, IP is 2a02:842a:84c8:c601:d250:99ff:fedf:924d (ICMP is blocked, but IPv6 firewall allows everything else)

maxux commented 2 years ago

I confirm, your node is running the 5.10.55 kernel, which is the latest we officially support. The limitation is probably the VM layer, as Azmy said.

archit3kt commented 2 years ago

FYI, I automated my fio tests and launched them simultaneously on X Ubuntu VMs (see the sketch below).

With up to 4 VMs, each VM gets exactly the same results as a run with only 1 VM.

I see per-VM performance degradation when I launch the test on 8 VMs.

My guess is that it is a virtio limitation; it would be good to know if you make some performance tweaks someday.

Still, sequential write is disastrous with virtio, and I don't have a clue why...
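
The parallel runs were done along these lines (a sketch; the hostnames and fio options below are placeholders, not the exact script):

```sh
# Launch the same fio job on several VMs in parallel and collect the logs.
# VM addresses are placeholders.
for host in vm1 vm2 vm3 vm4; do
  ssh root@"$host" \
    'fio --name=seqwrite --rw=write --bs=2M --ioengine=libaio --iodepth=8 \
         --direct=1 --size=4G --runtime=60 --time_based --group_reporting' \
    > "fio-$host.log" &
done
wait
```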

despiegk commented 2 years ago

This will have to wait; we have other things to do first.

amandacaster commented 1 year ago

Hello Team, can we have an update on this, please?