pmem / ndctl

A "device memory" enabling project encompassing tools and libraries for CXL, NVDIMMs, DAX, memory tiering and other platform memory device topics.

Using CXL to simulate pmem in QEMU, the write speed is very slow, only ~300 KB/s #257

Open 80ca86 opened 10 months ago

80ca86 commented 10 months ago

Environment: QEMU 8.1.1; host OS: CentOS 7.2; guest OS: CentOS 8 with Linux kernel 6.5; ndctl v78.

Execute on the host:

modprobe brd rd_nr=1 rd_size=16777216 max_part=0
mke2fs /dev/ram0
mount /dev/ram0 /tmp/pmem0
dd if=/dev/zero of=/tmp/pmem0/cxltest.raw bs=1M count=512
dd if=/dev/zero of=/tmp/pmem0/lsa.raw bs=1M count=256

Start QEMU:

qemu-system-x86_64 /root/CentOS-Stream-GenericCloud-8-latest.x86_64.qcow2 \
    -smp 4 \
    -m 4G \
    -net nic \
    -net tap,ifname=tap1,script=/etc/qemu-ifup,downscript=no \
    -vnc :0 \
    -daemonize \
    -enable-kvm \
    -machine type=q35,cxl=on \
    -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/pmem0/cxltest.raw,size=512M \
    -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/pmem0/lsa.raw,size=256M \
    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
    -device cxl-type3,bus=root_port13,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G

Execute in the virtual machine:

[root@localhost ~]# cxl create-region -d decoder0.0 -t pmem -m mem0
{
  "region":"region0",
  "resource":"0x190000000",
  "size":"512.00 MiB (536.87 MB)",
  "type":"pmem",
  "interleave_ways":1,
  "interleave_granularity":256,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder2.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region

[root@localhost ~]# ndctl create-namespace
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"502.00 MiB (526.39 MB)",
  "uuid":"53f3a16b-39d3-43de-95bf-ceb3d67f6d08",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}

[root@localhost ~]# fio -output=/root/fio_result.txt -name=100S100W -filename=/dev/pmem0 -ioengine=libaio -direct=1 -blocksize=4K -size=128M -rw=write -iodepth=8 -numjobs=1
[100.0%][w=319KiB/s][w=79 IOPS][eta 00m:00s]

fio uses 100% of the CPU. perf result: [perf profile screenshot]
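For anyone reproducing this, a call-graph profile like the one behind the perf screenshot can be captured inside the guest as sketched below; the output path is only illustrative, and the fio arguments simply mirror the job above:

# Record a call graph of the fio run, then inspect which functions dominate.
perf record -g -o /root/perf.data -- \
    fio --name=100S100W --filename=/dev/pmem0 --ioengine=libaio --direct=1 \
        --blocksize=4K --size=128M --rw=write --iodepth=8 --numjobs=1
perf report -i /root/perf.data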

Why is __memcpy_flushcache so slow?

victoryang00 commented 10 months ago

This is a QEMU implementation problem: the emulation does not exploit memory-level parallelism. Every request to the simulated CXL.mem device goes through the CXL.mem metadata handling and is interpreted as a VPA-to-DPA translation plus memory access.
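One hedged way to confirm that the overhead is in the emulated CXL data path rather than in the ramdisk-backed file is to expose the same backing file through QEMU's virtual NVDIMM, which is mapped directly into guest memory instead of being trapped per access. The flags below are a sketch under that assumption, not a configuration used in this issue:

# Same base options as the command above, with the CXL objects/devices removed.
qemu-system-x86_64 /root/CentOS-Stream-GenericCloud-8-latest.x86_64.qcow2 \
    -smp 4 -enable-kvm -machine type=q35,nvdimm=on \
    -m 4G,slots=2,maxmem=8G \
    -object memory-backend-file,id=nvmem1,share=on,mem-path=/tmp/pmem0/cxltest.raw,size=512M \
    -device nvdimm,memdev=nvmem1,id=nvdimm1

If the same fio job against the resulting guest pmem device runs far faster, the gap is attributable to the CXL type-3 emulation path rather than to the guest storage stack.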

victoryang00 commented 9 months ago

We fixed the MMIO backend problem and now get 250 MB/s for the QEMU CXL device. The problem is that the memory accesses are hidden inside the KVM code, so it would be tricky to add SSD emulation on top. Currently we are using it in fsdax mode; we will check the device speed in devdax mode as well. DAX just preallocates the memory for you, no other hacks.
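For the devdax check mentioned above, one possible sequence is the following; it assumes the namespace created earlier in this thread and an fio build that includes PMDK's dev-dax engine (both assumptions on my part):

# Reconfigure the existing namespace from fsdax to devdax (this destroys its contents).
ndctl create-namespace -f -e namespace0.0 -m devdax
# Repeat the 4K sequential-write job against the resulting character device.
fio --name=devdax-write --filename=/dev/dax0.0 --ioengine=dev-dax \
    --blocksize=4K --size=128M --rw=write --numjobs=1

The device name /dev/dax0.0 is what typically appears for namespace0.0 in devdax mode; it may differ on other setups.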