spdk / spdk

Storage Performance Development Kit
https://spdk.io/

Contiguous physical memory allocation in SPDK and DPDK by 1GB hugepages. #707

Closed shuhei-matsumoto closed 5 years ago

shuhei-matsumoto commented 5 years ago

Our team wants to allocate large blocks of physically contiguous memory (e.g. 16GB) by calling spdk_dma_malloc() with a 16GB size and reserving multiple 1GB hugepages at boot.

This worked with SPDK18.07 + DPDK18.02. However, it does not work with SPDK19.01 + DPDK18.08 or SPDK19.04pre + DPDK19.02.

SPDK18.07 + DPDK18.02

va: 0x00007f7240000000 pa: 0x0000000d40000000
va: 0x00007f7280000000 pa: 0x0000000d80000000
va: 0x00007f72c0000000 pa: 0x0000000dc0000000
va: 0x00007f7300000000 pa: 0x0000000e00000000
va: 0x00007f7340000000 pa: 0x0000000e40000000
va: 0x00007f7380000000 pa: 0x0000000e80000000
va: 0x00007f73c0000000 pa: 0x0000000ec0000000
va: 0x00007f7400000000 pa: 0x0000000f00000000
va: 0x00007f7440000000 pa: 0x0000000f40000000
va: 0x00007f7480000000 pa: 0x0000000f80000000

SPDK19.01 + DPDK18.08 or SPDK19.04pre + DPDK19.02

va: 0x00002000c0000000 pa: 0x0000000f40000000
va: 0x0000200100000000 pa: 0x0000000f00000000
va: 0x0000200140000000 pa: 0x0000000ec0000000
va: 0x0000200180000000 pa: 0x0000000e80000000
va: 0x00002001c0000000 pa: 0x0000000e40000000
va: 0x0000200200000000 pa: 0x0000000e00000000
va: 0x0000200240000000 pa: 0x0000000dc0000000
va: 0x0000200280000000 pa: 0x0000000d80000000
va: 0x00002002c0000000 pa: 0x0000000d40000000
va: 0x0000200300000000 pa: 0x0000000d00000000

Previously, 1GB pages were allocated in ascending order of physical address. Now they are allocated in descending order. As a result, the maximum physically contiguous size appears to be 1GB, but we need contiguity beyond 1GB.
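To make the ascending versus descending difference concrete, here is a minimal sketch in plain C that measures the longest physically contiguous run in a list of per-page physical addresses. The helper name `longest_contig_bytes` is hypothetical and is not part of SPDK or DPDK; it only assumes that `pa[i]` is the physical address backing the i-th page of a virtually contiguous buffer.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only (not an SPDK or DPDK API).
 * pa[i] is the physical address of the i-th page of a virtually
 * contiguous buffer. A run continues only when a page sits exactly
 * one page_size above the previous page. Returns the size in bytes of
 * the longest such run, or 0 for an empty list.
 */
static uint64_t
longest_contig_bytes(const uint64_t *pa, size_t count, uint64_t page_size)
{
    size_t best = 0;
    size_t run = 0;

    for (size_t i = 0; i < count; i++) {
        if (i > 0 && pa[i] == pa[i - 1] + page_size) {
            run++;      /* page follows its predecessor physically */
        } else {
            run = 1;    /* start a new run at this page */
        }
        if (run > best) {
            best = run;
        }
    }
    return (uint64_t)best * page_size;
}
```

Fed the ten physical addresses from the SPDK18.07 + DPDK18.02 log with a 1GB page size, this returns 10GB. Fed the ten addresses from the SPDK19.01 + DPDK18.08 log, it returns 1GB, because each page sits below the previous one.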

Expected Behavior

The behavior of SPDK18.07 + DPDK18.02

Current Behavior

The behavior of SPDK18.07 + DPDK18.02 is lost: with SPDK19.01 + DPDK18.08 or SPDK19.04pre + DPDK19.02, physical contiguity does not extend beyond 1GB.

Possible Solution

Any suggestion will be appreciated.

Steps to Reproduce

Allocate a sufficient number of 1GB hugepages.

Allocate a buffer larger than 1GB with spdk_memzone_reserve_aligned("contig", 10 GB, SPDK_ENV_SOCKET_ID_ANY, 0, GB); or spdk_dma_malloc(10 GB, GB, NULL);

Context (Environment including OS version, SPDK version, etc.)

SPDK18.07 + DPDK18.02
SPDK19.01 + DPDK18.08
SPDK19.04pre + DPDK19.02

jimharris commented 5 years ago

Hi @shuhei-matsumoto,

Can you try the spdk_memzone APIs instead? We needed to make a lot of changes in the memory allocation areas of SPDK to leverage the DPDK dynamic memory allocation support that went into DPDK 18.05. The spdk_memzone APIs should still enable allocation of physical/IOVA contiguous buffers.

-Jim

shuhei-matsumoto commented 5 years ago

Hi @jimharris @darsto

Thanks for the comment. I tried the spdk_memzone API but the result was the same. I don't think this is an SPDK issue, and I expect the recommendation will be to use the IOMMU.

We may have to maintain our own DPDK patches, but could you give us any feedback you can?

Thanks, Shuhei

jimharris commented 5 years ago

Sorry @shuhei-matsumoto - I see now that you had already tried the spdk_memzone APIs in your original report.

Could you try a couple of experiments for me?

1) specify the memory size when starting the application - for example "-s 16384"
2) in lib/env_dpdk/init.c, try enabling the --legacy-mem option (just remove the surrounding #ifdef around line 281)

shuhei-matsumoto commented 5 years ago

@jimharris

Thanks for the suggestion. I tried both, but they did not resolve the issue. I found the presentation and the patch series by Anatoly Burakov and Bruce Richardson.

https://www.dpdk.org/wp-content/uploads/sites/35/2018/10/pm-01-2018-DPDK-Userspace-Memory.pdf

The patch series is large and will take time to understand. Would it be reasonable to ask them a question first?

[ DPDK EAL parameters: spdk --no-shconf -c 0x1 -m 16384 --legacy-mem --log-level=lib.eal:6 --base-virtaddr=0x200000000000 --file-prefix=spdk_pid25567 ]
va: 0x0000200180000000 pa: 0x0000000e80000000
va: 0x00002001c0000000 pa: 0x0000000e40000000
va: 0x0000200200000000 pa: 0x0000000e00000000
va: 0x0000200240000000 pa: 0x0000000dc0000000
va: 0x0000200280000000 pa: 0x0000000d80000000
va: 0x00002002c0000000 pa: 0x0000000d40000000
va: 0x0000200300000000 pa: 0x0000000d00000000
va: 0x0000200340000000 pa: 0x0000000cc0000000
va: 0x0000200380000000 pa: 0x0000000c80000000
va: 0x00002003c0000000 pa: 0x0000000c40000000

diff --git a/lib/env_dpdk/init.c b/lib/env_dpdk/init.c

index f459d10..2fd304e 100644
--- a/lib/env_dpdk/init.c
+++ b/lib/env_dpdk/init.c
@@ -131,7 +131,8 @@ spdk_env_opts_init(struct spdk_env_opts *opts)
        opts->name = SPDK_ENV_DPDK_DEFAULT_NAME;
        opts->core_mask = SPDK_ENV_DPDK_DEFAULT_CORE_MASK;
        opts->shm_id = SPDK_ENV_DPDK_DEFAULT_SHM_ID;
-       opts->mem_size = SPDK_ENV_DPDK_DEFAULT_MEM_SIZE;
+/*     opts->mem_size = SPDK_ENV_DPDK_DEFAULT_MEM_SIZE; */
+       opts->mem_size = 16384;
        opts->master_core = SPDK_ENV_DPDK_DEFAULT_MASTER_CORE;
        opts->mem_channel = SPDK_ENV_DPDK_DEFAULT_MEM_CHANNEL;
 }
@@ -278,13 +279,13 @@ spdk_build_eal_cmdline(const struct spdk_env_opts *opts)
                }
        }

-#if RTE_VERSION >= RTE_VERSION_NUM(18, 05, 0, 0) && RTE_VERSION < RTE_VERSION_NUM(18, 5, 1, 0)
+/* #if RTE_VERSION >= RTE_VERSION_NUM(18, 05, 0, 0) && RTE_VERSION < RTE_VERSION_NUM(18, 5, 1, 0) */
        /* Dynamic memory management is buggy in DPDK 18.05.0. Don't use it. */
        args = spdk_push_arg(args, &argcount, _sprintf_alloc("--legacy-mem"));
        if (args == NULL) {
                return -1;
        }
-#endif
+/* #endif */

        if (opts->num_pci_addr) {
                size_t i;
@@ -335,12 +336,12 @@ spdk_build_eal_cmdline(const struct spdk_env_opts *opts)
         * physically or IOVA contiguous memory regions, then when we go to allocate a buffer pool, it can
         * the memory for a buffer over two allocations meaning the buffer will be split over a memory regi
         */
-#if RTE_VERSION >= RTE_VERSION_NUM(19, 02, 0, 0)
+/*#if RTE_VERSION >= RTE_VERSION_NUM(19, 02, 0, 0)
        args = spdk_push_arg(args, &argcount, _sprintf_alloc("%s", "--match-allocations"));
        if (args == NULL) {
                return -1;
        }
-#endif
+#endif*/

        if (opts->shm_id < 0) {
                args = spdk_push_arg(args, &argcount, _sprintf_alloc("--file-prefix=spdk_pid%d",

The test code our team uses is the following:

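#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

#include <rte_memory.h>

#include "spdk/env.h"
#include "spdk/log.h"

/* Print the physical address backing each stride-sized step of the buffer. */
void print_virt2phy(const void *vaddr, off_t stride, int count)
{
    /* Use a byte pointer: arithmetic on void * is a compiler extension. */
    const uint8_t *p = vaddr;
    int i;
    uint64_t paddr;

    for (i = 0; i < count; i++) {
        paddr = rte_mem_virt2phy(p);
        printf("va: 0x%016" PRIxPTR " pa: 0x%016" PRIx64 "\n", (uintptr_t)p, paddr);
        p += stride;
    }
}

int main(void)
{
    struct spdk_env_opts opts;

    spdk_env_opts_init(&opts);
    if (spdk_env_init(&opts) < 0) {
        SPDK_ERRLOG("Unable to initialize SPDK env\n");
        return 1;
    }

    size_t page  = 1 * (1024 * 1024 * 1024ULL);  /* one 1GB hugepage */
    size_t size  = 10 * page;
    size_t align = page;

    void *vaddr = spdk_memzone_reserve_aligned("contig", size, SPDK_ENV_SOCKET_ID_ANY, 0x00100200, align);
    /*    void *vaddr = spdk_dma_malloc(size, align, NULL); */
    if (vaddr == NULL) {
        SPDK_ERRLOG("spdk_memzone_reserve_aligned failed\n");
        return 1;
    }

    /* One line per 1GB page of the buffer. */
    print_virt2phy(vaddr, page, size / page);

    return 0;
}

shuhei-matsumoto commented 5 years ago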

@jimharris

Interesting observation. I integrated DPDK 18.02 but the result was the same. However, I have not spent enough time on this issue, so I may have made a mistake.

Starting SPDK v19.04-pre / DPDK 18.02.2 initialization...
[ DPDK EAL parameters: spdk --no-shconf -c 0x1 -m 16384 --base-virtaddr=0x200000000000 --file-prefix=spdk_pid6886 ]
EAL: Detected 12 lcore(s)
EAL: Multi-process socket /var/run/.spdk_pid6886_unix
EAL: Probing VFIO support...
va: 0x0000200140000000 == pa: 0x0000000e80000000
va: 0x0000200180000000 == pa: 0x0000000e40000000
va: 0x00002001c0000000 == pa: 0x0000000e00000000
va: 0x0000200200000000 == pa: 0x0000000dc0000000
va: 0x0000200240000000 == pa: 0x0000000d80000000
va: 0x0000200280000000 == pa: 0x0000000d40000000
va: 0x00002002c0000000 == pa: 0x0000000d00000000
va: 0x0000200300000000 == pa: 0x0000000cc0000000
va: 0x0000200340000000 == pa: 0x0000000c80000000
va: 0x0000200380000000 == pa: 0x0000000c40000000

darsto commented 5 years ago

I'd expect physical addresses to be sorted in ascending order. Thanks for the example app - I'll try to debug this.

darsto commented 5 years ago

Bug scrub: @benlwalker says that's just how new DPDK behaves.

shuhei-matsumoto commented 5 years ago

Hi @darsto @benlwalker,

Thank you for discussing this in the bug scrub meeting. I will first investigate the internals of the new DPDK as far as I can.

shuhei-matsumoto commented 5 years ago

Hi @darsto CC: @benlwalker @jimharris

Our team found the root cause. The patch https://gerrithub.io/c/spdk/spdk/+/423491 prevents physically contiguous memory allocation larger than 1GB, regardless of the DPDK version.

We have a question: why does SPDK not support a platform on which RTE_IOVA_PA is the only supported mode?

I admit that I did not understand the patch when I reviewed it, and I just want to understand it now. Your feedback would be very much appreciated.

Thanks, Shuhei

shuhei-matsumoto commented 5 years ago

I am adding a little information about our analysis.

eal_memalloc_is_contig always returns true when the mode is RTE_IOVA_VA, so DPDK does not try to make the addresses physically contiguous. Therefore, one of our use cases requires the mode to be set to RTE_IOVA_PA.
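The effect of that early return can be pictured with a small standalone model. This is a sketch of the behaviour described in this comment, not DPDK's actual implementation: the enum and the function `model_is_contig` are hypothetical stand-ins for DPDK's IOVA mode and eal_memalloc_is_contig, and the page list is passed in directly instead of being looked up from DPDK's memory segments.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for DPDK's IOVA mode; not the real enum. */
enum iova_mode_model {
    MODEL_IOVA_PA,  /* IOVA addresses are physical addresses */
    MODEL_IOVA_VA   /* IOVA addresses are virtual addresses */
};

/*
 * Simplified model (an assumption for illustration) of a contiguity
 * check that trusts virtual contiguity in VA mode and only inspects
 * physical addresses in PA mode.
 */
static bool
model_is_contig(enum iova_mode_model mode, const uint64_t *pa,
                size_t count, uint64_t page_size)
{
    if (mode == MODEL_IOVA_VA) {
        /* IOVA == VA: the physical layout is never examined. */
        return true;
    }

    for (size_t i = 1; i < count; i++) {
        if (pa[i] != pa[i - 1] + page_size) {
            return false;   /* physical gap or reordering */
        }
    }
    return true;
}
```

In this model, a buffer whose 1GB pages are in descending physical order is accepted in VA mode and rejected in PA mode, which matches the observation that physical contiguity beyond 1GB is only enforced when the mode is RTE_IOVA_PA.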

darsto commented 5 years ago

@shuhei-matsumoto I pushed a patch to remove that extra rte_bus. It doesn't provide any value to SPDK right now and only breaks your use case.

Please see here: https://review.gerrithub.io/c/spdk/spdk/+/448121

shuhei-matsumoto commented 5 years ago

@darsto Thank you so much! Now I understand why you needed this patch.

It will be great for us if we can allocate large contiguous memory without any local changes, and I believe you will find a final solution for the non-privileged mode.

Our team will double-check and post the result here. Then I will review the change on Gerrit.

Thanks, Shuhei.

shuhei-matsumoto commented 5 years ago

@darsto @jimharris

The patch worked fine, but I needed to add --legacy-mem. Without --legacy-mem, the addresses were almost contiguous, but the last 2 entries were not contiguous in my trial. I think we need a separate option to pass --legacy-mem at SPDK startup to guarantee this behavior.

[root@node1 1_test_no_contig]# ./test_paddr_not_contig
Starting SPDK v19.04-pre / DPDK 19.02.0 initialization...
[ DPDK EAL parameters: spdk --no-shconf -c 0x1 -m 16384 --base-virtaddr=0x200000000000 --legacy-mem --file-prefix=spdk_pid26331 ]
EAL: Detected 12 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Probing VFIO support...
EAL: VFIO support initialized
va: 0x00002001c0000000 == pa: 0x00000002c0000000
va: 0x0000200200000000 == pa: 0x0000000300000000
va: 0x0000200240000000 == pa: 0x0000000340000000
va: 0x0000200280000000 == pa: 0x0000000380000000
va: 0x00002002c0000000 == pa: 0x00000003c0000000
va: 0x0000200300000000 == pa: 0x0000000400000000
va: 0x0000200340000000 == pa: 0x0000000440000000
va: 0x0000200380000000 == pa: 0x0000000480000000
va: 0x00002003c0000000 == pa: 0x00000004c0000000
va: 0x0000200400000000 == pa: 0x0000000500000000

[root@node1 1_test_no_contig]# ./test_paddr_not_contig
Starting SPDK v19.04-pre / DPDK 19.02.0 initialization...
[ DPDK EAL parameters: spdk --no-shconf -c 0x1 -m 16384 --base-virtaddr=0x200000000000 --file-prefix=spdk_pid16602 ]
EAL: Detected 12 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Probing VFIO support...
EAL: VFIO support initialized
va: 0x0000200180000000 == pa: 0x0000000300000000
va: 0x00002001c0000000 == pa: 0x0000000340000000
va: 0x0000200200000000 == pa: 0x0000000380000000
va: 0x0000200240000000 == pa: 0x00000003c0000000
va: 0x0000200280000000 == pa: 0x0000000400000000
va: 0x00002002c0000000 == pa: 0x0000000440000000
va: 0x0000200300000000 == pa: 0x0000000480000000
va: 0x0000200340000000 == pa: 0x00000004c0000000
va: 0x0000200380000000 == pa: 0x0000000200000000
va: 0x00002003c0000000 == pa: 0x0000000180000000
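Spotting where contiguity breaks in output like this is easier with a small checker. The following is a minimal sketch in plain C, assuming each log line has the form `va: 0x... == pa: 0x...` or `va: 0x... pa: 0x...` as printed in this thread. The function names `parse_va_pa` and `first_break` are hypothetical helpers, not SPDK or DPDK APIs.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse one "va: ... pa: ..." log line, with or
 * without the "==" separator. Returns 0 on success, -1 if the line
 * does not match either form.
 */
static int
parse_va_pa(const char *line, uint64_t *va, uint64_t *pa)
{
    if (sscanf(line, "va: %" SCNx64 " == pa: %" SCNx64, va, pa) == 2) {
        return 0;
    }
    if (sscanf(line, "va: %" SCNx64 " pa: %" SCNx64, va, pa) == 2) {
        return 0;
    }
    return -1;
}

/*
 * Hypothetical helper: return the index of the first line whose
 * physical address is not exactly page_size above the previous line's,
 * -1 if every line is contiguous, or -2 if a line cannot be parsed.
 */
static long
first_break(const char *const *lines, size_t count, uint64_t page_size)
{
    uint64_t va, pa, prev_pa = 0;

    for (size_t i = 0; i < count; i++) {
        if (parse_va_pa(lines[i], &va, &pa) != 0) {
            return -2;
        }
        if (i > 0 && pa != prev_pa + page_size) {
            return (long)i;
        }
        prev_pa = pa;
    }
    return -1;
}
```

Applied to the ten va/pa lines of the run without --legacy-mem and a 1GB page size, `first_break` returns 8, the first of the two trailing entries that fall out of order. Applied to the ten lines of the run with --legacy-mem, it returns -1.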

shuhei-matsumoto commented 5 years ago

I created a patch to add a --legacy-mem option: https://review.gerrithub.io/#/c/spdk/spdk/+/448263/ I think we can close this issue once Darek's patch is merged. (If a new issue arises, we will re-open this one or open a new issue.)

shuhei-matsumoto commented 5 years ago

One more comment about --legacy-mem. Our analysis shows that without --legacy-mem, whether the allocation is physically contiguous depends on luck, while with --legacy-mem, DPDK ensures physically contiguous allocation. The maximum size is limited to 32GB in DPDK19.02, but this is OK for now.