We figured out that the memory in all of the NUMA nodes is in the same tier:
$ ls /sys/devices/virtual/memory_tiering
memory_tier4  power  uevent
$ cat /sys/devices/virtual/memory_tiering/memory_tier4/nodelist
0-3
Hello Junsu and Jongmin,
Thank you for reporting this issue.
However, we have encountered an issue where all the memory devices, including the expanders, have the same memory tier (i.e., memory_tier), which may hinder automatic promotion and demotion.
If all NUMA nodes are on the same memory tier, promotion and demotion won't happen.
How can we create a new memory tier for the CXL devices and utilize them as second-tiered memory?
As far as I know, there is no way to change the tier of NUMA nodes other than applying custom patches when building your Linux kernel. Maybe we can share the simple patch that we've used for testing. @honggyukim, would that be okay?
Also, we are currently working on RFC v2 for LKML and it will include patches that enable users to set destination nodes for migrations regardless of the memory tier of the system. However, I'm not sure when we will post it. Still, I'll update you when it's available.
Thank you for your explanation and support!
Hi Junsu and Jongmin,
Thanks for the report. As mentioned by @hyeongtakji, the current HMSDK 2.0 won't work unless your system has a tiered memory setup.
We figured out that the memory in all of the NUMA nodes is in the same tier:
$ ls /sys/devices/virtual/memory_tiering
memory_tier4  power  uevent
$ cat /sys/devices/virtual/memory_tiering/memory_tier4/nodelist
0-3
If you want to make NUMA nodes 0 and 1 the first tier and nodes 2 and 3 the second tier, you can just use the following workaround change.
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 437441cdf78f..13f82b5d67e8 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -18,6 +18,7 @@
* the same memory tier.
*/
#define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
+#define MEMTIER_ADISTANCE_CXL (MEMTIER_ADISTANCE_DRAM * 5)
struct memory_tier;
struct memory_dev_type {
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 37a4f59d9585..3fdbc3c9bfa9 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -37,6 +37,7 @@ static DEFINE_MUTEX(memory_tier_lock);
static LIST_HEAD(memory_tiers);
static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
static struct memory_dev_type *default_dram_type;
+static struct memory_dev_type *default_cxl_type;
static struct bus_type memory_tier_subsys = {
.name = "memory_tiering",
@@ -484,7 +485,10 @@ static struct memory_tier *set_node_memory_tier(int node)
if (!node_state(node, N_MEMORY))
return ERR_PTR(-EINVAL);
- __init_node_memory_type(node, default_dram_type);
+ if (node < 2)
+ __init_node_memory_type(node, default_dram_type);
+ else
+ __init_node_memory_type(node, default_cxl_type);
memtype = node_memory_types[node].memtype;
node_set(node, memtype->nodes);
@@ -646,6 +650,9 @@ static int __init memory_tier_init(void)
default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
if (IS_ERR(default_dram_type))
panic("%s() failed to allocate default DRAM tier\n", __func__);
+ default_cxl_type = alloc_memory_type(MEMTIER_ADISTANCE_CXL);
+ if (IS_ERR(default_cxl_type))
+ panic("%s() failed to allocate default CXL tier\n", __func__);
/*
* Look at all the existing N_MEMORY nodes and add them to
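After rebooting into a kernel built with this change, you can check whether the split took effect via the same sysfs directory. The output should look roughly like the following; the exact tier number assigned to the CXL nodes depends on the adistance value, so treat memory_tier22 below as illustrative:
$ ls /sys/devices/virtual/memory_tiering
memory_tier4  memory_tier22  power  uevent
$ cat /sys/devices/virtual/memory_tiering/memory_tier4/nodelist
0-1
$ cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
2-3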
Also, we are currently working on RFC v2 for LKML and it will include patches that enable users to set destination nodes for migrations regardless of the memory tier of the system. However, I'm not sure when we will post it. Still, I'll update you when it's available.
I'm preparing this now. Hopefully I can post it by next week. I will share the patch here when it's posted.
Thanks.
CXL Expander: PCIe 5.0, each with 96 GB
I'm a bit worried about the case where you use 2 CXL expander cards. The current kernel change might not be able to find a proper promotion target for the second CXL node. This is due to the inaccuracy of node distances in the upstream kernel, and we should find a better way to handle this problem. Once we have an explicit destination setting in DAMON, this can be handled later.
For now, I would recommend testing your workload with a single CXL expander. More importantly, please make sure that your evaluation environment has enough cold memory that can be demoted to CXL memory. Demoting that cold memory makes enough room in DRAM for CXL-to-DRAM promotion.
In other words, if your system has a working set larger than your DRAM capacity, then you won't be able to see a benefit. We created a large amount of cold memory with an mmap program for evaluation; you can think of that mmap'ed cold memory as idle VMs in data centers.
Please see our evaluation environment for more explanation. https://github.com/skhynix/hmsdk/wiki/HMSDK-v2.0-Performance-Results
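For reference, such a cold memory generator can be as simple as the small C program sketched below. This is just a rough illustration, not the exact program we used, and the 64 GiB size is only an example; it should be larger than the amount of DRAM you want to free up.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* 64 GiB of anonymous memory; adjust to your DRAM size (example value). */
    size_t size = 64UL << 30;

    char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Touch every page once so the memory is actually populated in DRAM. */
    memset(buf, 1, size);

    printf("populated %zu bytes, now staying idle\n", size);

    /* Never touch the region again; the pages go cold and become demotion candidates. */
    for (;;)
        pause();

    return 0;
}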
Thank you for sharing the modification and experiment setup details.
We will modify the source code right away, before the new patches are posted, rebuild the kernel, and test with a single CXL expander.
Please let us know if you run into issues again. Thanks!
We have patched the kernel and have observed that the promotion and demotion work during our experiments!
We sincerely appreciate your help!
I'm glad to hear that it's working in your environment. Please don't hesitate to reach out if you have more issues later. Thanks.
Also, we are currently working on RFC v2 for LKML and it will include patches that enable users to set destination nodes for migrations regardless of the memory tier of the system. However, I'm not sure when we will post it. Still, I'll update you when it's available.
I'm preparing this now. Hopefully I can post it by next week. I will share the patch here when it's posted.
The RFC v2 patches are posted at https://lore.kernel.org/linux-mm/20240226140555.1615-1-honggyu.kim@sk.com. In this patch series, /sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>/target_nid is created to set the demotion/promotion target node ID explicitly. If it isn't set, memory tiering is used as a fallback.
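For example, assuming kdamond 0 with a single context and a single scheme, the target node could be set roughly like this (node 2 here is just an example; use the node ID of your CXL memory):
# echo 2 > /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/target_nid
# cat /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/target_nid
2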
If you're okay with the workaround patch above, then you don't need the v2 patches; I'm just sharing the recent update.
It seems the RFC v2 patches would provide much more flexibility for constructing a tiered memory system with multiple CXL devices, especially when considering NUMA topology (e.g., the 1st tier for nodes 0, 1, 2 and the 2nd tier for node 3, etc.).
Thank you for sharing the helpful information!
Hello,
We are currently testing HMSDK-2.0 with Hynix CXL devices. However, we have encountered an issue where all the memory devices, including the expanders, have the same memory tier (i.e., memory_tier), which may hinder automatic promotion and demotion.
How can we create a new memory tier for the CXL devices and utilize them as second-tiered memory?
Our environmental setup is as follows:
OS (kernel): Ubuntu 22.04.3 LTS (Linux 6.6.0-hmsdk2.0+)
CPU: Intel Xeon 4410Y (Sapphire Rapids) @ 2.0 GHz, 12 cores
Memory (Socket 0, 1): 32 GB DDR5-4000 MT/s, 128 GB total
CXL Expander: PCIe 5.0, each with 96 GB
Motherboard: Super X13DAI-T (supporting CXL 1.1, CXL Type 3 Legacy Enabled)
Thanks.