pmem / pmdk

Persistent Memory Development Kit
https://pmem.io

libpmem-issue: cache line align :pmem_persist && 8 bytes atomic persistent store #5431

Closed James-tech-007 closed 1 year ago

James-tech-007 commented 2 years ago

Issue 2: cache line alignment and the 8-byte atomic persistent store

How can we make data cache-line aligned?

If the alignment requirement is satisfied, is a write of no more than 8 bytes to pmem an 8-byte atomic persistent store, both when it is caused by cache eviction and when it is caused by a pmem_persist() call?

description:

The book Programming Persistent Memory: A Comprehensive Guide for Developers, by Steve Scargall, says:

Byte-addressable memory guarantees atomicity of only a single write. For current processors, that is generally one 64-bit word (8-bytes) that should be aligned, but this is not a requirement in practice.

On Intel hardware, the atomic persistent store is 8 bytes. Thus, if the program or system crashes while an aligned 8-byte store to persistent memory is in-flight, on recovery those 8 bytes will either contain the old contents or the new contents.

question

So how can we ensure that the data in a cache line is aligned? And if it is aligned, in which situation does the 8-byte atomic persistent store happen: on a manual call to pmem_persist(), or on cache eviction?

guess

It seems that every time I use pmem_map_file(), it gives me a virtual address that is a multiple of 8. If I guarantee that every element I store is no bigger than 8 bytes and that its virtual address is a multiple of 8, can I say that this data will be 8-byte aligned in the cache line, and that writing it to pmem will be an atomic persistent store, regardless of whether the write is caused by a manual pmem_persist() call or by cache eviction (which is transparent to the programmer and may happen before the call to pmem_persist())? In other words, does this method ensure 8-byte alignment in the cache and make use of the 8-byte atomic persistent store provided by the hardware?
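
The alignment condition in this guess can be checked directly. The following is a minimal sketch, not libpmem code: `is_aligned` is a hypothetical helper, and `region` is an ordinary page-aligned array standing in for a mapped file (mmap-based mappings start on a page boundary, which is also a multiple of 8).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
static bool is_aligned(const void *p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Assumption: an ordinary array stands in for the mapped region.
// alignas(4096) mimics the page-aligned base address of a mapping.
alignas(4096) static unsigned char region[8192];

// Offsets that are multiples of 8 from an 8-byte-aligned base stay aligned.
static bool slot_is_aligned(std::size_t slot_index) {
    return is_aligned(region + 8 * slot_index, 8);
}
```

With an aligned base, `slot_is_aligned(100)` holds, while an address such as `region + 3` fails the check, so an 8-byte store there would not be an aligned store.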

example

// data_to_modify is data already stored on pmem
persist_ptr p_modify = null; // p_modify is stored on pmem too. It serves as a flag for detecting whether a crash happened while the data was being modified. It is not the libpmemobj type; it may be a user-defined structure that locates data on pmem, such as an unsigned int holding the offset from the start address returned by pmem_map_file(). **Assume that the size of persist_ptr is bigger than 8 bytes.**
// before modifying the data on pmem, first record its location
p_modify = locate(data_to_modify); // locate() returns the persist_ptr to the data on pmem
pmem_persist(&p_modify, sizeof(p_modify));
change_value(data_to_modify, key);
// change_value() modifies the data and persists it internally using pmem_persist()
p_modify = null; // the modification is finished
pmem_persist(&p_modify, sizeof(p_modify));

Suppose the crash happens between lines 4 and 5, i.e. after we modify the flag on pmem through its virtual address (p_modify = locate(data_to_modify);) and before we persist it with the pmem_persist() call. Because of cache eviction, part of p_modify may already be stored on pmem while the rest is not. The flag then holds an invalid record of the position of the data that was being modified at the time of the crash. This affects recovery after reboot: because the flag was not written to pmem atomically, we cannot locate the data that was being modified. However, if the 8-byte atomic persistent store also applies to cache eviction, and the size of the flag (a persist_ptr in the example above) is no bigger than 8 bytes, will the code above be correct?

A more concrete version of the example above:

#define null_persist_ptr 0
typedef uint64_t persist_ptr; // flag with a size of 8 bytes

struct Tree {
    //... (contains the persist_ptr member `root` used below)
};
struct Root {
    persist_ptr p_modify;
    struct Tree T1;
};

int main() {
    Root *pmemaddr;
    size_t mapped_len; // pmem_map_file() expects a size_t * here
    int is_pmem;
    // path, MAPLEN, key and change_value() are assumed to be defined elsewhere
    if ((pmemaddr = (Root *)pmem_map_file(path.c_str(), MAPLEN,
        PMEM_FILE_CREATE,
        0666, &mapped_len, &is_pmem)) == NULL) {
        cerr << "err pmem_map_file\n";
        exit(1);
    }
    // the Root structure is located at the start of the file on pmem
    pmemaddr->p_modify = pmemaddr->T1.root;
    // *** what will happen if a crash occurs here???
    pmem_persist(&(pmemaddr->p_modify), sizeof(persist_ptr));
    change_value(pmemaddr->p_modify, key);
    pmemaddr->p_modify = null_persist_ptr;
    pmem_persist(&(pmemaddr->p_modify), sizeof(persist_ptr));
}
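
The recovery side of this flag protocol could look roughly like the sketch below. It is only an illustration under stated assumptions: `persist_stub` is a hypothetical stand-in for pmem_persist() so that the logic can run without persistent memory, `guarded_update` and `needs_recovery` are made-up names, and a single `value` field stands in for the tree data. It does not answer whether the hardware makes the flag store atomic; it only shows how the flag is meant to be read after a restart.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef std::uint64_t persist_ptr;       // 8-byte flag, as in the example above
const persist_ptr null_persist_ptr = 0;

// Hypothetical stand-in for pmem_persist(); a real program would flush here.
static void persist_stub(const void *, std::size_t) {}

struct Root {
    persist_ptr p_modify;  // non-zero while a modification is in progress
    std::uint64_t value;   // stand-in for the data being modified
};

// Record the target, modify it, then clear the flag, persisting after each step.
static void guarded_update(Root *r, persist_ptr target, std::uint64_t new_value) {
    assert(target != null_persist_ptr);
    r->p_modify = target;
    persist_stub(&r->p_modify, sizeof(r->p_modify));
    r->value = new_value;
    persist_stub(&r->value, sizeof(r->value));
    r->p_modify = null_persist_ptr;
    persist_stub(&r->p_modify, sizeof(r->p_modify));
}

// On restart: a non-zero flag means the previous run stopped mid-modification.
static bool needs_recovery(const Root *r) {
    return r->p_modify != null_persist_ptr;
}
```

This scheme only works if the flag can never be observed half-written, which is exactly why the size and alignment of persist_ptr matter in the question.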

Last but not least, the 8-byte atomic persistent store is guaranteed by the hardware, and it seems that the PMDK library relies on it for its basic logging and atomic pointer updates, which ensure the consistency of the data on pmem. So if a hardware platform does not have such an instruction, will the PMDK library still work correctly?

James-tech-007 commented 2 years ago

[image: image-20220502221215773]

This picture illustrates my idea. I would really appreciate it if somebody could help me soon.

sscargal commented 2 years ago

@James-tech-007 Thanks for reading the book.

On Intel platforms, the cache line is 64 bytes, and Intel Optane persistent memory internally accesses its media in blocks of four (4) cache lines, so up to 256 bytes are read or written per media access. We get the maximum bandwidth when multiple application threads read/write 256 bytes or more at a time. The default size of a virtual memory page in the operating system is 4 KiB, though you can use 2 MiB or 1 GiB pages. The CPU breaks a 4 KiB page down into 4 KiB / 256 bytes = 16 memory operations.

With the exception of the new MOVDIR64B instruction, which writes 64 bytes atomically, all other machine instructions write 8 bytes atomically.

The data does not strictly have to be aligned, but if it is smaller than 8 bytes and not aligned, multiple operations may have to be performed to complete the request, which is not optimal. Alignment means the address is a multiple of 8 (address modulo 8 is zero). If the data size is less than 8 bytes and the data is aligned to an 8-byte boundary, you can pad it to ensure the next address is aligned. See Data structure alignment. It's up to the application developer to handle the data alignment and padding design and implementation.
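
As an illustration of the padding idea, here is a minimal C++ sketch. `align_up` and `SmallRecord` are hypothetical names used only for this example; the 3-byte payload is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: round `offset` up to the next multiple of `alignment`.
// `alignment` must be a power of two.
static std::size_t align_up(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// A 3-byte payload padded so that the next record starts on an 8-byte boundary.
// alignas(8) makes the compiler add 5 bytes of tail padding.
struct alignas(8) SmallRecord {
    std::uint8_t payload[3];
};

static_assert(sizeof(SmallRecord) == 8, "record occupies one 8-byte slot");
static_assert(alignof(SmallRecord) == 8, "record starts on an 8-byte boundary");
```

For example, `align_up(13, 8)` gives 16, the next offset at which an 8-byte value can be stored without crossing an 8-byte boundary.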

If your application writes less than 64 bytes at a time, you can either pad the data to a 64-byte-aligned size or allow the memory controller to perform write combining.
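
Padding a record to a full cache line can be sketched as follows. The names `CACHE_LINE`, `LineRecord` and `lines_touched` are assumptions made for this example, and the 64-byte value is the cache-line size stated above for Intel platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const std::size_t CACHE_LINE = 64;  // cache-line size stated above

// A 16-byte record padded to exactly one cache line, so that two records
// never share a line. alignas(64) rounds the size up to 64 bytes.
struct alignas(64) LineRecord {
    std::uint64_t key;
    std::uint64_t value;
};

static_assert(sizeof(LineRecord) == 64, "one record per cache line");
static_assert(alignof(LineRecord) == 64, "record starts on a line boundary");

// Hypothetical helper: number of cache lines touched by [addr, addr + len).
static std::size_t lines_touched(std::uintptr_t addr, std::size_t len) {
    assert(len > 0);
    std::uintptr_t first = addr / CACHE_LINE;
    std::uintptr_t last = (addr + len - 1) / CACHE_LINE;
    return static_cast<std::size_t>(last - first + 1);
}
```

An 8-byte write at offset 60 touches two lines, while the same write at offset 64 touches one, which is the cost that alignment and padding avoid.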

It's impractical to have all the data structures fit in 8 bytes, so you should use a higher-level library such as libpmemobj, which uses atomic operations and redo/undo logs to handle the situation where the application or system crashes between the start and end of the memory operation. libpmemobj can roll back or replay the transaction when the app starts again. The concept is no different from a database commit operation.
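
To make the roll-back idea concrete, here is a conceptual undo-log sketch for a single 64-bit field. It is not libpmemobj's implementation or API: `UndoLog`, `Pool`, `undo_begin`, `undo_commit` and `undo_recover` are hypothetical names, and the flushes that real persistent memory would need after each step are only marked in comments.

```cpp
#include <cassert>
#include <cstdint>

// Conceptual undo log for one 64-bit field (assumption: not libpmemobj code).
struct UndoLog {
    std::uint64_t valid;      // 8-byte flag: non-zero while old_value is needed
    std::uint64_t old_value;  // snapshot taken before the modification
};

struct Pool {
    UndoLog log;
    std::uint64_t data;
};

// Begin: snapshot the old value first, then mark the log as valid.
// On real pmem, each store would be followed by a flush such as pmem_persist().
static void undo_begin(Pool *p) {
    assert(p->log.valid == 0);
    p->log.old_value = p->data;
    p->log.valid = 1;
}

// Commit: the new data is in place, so the snapshot is no longer needed.
static void undo_commit(Pool *p) {
    p->log.valid = 0;
}

// Recovery on restart: roll back if a modification was still open.
static void undo_recover(Pool *p) {
    if (p->log.valid != 0) {
        p->data = p->log.old_value;
        p->log.valid = 0;
    }
}
```

The data being protected can be larger than 8 bytes; only the `valid` flag has to change state in one step.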

James-tech-007 commented 2 years ago

Thank you very much for letting me know what "aligned" means:

Alignment means the address is a multiple of 8 (address modulo 8 is zero).

If the data size is less than 8 bytes and the data is aligned to an 8-byte boundary, you can pad it to ensure the next address is aligned.

It's up to the application developer to handle the data alignment and padding design and implementation.

I didn't use libpmemobj because I'm doing research on algorithms for pmem, so efficiency is of great significance. I want to implement strategies such as copy-on-write and versioning at a low level by myself. I also found that libpmemobj depends heavily on transactions, which means it uses a lot of logging to record operations for redo and undo. I'm afraid that this will hurt the efficiency of my program.

I'm sorry that my question was not direct enough to answer. Let me give a more concrete demo.
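
A low-level copy-on-write scheme of the kind mentioned here might be sketched like this. It is an assumption-laden illustration, not a tested pmem design: `Versioned`, `cow_update` and `cow_read` are hypothetical names, `persist_stub` stands in for pmem_persist(), and the scheme relies on the aligned 8-byte store of `current` being atomic, as quoted from the book earlier in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for pmem_persist(); a real program would flush here.
static void persist_stub(const void *, std::size_t) {}

struct alignas(8) Versioned {
    std::uint64_t current;   // index of the live slot, changed by one 8-byte store
    std::uint64_t slots[2];  // two copies: one live, one scratch
};

// Copy-on-write update: fill the scratch slot, persist it, then flip `current`.
static void cow_update(Versioned *v, std::uint64_t new_value) {
    assert(v->current < 2);
    std::uint64_t scratch = 1 - v->current;
    v->slots[scratch] = new_value;
    persist_stub(&v->slots[scratch], sizeof(v->slots[scratch]));
    v->current = scratch;  // publish with a single aligned 8-byte store
    persist_stub(&v->current, sizeof(v->current));
}

// Readers always follow `current`, so they see either the old or the new version.
static std::uint64_t cow_read(const Versioned *v) {
    return v->slots[v->current];
}
```

The old version stays untouched until `current` is flipped, so no undo log is needed for this one field.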

Consider the 8-byte unsigned integer below:

[image: image-20220503094358765]

The program below stores it on pmem using the libpmem library. The questions arise at these points:

// first launch
    uint64_t data = 42497;
    pos_store = (char *)pmemaddr + ALIGN * BIAS; // void *pos_store
    uint64_t *pos = (uint64_t *)pos_store; // if pmemaddr is a multiple of 8, then pos is 8-byte aligned
    *pos = data;
    // what will happen if a crash occurs here? will only part of the uint64_t be written to pmem?
    pmem_persist((void *)pos, sizeof(uint64_t));

// second launch
    data = *pos;
    cout << "the data stored by first launch is: " << data;
    // can we be sure that the output here is either 0 or 42497?
    // or might it be 1 because only the lowest byte had been written to pmem at the time of the crash?

Full program:

#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
using namespace std;
#include <libpmem.h>
#define MAPLEN (4096 << 2) // parenthesized so the macro expands safely inside expressions
#define ALIGN 8
#define PATH "/mnt/pmem0/sjz/pool"
#define BIAS 100
size_t mapped_len; // pmem_map_file() takes a size_t * for the mapped length
int is_pmem;

// first launch
int main() {
    void *pmemaddr, *pos_store;
    if ((pmemaddr = pmem_map_file(PATH, MAPLEN,
        PMEM_FILE_CREATE,
        0666, &mapped_len, &is_pmem)) == NULL) {
        cerr << "err pmem_map_file\n";
        exit(1);
    }

    pmem_memset_persist(pmemaddr, 0, mapped_len); // there is no crash here

    uint64_t data = 42497;
    // cast to char * because arithmetic on void * is not standard C++
    pos_store = (char *)pmemaddr + ALIGN * BIAS; // is the value of pmemaddr a multiple of 8 when using pmem_map_file()?
    uint64_t *pos = (uint64_t *)pos_store; // if the above is true, then pos will be 8-byte aligned.
//****
    *pos = data;
    // what will happen if a crash occurs here? will only part of the uint64_t be written to pmem?
    pmem_persist((void *)pos, sizeof(uint64_t));
//****
    pmem_unmap(pmemaddr, mapped_len);
}

// second launch (a separate program that uses the same includes and definitions as the first)
int main() {
    void *pmemaddr, *pos_store;
    if ((pmemaddr = pmem_map_file(PATH, MAPLEN,
        PMEM_FILE_CREATE,
        0666, &mapped_len, &is_pmem)) == NULL) {
        cerr << "err pmem_map_file\n";
        exit(1);
    }
    uint64_t data;
    // cast to char * because arithmetic on void * is not standard C++
    pos_store = (char *)pmemaddr + ALIGN * BIAS;
    uint64_t *pos = (uint64_t *)pos_store;

//****
    data = *pos;
    cout << "the data stored by first launch is: " << data << endl;
    // can we be sure that the output here is either 0 or 42497?
    // or might it be 1 because only the lowest byte had been written to pmem at the time of the crash?
//****
    pmem_unmap(pmemaddr, mapped_len);
}

Could you please answer yes, no, or not sure, and then briefly explain the reason?

If my question is meaningless, please forgive me and let me know why. Thank you very much!

janekmi commented 1 year ago

Hi. I am not sure whether the conclusion was satisfactory to you. But it seems the discussion is so old it might no longer be relevant to you.

If you consider this question still important to you please reopen the issue and provide more context for your request so we can reassess its priority.