Closed · onceltl closed this issue 4 years ago

Excuse me, I have read your paper and code; it is very interesting. But I have a question about CLFLUSH_OPT/CLWB. In fact, CLFLUSH_OPT and CLWB can be reordered with respect to one another. In some indexes, shouldn't mfence() be added between cachelines instead of only adding mfence() at the end? For example, the FAST&FAIR paper explicitly mentions the need for an mfence() at each cacheline boundary. The original implementation of FAST&FAIR used only CLFLUSH, which does not cause problems, because CLFLUSH is not reordered. So, will using CLFLUSH_OPT/CLWB cause a bug? (The relevant implementation, your clflush wrapper, is quoted in the reply below.)
Thanks for your question. CLWBs (and CLFLUSH_OPTs) are allowed to be reordered with respect to the stores preceding the last 8-byte commit store. This optimization has been applied to all RECIPE conversions to reduce the cost per CLWB. For example, when we insert a new key into an index using copy-on-write, as in P-HOT, we first allocate a new object (say its size is 256 bytes), copy the existing keys and values from the old object to the new one, and then persist the new object by calling four CLWBs. Finally, we commit this modification by swapping the parent pointer to the new object and persisting it with one CLWB. In this series of stores and CLWBs, the only store (and its CLWB) that must be strictly ordered is the last store, which changes the parent pointer. It does not matter whether the persistent stores for the preceding 256 bytes are reordered among themselves, as long as all of them become durable before the last commit store. Therefore, we can allow the four CLWBs covering the 256-byte new object to be reordered with one another.
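To make the pattern concrete, here is a minimal sketch of the copy-on-write commit described above. It uses the _mm_clwb/_mm_sfence intrinsics rather than the repository's inline-asm wrapper, assumes 64-byte cachelines (so a 256-byte object takes four CLWBs), and all names (persist_unordered, cow_insert_commit, new_node, old_node, parent_slot) are hypothetical:

#include <immintrin.h>   /* _mm_clwb, _mm_sfence; build with -mclwb */
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64
#define NODE_SIZE 256    /* size of the new object in the example above */

/* Write back every cacheline covering a region. No fence separates the
 * CLWBs, so they are free to reorder among themselves. */
static void persist_unordered(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHE_LINE_SIZE)
        _mm_clwb((void *)p);
}

/* Copy-on-write commit: persist the new 256-byte node with four unordered
 * CLWBs, fence once, then publish and persist the 8-byte parent pointer. */
void cow_insert_commit(char *new_node, const char *old_node, char **parent_slot)
{
    memcpy(new_node, old_node, NODE_SIZE);   /* copy existing keys and values */
    persist_unordered(new_node, NODE_SIZE);  /* four CLWBs, order irrelevant  */
    _mm_sfence();     /* ensure the 256 bytes are durable before committing  */
    *parent_slot = new_node;                 /* the single 8-byte commit store */
    _mm_clwb(parent_slot);                   /* persist the parent pointer    */
    _mm_sfence();
}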
I know this optimization can be used in out-of-place modify operations. But can it also be used in in-place modify operations (such as in FAST&FAIR)? I found that your version of FAST&FAIR also uses this optimization.
Our implementation of FAST&FAIR (like the original implementation) calls a CLWB at every cacheline boundary, bracketed by two fences, when it shifts key-value pairs within a node. So we do not allow CLWBs on different cachelines to be reordered. In the insertion function, it calls a CLWB after identifying each cacheline boundary.
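For illustration, here is a minimal sketch (not the actual FAST&FAIR code) of flushing at each cacheline boundary during an in-place shift; entry_t, records, num_entries, and shift_and_flush are hypothetical names, and clflush is the wrapper quoted below:

#define CACHE_LINE_SIZE 64

typedef struct { unsigned long key; void *ptr; } entry_t; /* hypothetical layout */

void clflush(char *data, int len); /* the clflush wrapper quoted below */

/* Shift entries right by one to open a slot at 'pos'. Whenever the write
 * cursor starts a new cacheline, flush it immediately. Because the wrapper
 * brackets each call with mfence(), flushes of different cachelines are
 * strictly ordered with respect to one another. */
void shift_and_flush(entry_t *records, int num_entries, int pos)
{
    for (int i = num_entries - 1; i >= pos; --i) {
        records[i + 1] = records[i];
        if (((unsigned long)&records[i + 1] & (CACHE_LINE_SIZE - 1)) == 0)
            clflush((char *)&records[i + 1], sizeof(entry_t));
    }
}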
Here is the wrapper function for cacheline flushes in our provided FAST&FAIR implementation.
static inline void clflush(char *data, int len)
{
    /* Round down to the first cacheline containing 'data'. */
    volatile char *ptr = (char *)((unsigned long)data & ~(CACHE_LINE_SIZE - 1));
    mfence();
    for (; ptr < data + len; ptr += CACHE_LINE_SIZE) {
        /* Emulated NVM write latency: spin until 'etsc' after each flush. */
        unsigned long etsc = read_tsc() +
            (unsigned long)(write_latency_in_ns * CPU_FREQ_MHZ / 1000);
#ifdef CLFLUSH
        asm volatile("clflush %0" : "+m" (*(volatile char *)ptr));
#elif CLFLUSH_OPT
        /* The 0x66 prefix encodes CLFLUSHOPT on top of CLFLUSH. */
        asm volatile(".byte 0x66; clflush %0" : "+m" (*(volatile char *)ptr));
#elif CLWB
        /* The 0x66 prefix turns the XSAVEOPT encoding into CLWB. */
        asm volatile(".byte 0x66; xsaveopt %0" : "+m" (*(volatile char *)ptr));
#endif
        while (read_tsc() < etsc)
            cpu_pause();
    }
    mfence();
}
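Note that both mfence() calls live inside the wrapper, so every call to clflush() is ordered against the stores and flushes around it; only the flush instructions issued within a single call are free to reorder among themselves.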
For more details, please refer to the source code of our FAST&FAIR (link).
Thank you for your answer. I had overlooked those details.