Closed · onceltl closed this issue 4 years ago

Excuse me, I have read your paper and code; it is very interesting. But I have a question about CLFLUSH_OPT/CLWB. In fact, CLFLUSH_OPT and CLWB can be reordered with respect to one another. In some indexes, shouldn't mfence() be added between cachelines instead of only adding mfence() at the end? For example, the FAST&FAIR paper explicitly mentions the need for an mfence() at each cacheline boundary. The original implementation of FAST&FAIR used only CLFLUSH, which does not cause problems, because CLFLUSH is not reordered. So, will using CLFLUSH_OPT/CLWB cause a bug? (The relevant implementation, your clflush wrapper, is quoted in the reply below.)
Thanks for your question. CLWBs (and CLFLUSH_OPTs) are allowed to be reordered with respect to the stores preceding the last 8-byte commit store. This optimization has been applied to all RECIPE conversions to reduce the cost per CLWB. For example, when we insert a new key into an index using copy-on-write, as in P-HOT, we first allocate a new object (say its size is 256 bytes), copy the existing keys and values from the old object to the new one, and then persist the new object by calling four CLWBs. Finally, we commit this modification by swapping the parent pointer to the new object and persisting it with one CLWB. In this series of stores and CLWBs, the only store (and its CLWB) that must be strictly ordered is the last store, which changes the parent pointer. It does not matter whether the persistent stores for the preceding 256 bytes are reordered among themselves, as long as all of them become durable before the last commit store. Therefore, we can allow the four CLWBs covering the 256-byte new object to be reordered with one another.
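To make the pattern concrete, here is a minimal sketch of the copy-on-write commit described above. It uses the _mm_clwb/_mm_sfence intrinsics rather than the repository's inline-asm wrapper, assumes 64-byte cachelines (so a 256-byte object takes four CLWBs), and all names (persist_unordered, cow_insert_commit, new_node, old_node, parent_slot) are hypothetical:

#include <immintrin.h>   /* _mm_clwb, _mm_sfence; build with -mclwb */
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64
#define NODE_SIZE 256    /* size of the new object in the example above */

/* Write back every cacheline covering a region. No fence separates the
 * CLWBs, so they are free to reorder among themselves. */
static void persist_unordered(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHE_LINE_SIZE)
        _mm_clwb((void *)p);
}

/* Copy-on-write commit: persist the new 256-byte node with four unordered
 * CLWBs, fence once, then publish and persist the 8-byte parent pointer. */
void cow_insert_commit(char *new_node, const char *old_node, char **parent_slot)
{
    memcpy(new_node, old_node, NODE_SIZE);   /* copy existing keys and values */
    persist_unordered(new_node, NODE_SIZE);  /* four CLWBs, order irrelevant  */
    _mm_sfence();     /* ensure the 256 bytes are durable before committing  */
    *parent_slot = new_node;                 /* the single 8-byte commit store */
    _mm_clwb(parent_slot);                   /* persist the parent pointer    */
    _mm_sfence();
}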
I know this optimization can be used in out-of-place modify operations. But can it also be used in in-place modify operations (such as in FAST&FAIR)? I found that your version of FAST&FAIR also uses this optimization.
Our implementation of FAST&FAIR (like the original implementation) calls a CLWB at every cacheline boundary, bracketed by two fences, when it shifts key-value pairs within a node. So we do not allow CLWBs on different cachelines to be reordered. In the insertion function, it calls a CLWB after identifying each cacheline boundary.
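For illustration, here is a minimal sketch (not the actual FAST&FAIR code) of flushing at each cacheline boundary during an in-place shift; entry_t, records, num_entries, and shift_and_flush are hypothetical names, and clflush is the wrapper quoted below:

#define CACHE_LINE_SIZE 64

typedef struct { unsigned long key; void *ptr; } entry_t; /* hypothetical layout */

void clflush(char *data, int len); /* the clflush wrapper quoted below */

/* Shift entries right by one to open a slot at 'pos'. Whenever the write
 * cursor starts a new cacheline, flush it immediately. Because the wrapper
 * brackets each call with mfence(), flushes of different cachelines are
 * strictly ordered with respect to one another. */
void shift_and_flush(entry_t *records, int num_entries, int pos)
{
    for (int i = num_entries - 1; i >= pos; --i) {
        records[i + 1] = records[i];
        if (((unsigned long)&records[i + 1] & (CACHE_LINE_SIZE - 1)) == 0)
            clflush((char *)&records[i + 1], sizeof(entry_t));
    }
}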
Here is the wrapper function for cacheline flushes in our provided FAST&FAIR implementation.
static inline void clflush(char *data, int len)
{
    /* Round down to the first cacheline containing 'data'. */
    volatile char *ptr = (char *)((unsigned long)data & ~(CACHE_LINE_SIZE - 1));
    mfence();
    for (; ptr < data + len; ptr += CACHE_LINE_SIZE) {
        /* Emulated NVM write latency: spin until 'etsc' after each flush. */
        unsigned long etsc = read_tsc() +
            (unsigned long)(write_latency_in_ns * CPU_FREQ_MHZ / 1000);
#ifdef CLFLUSH
        asm volatile("clflush %0" : "+m" (*(volatile char *)ptr));
#elif CLFLUSH_OPT
        /* The 0x66 prefix encodes CLFLUSHOPT on top of CLFLUSH. */
        asm volatile(".byte 0x66; clflush %0" : "+m" (*(volatile char *)ptr));
#elif CLWB
        /* The 0x66 prefix turns the XSAVEOPT encoding into CLWB. */
        asm volatile(".byte 0x66; xsaveopt %0" : "+m" (*(volatile char *)ptr));
#endif
        while (read_tsc() < etsc)
            cpu_pause();
    }
    mfence();
}
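Note that both mfence() calls live inside the wrapper, so every call to clflush() is ordered against the stores and flushes around it; only the flush instructions issued within a single call are free to reorder among themselves.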
For more details, please refer to the source code of our FAST&FAIR (link).
Thank you for your answer. I had overlooked those details.