sfu-dis / pibench

Benchmarking framework for index structures on persistent memory
MIT License

Put key, value and operation selection out of timing section #7

Closed YiranShu closed 3 years ago

YiranShu commented 5 years ago

Originally, the throughput of running "./PiBench ./libstlmap_wrapper.so -r 1 --pcm=false" was around 440000 ops/s. Since the overhead of selecting keys, values, and operations is not negligible, moving the selection out of the timing section doubles the throughput. Now the throughput of running "./PiBench ./libstlmap_wrapper.so -r 1 --pcm=false" is around 1.07e+06 ops/s.

llersch commented 5 years ago

Hi,

thanks for the PR. I had previously tried pre-generating the keys, values, and operations to measure the overhead, and in my experience it was not significant. But this is a very critical aspect, and it's very good that other people are looking into it to make sure I'm not messing up.

Your approach will work for value_generator and operation_generator, but I'm not sure about the key_generator. The key_generator class has an internal buffer where it materializes the key. The pointer returned by next() is a pointer to this buffer, so you have to use the key right away because the next call to next() will reuse that buffer for the next key. I did this to remove any memory allocation from the code path.

Therefore, if you want to pre-generate the keys as well, you would have to do a memcpy() from the pointer returned by next() to somewhere else. It could be that in your approach all the key pointers you stored are pointing to the same key (the last one generated), in which case the final size of the std::map is going to be equal to the number of threads, which would explain the gain in performance. Could you check if that's the case?

YiranShu commented 5 years ago

Hi Lucas,

You are almost right, but there is a small mistake. Since the command I ran did not insert any records into the std::map, the size of the std::map stayed at its initial size, 1000000. But my misuse of key_generator made the program read the same key 1000000 times, which was the reason for the big performance improvement. After correcting my misuse of key_generator, the throughput of running "./PiBench ./libstlmap_wrapper.so -r 1 --pcm=false" is about 630000 ops/s, so the performance gain is only about 30%. As you said, I have to allocate space for every key before running the operations. Do you think this tradeoff is worthwhile?

llersch commented 5 years ago

Hi, it seems to be correct now. I did a quick benchmark (only a single run of each) to check the overhead and I get the following results:

4 Bytes Keys and 4 Bytes Values:

| Records / #Operations | master | Pull Request | Speed Up |
| ------------ | ------------- | ------------- | ------------- |
| 1M / 1M | 2.4052e+06 | 2.61208e+06 | 8% |
| 10M / 10M | 1.12582e+06 | 1.20844e+06 | 7% |
| 100M / 100M | 721060 | 743281 | 3% |

8 Bytes Keys and 8 Bytes Values:

| Records / #Operations | master | Pull Request | Speed Up |
| ------------ | ------------- | ------------- | ------------- |
| 1M / 1M | 2.17998e+06 | 2.31371e+06 | 8% |
| 10M / 10M | 1.06439e+06 | 1.1217e+06 | 7% |
| 100M / 100M | 673177 | 713533 | 3% |

This was a single-threaded read-only workload. I'm considering whether this speedup is worth the additional memory cost of pre-generating the workload: it introduces the limitation that the user can only use about half of the available memory for the data structure itself. We should check if the gains for multiple threads and other workloads are similar or higher before merging the PR.

XiangpengHao commented 5 years ago

Hi,

I've run a quick perf on dummy_wrapper.

It looks like read performance is considerably slower than insert performance, and in the read case the overhead comes from the key generator.

Any ideas on this behavior? [perf profile screenshot attached]