charles-typ commented 3 years ago

Hi @ooibc88 @cac2003 @guowentian

I ran some performance benchmarks on GAM that yield unexpected latency numbers when I increase the number of servers, and I was hoping to get some insights from you regarding them. Below are details on the experimental setup, methodology and results.

Experiment setup:

Two servers VM1 and VM2 with 512MB of local memory, and all memory used as cache.
One server VM3 with all available DRAM used as local memory (~10GB), and no cache.

Therefore VM1 and VM2 fetch data from VM3, and keep it in their local cache.

Method:

I replayed several memory traces captured from different applications against GAM, under two scenarios (listed below), and recorded the execution time for both of them. The memory footprint of the application (~1GB) is larger than the local cache size (512MB), so there are evictions along with invalidations. All memory accesses are 1 byte.

Scenario 1: Replay the memory traces for 10 threads on VM1, keep VM2 idle.
Scenario 2: Replay the memory traces for 10 threads on VM1 and 10 threads on VM2; this means that there are invalidations between the VMs due to shared memory accesses.
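
For reference, the replay step in both scenarios has roughly the shape sketched below: one thread per trace, with wall-clock time measured around the whole run. This is only a minimal sketch, not the actual harness. `TraceOp`, `ReplayTraces`, and the `access` callback are hypothetical names; `access` stands in for whatever read/write call the system under test exposes.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical trace record: one 1-byte access at a given address.
struct TraceOp {
    std::uint64_t addr;
    bool is_write;
};

// Replays each per-thread trace on its own thread and returns the elapsed
// wall-clock time in seconds. `access` is an assumed stand-in for the
// memory read/write call of the system being measured.
double ReplayTraces(const std::vector<std::vector<TraceOp>>& per_thread,
                    const std::function<void(const TraceOp&)>& access) {
    assert(access);  // an empty callback would throw on first use
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    workers.reserve(per_thread.size());
    for (const auto& trace : per_thread) {
        workers.emplace_back([&trace, &access] {
            for (const TraceOp& op : trace) access(op);
        });
    }
    for (auto& t : workers) t.join();

    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Scenario 1 corresponds to calling this with 10 traces on VM1 only; Scenario 2 runs the same call on VM1 and VM2 at the same time.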

Results:

I expected Scenario 2 to be slower due to more invalidations between VM1 and VM2, but found Scenario 2 was actually faster than Scenario 1.

To understand the results better, I profiled the memory access latency in GAM, separating the latency for local and remote memory accesses (as shown in the table below; only measured for read operations, since write operations are always asynchronous under the PSO model).

| Scenario | Local access latency (us) | Remote access latency (us) |
| --- | --- | --- |
| Scenario 1 | 2.2 | 299 |
| Scenario 2 | 1.4 | 84 |
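
For context, per-category averages like the ones above can be collected with a small timing wrapper around each read. The sketch below is an illustration under stated assumptions, not the instrumentation actually used. `LatencyStats`, `ReadProfile`, and `TimedRead` are hypothetical names, and the sketch assumes the read call can report whether it was served locally. Each thread is assumed to keep its own `ReadProfile`, since the struct is not thread-safe.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Running sum and count of latencies, in microseconds.
struct LatencyStats {
    std::uint64_t count = 0;
    double total_us = 0.0;

    void Record(double us) {
        assert(us >= 0.0);
        ++count;
        total_us += us;
    }
    // Mean latency, or 0 if nothing has been recorded yet.
    double MeanUs() const { return count ? total_us / count : 0.0; }
};

// One profile per thread: separate buckets for local and remote reads.
struct ReadProfile {
    LatencyStats local;
    LatencyStats remote;
};

// Times a single read and files it under local or remote.
// Assumption: `read_fn` performs the read and returns true if it was
// served from local memory/cache, false if it went to a remote node.
template <typename ReadFn>
void TimedRead(ReadProfile& profile, ReadFn&& read_fn) {
    const auto start = std::chrono::steady_clock::now();
    const bool served_locally = read_fn();
    const std::chrono::duration<double, std::micro> elapsed =
        std::chrono::steady_clock::now() - start;
    (served_locally ? profile.local : profile.remote).Record(elapsed.count());
}
```

Merging the per-thread profiles after the run (summing `count` and `total_us`) gives the overall means. Keeping a histogram instead of a mean would also show whether a few very slow remote reads dominate the Scenario 1 average.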

Even though Scenario 2 incurs invalidations, its remote access latency is lower than in Scenario 1 (84 us vs. 299 us). Local memory accesses are also slightly faster in Scenario 2 (1.4 us vs. 2.2 us).

Despite extensive profiling, I was unable to explain this strange behavior. Is this expected? If so, why? Thank you for taking the time to read this issue. I would really appreciate any help!

Second222None commented 8 months ago

Hi @charles-typ, @guowentian, @ooibc88, @cac2003. I am doing something similar: adapting GAM to run on RoCE. However, when I ran ./scripts/benchmark-all.sh with 3 VMs, I got a segmentation fault.

Can you give some advice? Thanks in advance!

...
cannot find the key for hash table widCliMap (key not found in table)
...
(gdb) bt
#0  0x00000000004503a2 in std::atomic_flag::test_and_set (__m=std::memory_order_acquire, this=0x1b399ff0)
    at /usr/include/c++/10.3.1/bits/atomic_base.h:202
#1  cuckoohash_map<unsigned int, Client*, CityHasher<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, Client*> >, 4ul>::spinlock::lock (this=0x1b399ff0)
    at ../include/../lib/libcuckoo/src/cuckoohash_map.hh:164
#2  cuckoohash_map<unsigned int, Client*, CityHasher<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, Client*> >, 4ul>::lock_two (i2=6579, i1=34142, hp=<optimized out>, this=0x7ffff5ba3018)
    at ../include/../lib/libcuckoo/src/cuckoohash_map.hh:784
#3  cuckoohash_map<unsigned int, Client*, CityHasher<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, Client*> >, 4ul>::snapshot_and_lock_two (hv=<optimized out>, this=<optimized out>)
    at ../include/../lib/libcuckoo/src/cuckoohash_map.hh:833
#4  cuckoohash_map<unsigned int, Client*, CityHasher<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, Client*> >, 4ul>::find (val=<synthetic pointer>: <optimized out>, key=@0x7fffc670dd3c: 20,
    this=0x7ffff5ba3018) at ../include/../lib/libcuckoo/src/cuckoohash_map.hh:473
#5  cuckoohash_map<unsigned int, Client*, CityHasher<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, Client*> >, 4ul>::find (key=@0x7fffc670dd3c: 20, this=0x7ffff5ba3018)
    at ../include/../lib/libcuckoo/src/cuckoohash_map.hh:483
#6  HashTable<unsigned int, Client*>::at (key=@0x7fffc670dd3c: 20, this=0x7ffff5ba3018) at ../include/hashtable.h:61
#7  Server::FindClient (this=0x7ffff5ba3010, qpn=<optimized out>) at server.cc:168
#8  0x000000000045051c in Server::ProcessRdmaRequest (this=0x7ffff5ba3010, wc=...) at server.cc:38
#9  0x0000000000422e47 in Worker::StartService (w=0x7ffff5ba3010) at worker.cc:186
#10 0x00007ffff7e64270 in ?? () from /usr/lib64/libstdc++.so.6
#11 0x00007ffff7b234ca in ?? () from /usr/lib64/libc.so.6
#12 0x00007ffff7ba5ec0 in ?? () from /usr/lib64/libc.so.6

charles-typ commented 8 months ago

@Second222None

I suggest you open a separate issue for this. I'm not sure what is causing your problem, but you could take a look at my forked repo. I made a significant number of fixes to get GAM working, so I hope the changes there are of some help.

ooibc88 / gam

Abnormal memory access latency when using multiple servers #8
