Do you need to access the data frequently from all available NUMA nodes? Wouldn't it be better to simply partition your workload between the sockets and then make sure that threads only access their local data?
Accessing remote memory is always going to be slower, and it's generally better to avoid doing that when possible. I'm surprised that you are observing the same bandwidth numbers for experiments 1 and 2. The second one should be slower (by how much exactly depends on your platform); this might indicate that you are not using DAX and that the reads are being buffered in DRAM. Please verify that is_pmem is set to true once the file is mapped.
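For reference, a minimal sketch of how that check can be done with libpmem (illustrative code, not from this thread; compile with -lpmem):

```c
/* Map a file with libpmem and confirm that is_pmem comes back true,
 * i.e. the mapping is DAX-backed rather than going through the page cache. */
#include <libpmem.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <path-to-pmem-file>\n", argv[0]);
        return 1;
    }

    size_t mapped_len;
    int is_pmem;
    /* len = 0 and flags = 0: map the entire existing file */
    void *addr = pmem_map_file(argv[1], 0, 0, 0, &mapped_len, &is_pmem);
    if (addr == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    printf("mapped %zu bytes, is_pmem = %d\n", mapped_len, is_pmem);
    /* is_pmem == 0 means the file is not on a DAX mount and reads/writes
     * are buffered in DRAM. */

    pmem_unmap(addr, mapped_len);
    return 0;
}
```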
We recommend VTune for investigating memory performance bottlenecks. See this article on how to evaluate your workload.
Hi Piotr,
Re (1): I agree that accessing remote memory is going to be slower. We are using a work-stealing system that, in the DRAM setting, handles remote reads with performance close to that of NUMA-optimized systems (at least in our experiments on 4-socket machines). What surprises me is how much slower accessing NVM across sockets is than restricting threads to local memory only. I am wondering whether the magnitude of the slowdown we observed (3x) is consistent with what you would expect; it seems suspiciously large to me, but I am still a novice at optimizing for NVM and am still calibrating my expectations.
Re (2): is_pmem is set to true once the file is mapped. Looking at htop and free -h, the system does not appear to be caching the reads in DRAM. I am not sure what is causing the stickiness, although it is something we have consistently observed.
Optimizing our code to completely avoid cross-socket NVM reads resulted in a significant speedup over the version that performs cross-socket NVM reads. In the coming week or so I plan to demonstrate this slowdown with an easy-to-understand and easily reproducible benchmark, and I will report back here.
Thanks again for your help, Laxman
There are many factors that might be contributing to your observed slowdown, so it's difficult for me to say definitively whether your performance is in line with expectations.
I'm assuming that the NVM you are using is Intel® Optane™ DC Persistent Memory, correct? Due to the difference between the access granularity of the current generation of DIMMs (256 bytes) and the size of a cache line (64 bytes), some traffic patterns will see only 1/4 of the ideal-case bandwidth. This can degrade further with remote access.
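As a toy illustration of that effect (hypothetical functions, not code from this issue): a sequential 64-byte-stride scan touches all four cache lines of each 256-byte media block, while a 256-byte-stride scan touches only one cache line per block, so each media read services only one cache-line fill.

```c
#include <stddef.h>
#include <stdint.h>

/* Sequential scan: the four consecutive 64-byte cache lines of each
 * 256-byte media block are all touched, so one media read on the DIMM
 * services four cache-line fills. */
uint64_t scan_sequential(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i += 64)
        sum += buf[i];
    return sum;
}

/* 256-byte-strided scan: each touched cache line sits in a different
 * 256-byte media block, so one media read services only one cache-line
 * fill, and bandwidth can drop to roughly a quarter of the sequential case. */
uint64_t scan_strided(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i += 256)
        sum += buf[i];
    return sum;
}
```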
You might find this paper to be an interesting read: https://arxiv.org/abs/1903.05714
Yes, we are using exactly that memory. Thanks for your suggestions, and the link to the paper!
I also should mention that the numa_balancing flag does not seem to affect this issue (the slowdown when moving from 1 to 2 sockets happens independently of the value of this flag).
Laxman
QUESTION: Reading an mmap'd file stored on one socket from threads across 2 sockets causes performance issues for a read-only analytics workload
Details
The experiment was run on a machine with 2 sockets, each with 500G of NVM, both configured in AppDirect mode. We store a large (~400G) file on one of the sockets. The file is opened using pmem_map_file(...). We then read the file using multiple threads; the reads can be thought of as large sequential scans per thread (each thread gets some block of the file to read and processes that block sequentially). Say the file is stored on socket 0. There are three experiments:
1) Reading only using threads on the same socket the file is stored on (numactl -N 0 -m 0). This uses 48 hyper-threads.
2) Reading only using threads on the other socket (numactl -N 1 -m 1). This also uses 48 hyper-threads.
3) Reading from both sockets (numactl -i all). This uses all 96 hyper-threads on the machine.
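A simplified sketch of this access pattern (placeholder code, not our actual benchmark; timing and throughput measurement are omitted):

```c
/* Each thread sequentially scans its own contiguous block of the
 * pmem-mapped file.
 * Build with: cc scan.c -o scan -lpmem -lpthread
 * Run e.g.:   numactl -N 0 -m 0 ./scan /mnt/pmem0/file 48  (path is illustrative) */
#include <libpmem.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct task { const uint8_t *base; size_t len; uint64_t sum; };

static void *scan_block(void *arg)
{
    struct task *t = arg;
    uint64_t sum = 0;
    for (size_t i = 0; i < t->len; i += 64)   /* touch one byte per cache line */
        sum += t->base[i];
    t->sum = sum;                             /* keep the result so reads aren't elided */
    return NULL;
}

int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <pmem-file> <nthreads>\n", argv[0]);
        return 1;
    }
    int nthreads = atoi(argv[2]);

    size_t len;
    int is_pmem;
    const uint8_t *base = pmem_map_file(argv[1], 0, 0, 0, &len, &is_pmem);
    if (base == NULL) { perror("pmem_map_file"); return 1; }
    printf("mapped %zu bytes, is_pmem = %d\n", len, is_pmem);

    pthread_t *tid = malloc(nthreads * sizeof *tid);
    struct task *tasks = calloc(nthreads, sizeof *tasks);
    size_t chunk = len / nthreads;
    for (int i = 0; i < nthreads; i++) {
        tasks[i].base = base + (size_t)i * chunk;
        tasks[i].len  = (i == nthreads - 1) ? len - (size_t)i * chunk : chunk;
        pthread_create(&tid[i], NULL, scan_block, &tasks[i]);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);

    pmem_unmap((void *)base, len);
    return 0;
}
```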
In cases (1) and (2) the time to perform the read is roughly what we expect (we are getting ~40Gb/s read throughput).
For (3), the performance degrades by a factor of 3.
There also seems to be some "stickiness": if the file is accessed from socket 0, then re-running the application with numactl -N 0 -m 0 continues to perform well. If we then switch to socket 1, the first run is slower by about a factor of 2, and subsequent runs are just as fast. The same happens if we then switch back to socket 0.
It would be really helpful if you could suggest any explanations for the observed behavior. It is entirely possible we are using the library incorrectly.
Our current work-around is to store the file twice, one copy on each socket's memory, and have threads access the copy in their local memory. However, this wastes 2x memory (in general, (num-sockets)x), and it would be great to understand how to avoid it.
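For concreteness, a sketch of how the per-socket copy could be selected at runtime (the paths, NSOCKETS, and helper names are assumptions for illustration, using libpmem and libnuma):

```c
/* Map one replica of the file per socket and have each thread read from
 * the replica on its own NUMA node. Build with: cc local_copy.c -lpmem -lnuma */
#define _GNU_SOURCE
#include <libpmem.h>
#include <numa.h>      /* numa_available(), numa_node_of_cpu() */
#include <sched.h>     /* sched_getcpu() */
#include <stdio.h>

#define NSOCKETS 2

static void *copies[NSOCKETS];   /* copies[n] = mapping of the replica on node n */

/* Map the replica stored on each socket's pmem. */
static int map_replicas(const char *paths[NSOCKETS], size_t *len_out)
{
    for (int n = 0; n < NSOCKETS; n++) {
        int is_pmem;
        copies[n] = pmem_map_file(paths[n], 0, 0, 0, len_out, &is_pmem);
        if (copies[n] == NULL || !is_pmem) {
            fprintf(stderr, "failed to map %s as pmem\n", paths[n]);
            return -1;
        }
    }
    return 0;
}

/* Called by each worker thread: return the mapping local to the CPU it is
 * currently running on, so all of its reads stay on-socket. */
static const void *local_copy(void)
{
    int node = numa_node_of_cpu(sched_getcpu());
    if (node < 0 || node >= NSOCKETS)   /* fall back if the node is unknown */
        node = 0;
    return copies[node];
}

int main(void)
{
    if (numa_available() < 0)
        return 1;
    /* hypothetical paths to the two replicas */
    const char *paths[NSOCKETS] = { "/mnt/pmem0/graph.bin", "/mnt/pmem1/graph.bin" };
    size_t len;
    if (map_replicas(paths, &len) != 0)
        return 1;
    const void *copy = local_copy();
    printf("mapped %zu bytes per replica; this CPU's local copy is %p\n",
           len, (void *)copy);
    /* worker threads would call local_copy() and scan the returned mapping */
    return 0;
}
```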
Thanks for your help,