jeromekelleher opened 2 months ago
This may be a useful tool for finding good Tensorstore configuration settings: https://github.com/royerlab/czpeedy (via https://github.com/cubed-dev/cubed/issues/513)
Looks like that's just for write performance, but definitely worth having a look at :+1:
I am certainly happy to investigate this issue! Before I start implementing afdist with TensorStore's Python API, I wonder if you built the project in Release mode (i.e. `cmake -DCMAKE_BUILD_TYPE=Release ..`). The performance I observed when working in Debug mode matches the performance that you are describing. However, I just checked out the latest version, built it in Release mode, and found the Release-mode performance to be similar to the Python Zarr implementation:
time ./software/ts-afdist/build/ts_afdist scaling/data/chr21_10_4.zarr/call_genotype
# PROB_DIST, genotype probability distribution, assumes HWE
PROB_DIST 0 0.1 44835153
PROB_DIST 0.1 0.2 46629516
PROB_DIST 0.2 0.3 61330193
PROB_DIST 0.3 0.4 81254580
PROB_DIST 0.4 0.5 188462062
PROB_DIST 0.5 0.6 18017770
PROB_DIST 0.6 0.7 20193848
PROB_DIST 0.7 0.8 18465643
PROB_DIST 0.8 0.9 19033680
PROB_DIST 0.9 1 71487987
./software/ts-afdist/build/ts_afdist 30.68s user 1.21s system 96% cpu 32.941 total
python src/collect_data.py whole-matrix-compute scaling/data/chr21_10_4.ts -t zarr
scaling/data/chr21_10_4.ts n=10000, m=863998
num_samples num_sites tool user_time sys_time wall_time storage
0 10000 863998 zarr 31.880481 1.641813 30.567949 hdd
If you are already building in Release mode, can you describe your informal testing? I would like to replicate the problem to debug it.
P.S. I found `--parallel` helps make the build faster.
Thanks for the tip on Release mode @Will-Tyler - you're right, I'd forgotten that.
This is what I get when I run the benchmarks (added in #162):
num_samples num_sites tool user_time sys_time wall_time storage
1 10 116230 ts_cpp 0.27 0.04 0.344683 hdd
0 10 116230 zarr 0.27 0.01 0.288311 hdd
3 100 204714 ts_cpp 0.66 0.04 0.725902 hdd
2 100 204714 zarr 0.52 0.01 0.541813 hdd
5 1000 403989 ts_cpp 2.10 0.07 2.196294 hdd
4 1000 403989 zarr 1.83 0.08 1.920323 hdd
7 10000 863998 ts_cpp 44.12 0.39 44.371482 hdd
6 10000 863998 zarr 30.59 0.13 31.309266 hdd
9 100000 2365367 ts_cpp 1238.77 6.34 1239.721479 hdd
8 100000 2365367 zarr 882.03 18.54 1271.120435 hdd
If we focus on the user_time column, we can see there's a substantial difference between tensorstore and Zarr python there. (The 100k samples result is confusing when you look at wall_time because the disk cache was warm for ts_cpp, but not for zarr.)
Looking at top for ts_cpp, I can see that tensorstore is running a bunch of background threads to do the fetching and decoding. Perhaps this is contributing to the overhead? It shouldn't be that much, though.
Basically, I don't mind if we poke at this in C++ or Python - I'd just find it easier to understand what Tensorstore is doing if we go through the Python API.
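For reference, this is roughly the kind of chunk-by-chunk loop I have in mind for the Python version - a minimal sketch using the tensorstore Python API, with the per-chunk processing elided:

```python
import tensorstore as ts

# Open the call_genotype array from the file-backed Zarr store.
arr = ts.open({
    "driver": "zarr",
    "kvstore": {"driver": "file",
                "path": "scaling/data/chr21_10_4.zarr/call_genotype"},
}).result()

# Iterate a chunk's worth of variants at a time, reading synchronously.
chunk_len = arr.chunk_layout.read_chunk.shape[0]
num_variants = arr.shape[0]
for start in range(0, num_variants, chunk_len):
    stop = min(start + chunk_len, num_variants)
    chunk = arr[start:stop].read().result()  # numpy array for this slab
    # ... process the chunk (e.g. count genotypes) ...
```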
Cutting out the bits where we process the chunks, here's the perf report for ts_py:
29.63% ts_pool_worker _tensorstore.cpython-39-x86_64-linux-gnu.so [.] blosc_internal_bshuf_shuffle_bit_eightelem_scal
15.71% python3 [kernel.kallsyms] [k] syscall_enter_from_user_mode
10.02% ts_pool_worker libc-2.31.so [.] __memmove_sse2_unaligned_erms
4.49% ts_pool_worker _tensorstore.cpython-39-x86_64-linux-gnu.so [.] blosc_internal_bshuf_trans_byte_bitrow_sse2
4.01% python3 [kernel.kallsyms] [k] __x64_sys_futex
3.85% python3 [kernel.kallsyms] [k] __schedule
3.58% python3 libc-2.31.so [.] __sched_yield
3.19% ts_pool_worker libc-2.31.so [.] __memset_erms
2.56% python3 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
2.21% ts_pool_worker _tensorstore.cpython-39-x86_64-linux-gnu.so [.] ZSTD_decompressBlock_internal.part.0
1.15% python3 python3.9 [.] _PyEval_EvalFrameDefault
1.00% python3 [kernel.kallsyms] [k] do_sched_yield
0.96% ts_pool_worker [kernel.kallsyms] [k] clear_page_erms
0.85% python3 [kernel.kallsyms] [k] do_syscall_64
0.59% python3 [kernel.kallsyms] [k] get_timespec64
vs Zarr python (which is much quicker):
37.77% python3 blosc.cpython-39-x86_64-linux-gnu.so [.] blosc_internal_bshuf_shuffle_bit_eightelem_scal
8.30% python3 [kernel.kallsyms] [k] __schedule
7.66% python3 libc-2.31.so [.] __sched_yield
7.65% python3 [kernel.kallsyms] [k] syscall_enter_from_user_mode
7.05% python3 blosc.cpython-39-x86_64-linux-gnu.so [.] blosc_internal_bshuf_trans_byte_bitrow_sse2
3.38% python3 blosc.cpython-39-x86_64-linux-gnu.so [.] ZSTD_decompressBlock_internal
2.63% python3 python3.9 [.] _PyEval_EvalFrameDefault
2.09% python3 [kernel.kallsyms] [k] do_sched_yield
Hmm - so looks like the threaded async approach tensorstore is using is adding substantial overhead here. I'll dig a bit more.
Tweaking the tensorstore context options a bit doesn't seem to help much. With just loading the chunks into memory we get:
num_samples num_sites tool user_time sys_time wall_time storage
0 10000 863998 ts_py 39.24 5.78 27.97086 hdd
num_samples num_sites tool user_time sys_time wall_time storage
0 10000 863998 zarr 13.82 0.12 13.947528 hdd
which is a huge disparity, given tensorstore should be doing very little here.
Hmmm, looking at strace output seems to confirm that tensorstore really is reading the chunk files once, although it may be doing it with async APIs rather than sync file system calls.
What is your test harness for this? What's the zarr spec? I can try to see if there are any C++ inefficiencies to remove here.
Hi @laramiel :wave: Any help would be much appreciated!
We have two different implementations: the C++ one that we started out with, and a Python one that we added to make playing around with it a bit easier.
The easiest way to run this and compare the implementations is using the dataset we have created for benchmarking, and the benchmarking script. This should get you up and running:
1. In the `scaling` directory, run `make data/chr21_10_1.ts data/chr21_10_4.zarr`. This should work to get you the file you want with a minimum of extra stuff :crossed_fingers: The files are pretty big so it'll take a while.
2. Run `python3 src/collect_data.py whole-matrix-compute -t zarr -t ts_py scaling/data/chr21_10_4.ts`. This will output the stats comparing the Zarr Python and Tensorstore Python implementations.

For simplicity, you may want to factor out the actual processing of the chunks. You can do this by replacing the function `count_genotypes_chunk` with a no-op (see the sketch at the end of this comment). These are the numbers I get on my (busy) machine when the two implementations are just iterating chunk-by-chunk and not doing any processing:
$ python3 src/collect_data.py whole-matrix-compute -t zarr -t ts_py scaling/data/chr21_10_4.ts
num_samples num_sites tool user_time sys_time wall_time storage
1 10000 863998 ts_py 45.11 9.73 34.475312 hdd
0 10000 863998 zarr 17.88 0.12 18.026047 hdd
The overall CPU time cost is over 2X here for Tensorstore, which seems unnecessary?
The benchmarking setup is a bit convoluted sorry, it's something that's evolved over time. Please do let me know if you hit any problems, I'd be more than happy to help! Thanks again for taking a look.
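For concreteness, by "no-op" I mean something along these lines - the real `count_genotypes_chunk` signature in `src/collect_data.py` may differ, this is just the idea:

```python
# Hypothetical sketch: stub out the per-chunk processing in src/collect_data.py
# so that only chunk iteration and decompression are measured.
def count_genotypes_chunk(*args, **kwargs):
    pass  # no-op: skip the allele counting entirely
```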
@jeromekelleher I'm not seeing the c++ tensorstore implementation slowness on my desktop:
num_samples num_sites tool user_time sys_time wall_time storage
1 10000 863998 ts_cpp 15.73 0.11 15.932257 hdd
2 10000 863998 ts_py 60.19 8.88 69.118334 hdd
0 10000 863998 zarr 19.83 0.05 19.896405 hdd
Will run the larger dataset overnight.
Both of those have a very similar structure where they issue a read and then wait via `.Wait()` or `.result()` after the read.
It might be faster if they were converted to async methods, though you might need a semaphore to limit memory use.
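For example, something along these lines might work - a sketch, not tested, which keeps a bounded window of outstanding reads rather than an explicit semaphore (same memory-limiting effect); tensorstore read futures can be awaited from asyncio:

```python
import asyncio
from collections import deque

async def process_chunks_async(arr, process_chunk, chunk_len, max_in_flight=4):
    """Keep up to max_in_flight chunk reads outstanding; process results in order."""
    num_variants = arr.shape[0]
    in_flight = deque()
    for start in range(0, num_variants, chunk_len):
        stop = min(start + chunk_len, num_variants)
        # .read() starts the read immediately and returns an awaitable future.
        in_flight.append(arr[start:stop].read())
        if len(in_flight) >= max_in_flight:
            process_chunk(await in_flight.popleft())
    while in_flight:
        process_chunk(await in_flight.popleft())

# usage: asyncio.run(process_chunks_async(arr, process_chunk, chunk_len))
```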
> @jeromekelleher I'm not seeing the c++ tensorstore implementation slowness on my desktop:
That's interesting - @Will-Tyler noted something similar. I'll rerun and see what we get.
Still not obvious why the Python version is using so much more CPU though. I see the argument for sync vs async, but I would have thought that this should only affect the overall wall-time rather than actual CPU cycles used (user time)?
`ts_py` was spending a lot of its time copying data around in numpy calls. Here's where I have got it to (will PR):
num_samples num_sites tool user_time sys_time wall_time storage
1 10000 863998 ts_cpp 15.93 0.07 16.093759 hdd
2 10000 863998 ts_py 20.65 0.13 20.871725 hdd
0 10000 863998 zarr 20.18 0.04 20.228303 hdd
That's super - exactly what we'd hope to see. And excellent that TensorStore c++ is outperforming the Python code!
Here's what I'm getting currently:
num_samples num_sites tool user_time sys_time wall_time storage
11 10000 863998 ts_cpp 44.01 0.44 44.289501 hdd
10 10000 863998 ts_py 75.32 13.66 54.940146 hdd
9 10000 863998 zarr 30.62 0.17 30.832301 hdd
It must be something to do with the setup on my server I guess?
Are you working off an SSD or spinning platters here @benjeffery?
> Are you working off an SSD or spinning platters here @benjeffery?
I've tested both and seen little difference. I've also just realised that you merged #164 yesterday afternoon - my previous python numbers were with code before that. :facepalm:
Here are the numbers with that included:
num_samples num_sites tool user_time sys_time wall_time storage
1 10000 863998 ts_cpp 16.01 0.09 16.189232 hdd
2 10000 863998 ts_py 23.96 0.09 24.145617 hdd
0 10000 863998 zarr 20.43 0.04 20.498896 hdd
Good to know. So it must be something specific to my setup - the machine and OS are quite old, so it could be that some important compile-time optimisations are missing or something.
Did you do anything funky with the Tensorstore build?
> Did you do anything funky with the Tensorstore build?

Nope, nothing at all - I just installed cmake and ran the build as it is configured in the repo.
I'm running:

- cmake version 3.30.2
- NASM version 2.15.05
- gcc version 11.4.0
Well this really is bizarre. After upgrading to the latest version of the Tensorstore Python module (which I'm not compiling - I'm using the distributed binaries) I get this:
num_samples num_sites tool user_time sys_time wall_time storage
2 10000 863998 ts_cpp 58.77 0.72 59.302641 hdd
1 10000 863998 ts_py 89.10 0.50 89.613749 hdd
0 10000 863998 zarr 31.43 0.09 31.563379 hdd
So, the Python tensorstore version here is nearly a full minute slower than the Zarr version!
Digging in with perf I get this for the Zarr python version:
and this for the Tensorstore Python version:
So in both cases, the majority of the time is spent in Blosc, as you'd expect, but one takes 3X the time of the other. The processor here is old (Xeon E5-2680) and that probably explains the disparity with @benjeffery's results.
I can think of two possibilities:
@benjeffery - could you run these through perf record also, so we can see what's being called under the hood on your machine?
It would be nice if we could just say "oh well it doesn't work on old machines :shrug: " but this is the hardware I have available for doing these benchmarks, and they need a lot of storage and CPU time. I'm not keen on moving them somewhere else, so if we want to include Tensorstore we'll have to figure out a way of solving the issue.
So it's possible that we're missing the `-msse2` flag when building blosc (it's a default option in blosc's CMake build). I'll try updating that and see if it helps.
Edit: Nah, `-msse2` should be the default on all x86_64 targets anyway.
Edit: I noticed that you set the concurrency limits pretty low. Was there a reason for this?
"data_copy_concurrency": {"limit": 1},
"file_io_concurrency": {"limit": 1},
https://google.github.io/tensorstore/kvstore/file/index.html
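For reference, a minimal sketch of how those context resources get passed through the Python API - the values here are just the ones quoted above (the defaults are based on the number of CPUs):

```python
import tensorstore as ts

# Context resources as quoted above; whether limiting them helps is exactly
# what was being tested.
context = ts.Context({
    "data_copy_concurrency": {"limit": 1},
    "file_io_concurrency": {"limit": 1},
})

arr = ts.open(
    {
        "driver": "zarr",
        "kvstore": {"driver": "file",
                    "path": "scaling/data/chr21_10_4.zarr/call_genotype"},
    },
    context=context,
).result()
```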
I haven't had a chance to look at your benchmarks and results in detail, but one thing to note is that the blosc codec in numcodecs (https://github.com/zarr-developers/numcodecs/blob/02d3ceec395481f3d84eb8f2f770e66aca6361b0/numcodecs/blosc.pyx#L478) used by zarr-python uses multi-threaded blosc under certain conditions that may apply in your benchmark. Tensorstore never uses multi-threaded blosc but instead can decode multiple chunks in parallel; in the case of your benchmark there is only one chunk at a time, though.
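One way to rule that variable out on the zarr-python side would be to pin numcodecs' blosc to a single thread before running the benchmark - a small sketch:

```python
import numcodecs.blosc

# Force single-threaded blosc decompression in zarr-python so the comparison
# with tensorstore's one-thread-per-chunk decoding is like-for-like.
numcodecs.blosc.use_threads = False
numcodecs.blosc.set_nthreads(1)
```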
Thanks for the input @laramiel and @jbms!
@laramiel
> Edit: I noticed that you set the concurrency limits pretty low. Was there a reason for this?
That was just me experimenting, trying to rule out the overhead of having background threads. As we're just working through the chunks synchronously, one at a time, there's nothing to gain from doing things in threads, and I was trying this to see if it made any difference. It doesn't.
@jbms
> zarr-python uses multi-threaded blosc under certain conditions
This should all be factored out by measuring user time - true, having threads in blosc could affect the wall time, but it shouldn't make any difference to the overall compute cost. Wall time and user time + sys time are basically the same here, so I don't think threads are having any effect.
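For concreteness, this is the kind of measurement I mean - a rough sketch, not the actual harness in `src/collect_data.py`. `getrusage` sums CPU time over all threads of the process, so work done on blosc or tensorstore worker threads still shows up in user time:

```python
import resource
import time

def measure(func, *args):
    """Return (user, sys, wall) seconds for func(*args)."""
    r0 = resource.getrusage(resource.RUSAGE_SELF)
    t0 = time.perf_counter()
    func(*args)
    wall = time.perf_counter() - t0
    r1 = resource.getrusage(resource.RUSAGE_SELF)
    # ru_utime/ru_stime cover every thread in the process, so multi-threaded
    # decompression is still counted here.
    return r1.ru_utime - r0.ru_utime, r1.ru_stime - r0.ru_stime, wall
```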
Here are my perf reports for Zarr and for Ts:
`__memmove_sse2_unaligned_erms` (only seen under tensorstore) is a larger fraction on the Xeon; that's expected, as ERMS was only introduced with Ivy Bridge and your Xeon is Sandy Bridge. That doesn't explain all the time, though, and it's interesting that zarr doesn't need to use it nearly as much.
Ah yes, interestingly you have `__memmove_avx_unaligned_erms` whereas I have `__memmove_sse2_unaligned_erms` on the old Xeon under Tensorstore.
My guess here is that Zarr is somehow handing Blosc memory that meets the stricter alignment requirements for SSE/AVX while Tensorstore is not, and so we don't see this `__memmove_XXX_unaligned_erms` call in Zarr. The newer generation of CPUs is better able to handle data that's not aligned.
It would be amazing if the penalty for not having aligned memory under SSE2 was this big, but I think it's the best guess right now?
I see there are some calls to `__memset_xxx` under Zarr, so perhaps this is a memset vs memmove thing?
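One quick way to test the alignment guess would be to check the data pointers of the arrays each implementation hands back - a sketch, where `chunk_zarr` and `chunk_ts` are placeholders for arrays read by zarr-python and tensorstore respectively:

```python
import numpy as np

def misalignment(a: np.ndarray, boundary: int = 32) -> int:
    """Byte offset of the array's data pointer modulo boundary (0 means aligned)."""
    return a.ctypes.data % boundary

# chunk_zarr / chunk_ts: placeholder arrays returned by the two implementations.
print("zarr:", misalignment(chunk_zarr), "tensorstore:", misalignment(chunk_ts))
```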
Interesting... When I run perf collecting the full call graph, it seems to be telling me that the memmove is happening when Tensorstore copies data out (in CopyReadChunk). I think there might be some room for improvement there @laramiel @jbms?
What I don't quite understand is how dissimilar your perf-report cycle-count ratio is to your timing ratio: 31s vs 89s, compared with 161,859,771,726 vs 189,617,929,134 cycles. The sample count looks consistent, so tensorstore must be waiting more? Here's the result of running `perf sched record -a --` followed by `perf sched latency`, which shows there isn't much switching between python and the ts_pool_worker:
It also shows the time is equally shared; I'd be interested to see this for the Xeon.
That is puzzling all right - 3X penalty just for unaligned memory seems huge.
I'm not able to run the sched events in perf right now and have to context switch, I'm afraid!
Anecdotally, I noticed the same problem when running on some Azure VMs last week. I ran the Python Tensorstore version and it was way slower than Zarr Python. I didn't have time to characterise it properly, I'm afraid, but it does indicate that this is more than just a problem with crusty old hardware, and likely something that people will encounter in the real world.
I am able to reproduce this locally per your instructions:
python3 src/collect_data.py whole-matrix-compute -t zarr -t ts_py scaling/data/chr21_10_4.ts
scaling/data/chr21_10_4.ts n=10000, m=863998
num_samples num_sites tool user_time sys_time wall_time storage
0 10000 863998 zarr 10.43 0.09 10.524919 hdd
num_samples num_sites tool user_time sys_time wall_time storage
1 10000 863998 ts_py 14.75 0.17 14.962039 hdd
0 10000 863998 zarr 10.43 0.09 10.524919 hdd
Running both the tensorstore and zarr-python benchmarks with google-perftools separately, the additional time with tensorstore seems to be as follows:
(a) blosc_decompress: 11.12 seconds (tensorstore) vs 8.15 seconds (zarr-python). This accounts for most of the difference (in blosc_internal_bshuf_shuffle_bit_eightelem_scal specifically). More investigation will be needed to determine the source of the discrepancy here. The two likely explanations that come to mind are:
(b) An additional ~1 second is spent by tensorstore on an unnecessary memset of the output buffer used by blosc. This is pure overhead and we should be able to fix this relatively easily.
(c) An additional ~1 second is spent by tensorstore on a memcpy to the destination array from the chunk cache. Currently tensorstore always decodes first to the chunk cache, but we do have some work in progress to bypass the chunk cache in cases like this where the chunk cache has a total_bytes_limit of 0 and the read covers the entire chunk (see the sketch below for where this setting lives).
The additional memset and memcpy have a particularly strong impact on this benchmark because of the very high compression ratio --- 100 to 1.
We will definitely look into these 3 causes, and try to get to the bottom of the blosc shuffle performance discrepancy, and remove the unnecessary memset and memcpy.
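For anyone following along, the chunk cache referred to in (c) is tensorstore's `cache_pool` context resource; a minimal sketch of where that setting lives (a `total_bytes_limit` of 0 is also the default, so the benchmark hits exactly this case):

```python
import tensorstore as ts

# cache_pool is the chunk cache mentioned in (c): with total_bytes_limit 0
# (the default) decoded chunks are not retained between reads, but each read
# currently still decodes into the cache and is then copied out.
context = ts.Context({"cache_pool": {"total_bytes_limit": 0}})
arr = ts.open(
    {
        "driver": "zarr",
        "kvstore": {"driver": "file",
                    "path": "scaling/data/chr21_10_4.zarr/call_genotype"},
    },
    context=context,
).result()
```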
I'll note that this benchmark is clearly useful for identifying sources of overhead like these, but because of the very high compression ratio (meaning the I/O is negligible) and the fact that it involves just reading a single complete chunk at a time, it essentially just tests chunk decoding and not any of the indexing or I/O functionality of the zarr implementation. If the benchmark matches the actual use case then that is the right thing to test, though.
Fantastic, thanks for looking into this @jbms! I would love to include the Tensorstore benchmarks in the paper because cross language support and the potential to integrate with existing C++ based tools is a key part of our story. If you can help us with the Tensorstore overhead here we'd be very happy to include you (and anyone else involved) as authors on this paper.
I think the benchmark is a useful one, in that it follows a very common access pattern for the data, and implements the core allele classification operation that will be used a great deal, e.g., in dataset quality control. Certainly the compression ratios are realistic, and what we observe for real data.
The C++ af-dist implementation added in #157 (and notes in #160) is surprisingly slow. In some informal tests it looks like it's about 3X slower than Python Zarr, just retrieving the chunks. There's no good reason for this, so I suspect we must be using Tensorstore badly.
To make it easier to experiment with Tensorstore configuration settings, I think it would be very helpful to have a Python implementation of the afdist operation using the Python tensorstore API. The Python API is properly documented, and the (unbelievably slow!) C++ compile times don't have to be endured.
CPU time should be dominated by decompression, so we really should expect to have performance parity with the Python Zarr approach.
@Will-Tyler - fancy taking this one on?