jeromekelleher opened 2 months ago
This may be a useful tool for finding good Tensorstore configuration settings: https://github.com/royerlab/czpeedy (via https://github.com/cubed-dev/cubed/issues/513)
Looks like that's just for write performance, but definitely worth having a look at :+1:
I am certainly happy to investigate this issue! Before I start implementing afdist with TensorStore's Python API, I wonder if you built the project in Release mode (i.e. `cmake -DCMAKE_BUILD_TYPE=Release ..`). The performance I observed when working in Debug mode matches the performance that you are describing. However, I just checked out the latest version, built it in Release mode, and found the Release-mode performance to be similar to the Python Zarr implementation:
time ./software/ts-afdist/build/ts_afdist scaling/data/chr21_10_4.zarr/call_genotype
# PROB_DIST, genotype probability distribution, assumes HWE
PROB_DIST 0 0.1 44835153
PROB_DIST 0.1 0.2 46629516
PROB_DIST 0.2 0.3 61330193
PROB_DIST 0.3 0.4 81254580
PROB_DIST 0.4 0.5 188462062
PROB_DIST 0.5 0.6 18017770
PROB_DIST 0.6 0.7 20193848
PROB_DIST 0.7 0.8 18465643
PROB_DIST 0.8 0.9 19033680
PROB_DIST 0.9 1 71487987
./software/ts-afdist/build/ts_afdist 30.68s user 1.21s system 96% cpu 32.941 total
python src/collect_data.py whole-matrix-compute scaling/data/chr21_10_4.ts -t zarr
scaling/data/chr21_10_4.ts n=10000, m=863998
num_samples num_sites tool user_time sys_time wall_time storage
0 10000 863998 zarr 31.880481 1.641813 30.567949 hdd
If you are already building in Release mode, can you describe your informal testing? I would like to replicate the problem to debug it.
P.S. I found `--parallel` helps make the build faster.
Thanks for the tip on Release mode @Will-Tyler - you're right, I'd forgotten that.
This is what I get when I run the benchmarks (added in #162):
num_samples num_sites tool user_time sys_time wall_time storage
1 10 116230 ts_cpp 0.27 0.04 0.344683 hdd
0 10 116230 zarr 0.27 0.01 0.288311 hdd
3 100 204714 ts_cpp 0.66 0.04 0.725902 hdd
2 100 204714 zarr 0.52 0.01 0.541813 hdd
5 1000 403989 ts_cpp 2.10 0.07 2.196294 hdd
4 1000 403989 zarr 1.83 0.08 1.920323 hdd
7 10000 863998 ts_cpp 44.12 0.39 44.371482 hdd
6 10000 863998 zarr 30.59 0.13 31.309266 hdd
9 100000 2365367 ts_cpp 1238.77 6.34 1239.721479 hdd
8 100000 2365367 zarr 882.03 18.54 1271.120435 hdd
If we focus on the user_time column, we can see there's a substantial difference between tensorstore and Zarr python there. (The 100k samples result is confusing when you look at wall_time because the disk cache was warm for ts_cpp, but not for zarr.)
Looking at top for ts_cpp, I can see that tensorstore is running a bunch of background threads to do the fetching and decoding. Perhaps this is contributing to the overhead? It shouldn't be that much, though.
Basically, I don't mind if we poke at this in C++ or Python - I'd just find it easier to understand what Tensorstore is doing if we go through the Python API.
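For reference, this is roughly the kind of chunk-by-chunk loop I have in mind for the Python version - a minimal sketch using the tensorstore Python API, with the per-chunk processing elided:

```python
import tensorstore as ts

# Open the call_genotype array from the file-backed Zarr store.
arr = ts.open({
    "driver": "zarr",
    "kvstore": {"driver": "file",
                "path": "scaling/data/chr21_10_4.zarr/call_genotype"},
}).result()

# Iterate a chunk's worth of variants at a time, reading synchronously.
chunk_len = arr.chunk_layout.read_chunk.shape[0]
num_variants = arr.shape[0]
for start in range(0, num_variants, chunk_len):
    stop = min(start + chunk_len, num_variants)
    chunk = arr[start:stop].read().result()  # numpy array for this slab
    # ... process the chunk (e.g. count genotypes) ...
```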
Cutting out the bits where we process the chunks, here's the perf report for ts_py:
29.63% ts_pool_worker _tensorstore.cpython-39-x86_64-linux-gnu.so [.] blosc_internal_bshuf_shuffle_bit_eightelem_scal
15.71% python3 [kernel.kallsyms] [k] syscall_enter_from_user_mode
10.02% ts_pool_worker libc-2.31.so [.] __memmove_sse2_unaligned_erms
4.49% ts_pool_worker _tensorstore.cpython-39-x86_64-linux-gnu.so [.] blosc_internal_bshuf_trans_byte_bitrow_sse2
4.01% python3 [kernel.kallsyms] [k] __x64_sys_futex
3.85% python3 [kernel.kallsyms] [k] __schedule
3.58% python3 libc-2.31.so [.] __sched_yield
3.19% ts_pool_worker libc-2.31.so [.] __memset_erms
2.56% python3 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
2.21% ts_pool_worker _tensorstore.cpython-39-x86_64-linux-gnu.so [.] ZSTD_decompressBlock_internal.part.0
1.15% python3 python3.9 [.] _PyEval_EvalFrameDefault
1.00% python3 [kernel.kallsyms] [k] do_sched_yield
0.96% ts_pool_worker [kernel.kallsyms] [k] clear_page_erms
0.85% python3 [kernel.kallsyms] [k] do_syscall_64
0.59% python3 [kernel.kallsyms] [k] get_timespec64
vs Zarr python (which is much quicker):
37.77% python3 blosc.cpython-39-x86_64-linux-gnu.so [.] blosc_internal_bshuf_shuffle_bit_eightelem_scal
8.30% python3 [kernel.kallsyms] [k] __schedule
7.66% python3 libc-2.31.so [.] __sched_yield
7.65% python3 [kernel.kallsyms] [k] syscall_enter_from_user_mode
7.05% python3 blosc.cpython-39-x86_64-linux-gnu.so [.] blosc_internal_bshuf_trans_byte_bitrow_sse2
3.38% python3 blosc.cpython-39-x86_64-linux-gnu.so [.] ZSTD_decompressBlock_internal
2.63% python3 python3.9 [.] _PyEval_EvalFrameDefault
2.09% python3 [kernel.kallsyms] [k] do_sched_yield
Hmm - so looks like the threaded async approach tensorstore is using is adding substantial overhead here. I'll dig a bit more.
Tweaking the tensorstore context options a bit doesn't seem to help much. With just loading the chunks into memory we get:
num_samples num_sites tool user_time sys_time wall_time storage
0 10000 863998 ts_py 39.24 5.78 27.97086 hdd
num_samples num_sites tool user_time sys_time wall_time storage
0 10000 863998 zarr 13.82 0.12 13.947528 hdd
which is a huge disparity, given tensorstore should be doing very little here.
Hmmm, looking at strace output seems to confirm that tensorstore really is reading the chunk files once, although it may be doing it with async APIs rather than sync file system calls.
What is your test harness for this? What's the zarr spec? I can try to see if there are any C++ inefficiencies to remove here.
Hi @laramiel :wave: Any help would be much appreciated!
We have two different implementations: the C++ one that we started out with, and a Python one that we added to make playing around with it a bit easier.
The easiest way to run this and compare the implementations is using the dataset we have created for benchmarking, and the benchmarking script. This should get you up and running:
1. In the `scaling` directory, run `make data/chr21_10_1.ts data/chr21_10_4.zarr`. This should work to get you the file you want with a minimum of extra stuff :crossed_fingers: The files are pretty big so it'll take a while.
2. Run `python3 src/collect_data.py whole-matrix-compute -t zarr -t ts_py scaling/data/chr21_10_4.ts`. This will output the stats comparing the Zarr Python and Tensorstore Python implementations.

For simplicity, you may want to factor out the actual processing of the chunks. You can do this by replacing the function `count_genotypes_chunk` with a no-op (see the sketch at the end of this comment). These are the numbers I get on my (busy) machine when the two implementations are just iterating chunk-by-chunk and not doing any processing:
$ python3 src/collect_data.py whole-matrix-compute -t zarr -t ts_py scaling/data/chr21_10_4.ts
num_samples num_sites tool user_time sys_time wall_time storage
1 10000 863998 ts_py 45.11 9.73 34.475312 hdd
0 10000 863998 zarr 17.88 0.12 18.026047 hdd
The overall CPU time cost is over 2X here for Tensorstore, which seems unnecessary?
The benchmarking setup is a bit convoluted sorry, it's something that's evolved over time. Please do let me know if you hit any problems, I'd be more than happy to help! Thanks again for taking a look.
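For concreteness, by "no-op" I mean something along these lines - the real `count_genotypes_chunk` signature in `src/collect_data.py` may differ, this is just the idea:

```python
# Hypothetical sketch: stub out the per-chunk processing in src/collect_data.py
# so that only chunk iteration and decompression are measured.
def count_genotypes_chunk(*args, **kwargs):
    pass  # no-op: skip the allele counting entirely
```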
@jeromekelleher I'm not seeing the c++ tensorstore implementation slowness on my desktop:
num_samples num_sites tool user_time sys_time wall_time storage
1 10000 863998 ts_cpp 15.73 0.11 15.932257 hdd
2 10000 863998 ts_py 60.19 8.88 69.118334 hdd
0 10000 863998 zarr 19.83 0.05 19.896405 hdd
Will run the larger dataset overnight.
Both of those have a very similar structure where they issue a read and then wait via `.Wait()` or `.result()` after the read.
It might be faster if they were converted to async methods, though you might need a semaphore to limit memory use.
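For example, something along these lines might work - a sketch, not tested, which keeps a bounded window of outstanding reads rather than an explicit semaphore (same memory-limiting effect); tensorstore read futures can be awaited from asyncio:

```python
import asyncio
from collections import deque

async def process_chunks_async(arr, process_chunk, chunk_len, max_in_flight=4):
    """Keep up to max_in_flight chunk reads outstanding; process results in order."""
    num_variants = arr.shape[0]
    in_flight = deque()
    for start in range(0, num_variants, chunk_len):
        stop = min(start + chunk_len, num_variants)
        # .read() starts the read immediately and returns an awaitable future.
        in_flight.append(arr[start:stop].read())
        if len(in_flight) >= max_in_flight:
            process_chunk(await in_flight.popleft())
    while in_flight:
        process_chunk(await in_flight.popleft())

# usage: asyncio.run(process_chunks_async(arr, process_chunk, chunk_len))
```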
> @jeromekelleher I'm not seeing the c++ tensorstore implementation slowness on my desktop:
That's interesting - @Will-Tyler noted something similar. I'll rerun and see what we get.
Still not obvious why the Python version is using so much more CPU though. I see the argument for sync vs async, but I would have thought that this should only affect the overall wall-time rather than actual CPU cycles used (user time)?
`ts_py` was spending a lot of its time copying data around in numpy calls. Here's where I have got it to (will PR):
num_samples num_sites tool user_time sys_time wall_time storage
1 10000 863998 ts_cpp 15.93 0.07 16.093759 hdd
2 10000 863998 ts_py 20.65 0.13 20.871725 hdd
0 10000 863998 zarr 20.18 0.04 20.228303 hdd
That's super - exactly what we'd hope to see. And excellent that TensorStore c++ is outperforming the Python code!
Here's what I'm getting currently:
num_samples num_sites tool user_time sys_time wall_time storage
11 10000 863998 ts_cpp 44.01 0.44 44.289501 hdd
10 10000 863998 ts_py 75.32 13.66 54.940146 hdd
9 10000 863998 zarr 30.62 0.17 30.832301 hdd
It must be something to do with the setup on my server I guess?
Are you working off an SSD or spinning platters here @benjeffery?
> Are you working off an SSD or spinning platters here @benjeffery?
I've tested both and seen little difference. I've also just realised that you merged #164 yesterday afternoon - my previous python numbers were with code before that. :facepalm:
Here are the numbers with that included:
num_samples num_sites tool user_time sys_time wall_time storage
1 10000 863998 ts_cpp 16.01 0.09 16.189232 hdd
2 10000 863998 ts_py 23.96 0.09 24.145617 hdd
0 10000 863998 zarr 20.43 0.04 20.498896 hdd
Good to know. So it must be something specific to my setup - the machine and OS are quite old, so it could be that some important compile-time optimisations are missing or something.
Did you do anything funky with the Tensorstore build?
> Did you do anything funky with the Tensorstore build?

Nope, nothing at all - I just installed cmake and ran the build as it is configured in the repo.
I'm running:

- cmake version 3.30.2
- NASM version 2.15.05
- gcc version 11.4.0
Well this really is bizarre. After upgrading to the latest version of the Tensorstore Python module (which I'm not compiling - I'm using the distributed binaries) I get this:
num_samples num_sites tool user_time sys_time wall_time storage
2 10000 863998 ts_cpp 58.77 0.72 59.302641 hdd
1 10000 863998 ts_py 89.10 0.50 89.613749 hdd
0 10000 863998 zarr 31.43 0.09 31.563379 hdd
So, the Python tensorstore version here is nearly a full minute slower than the Zarr version!
Digging in with perf I get this for the Zarr python version:
and this for the Tensorstore Python version:
So in both cases, the majority of the time is spent in Blosc, as you'd expect, but one takes 3X the time of the other. The processor here is old (Xeon E5-2680) and that probably explains the disparity with @benjeffery's results.
I can think of two possibilities:
@benjeffery - could you run these through perf record also, so we can see what's being called under the hood on your machine?
It would be nice if we could just say "oh well it doesn't work on old machines :shrug: " but this is the hardware I have available for doing these benchmarks, and they need a lot of storage and CPU time. I'm not keen on moving them somewhere else, so if we want to include Tensorstore we'll have to figure out a way of solving the issue.
So it's possible that we're missing the `-msse2` flag when building blosc (it's a default option in blosc's CMake build). I'll try updating that and see if it helps.
Edit: Nah, `-msse2` should be the default on all x86_64 targets anyway.
Edit: I noticed that you set the concurrency limits pretty low. Was there a reason for this?
"data_copy_concurrency": {"limit": 1},
"file_io_concurrency": {"limit": 1},
https://google.github.io/tensorstore/kvstore/file/index.html
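For reference, a minimal sketch of how those context resources get passed through the Python API - the values here are just the ones quoted above (the defaults are based on the number of CPUs):

```python
import tensorstore as ts

# Context resources as quoted above; whether limiting them helps is exactly
# what was being tested.
context = ts.Context({
    "data_copy_concurrency": {"limit": 1},
    "file_io_concurrency": {"limit": 1},
})

arr = ts.open(
    {
        "driver": "zarr",
        "kvstore": {"driver": "file",
                    "path": "scaling/data/chr21_10_4.zarr/call_genotype"},
    },
    context=context,
).result()
```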
I haven't had a chance to look at your benchmarks and results in detail, but one thing to note is that the blosc codec in numcodecs (https://github.com/zarr-developers/numcodecs/blob/02d3ceec395481f3d84eb8f2f770e66aca6361b0/numcodecs/blosc.pyx#L478) used by zarr-python uses multi-threaded blosc under certain conditions that may apply in your benchmark. Tensorstore never uses multi-threaded blosc but instead can decode multiple chunks in parallel; in the case of your benchmark there is only one chunk at a time, though.
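One way to rule that variable out on the zarr-python side would be to pin numcodecs' blosc to a single thread before running the benchmark - a small sketch:

```python
import numcodecs.blosc

# Force single-threaded blosc decompression in zarr-python so the comparison
# with tensorstore's one-thread-per-chunk decoding is like-for-like.
numcodecs.blosc.use_threads = False
numcodecs.blosc.set_nthreads(1)
```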
Thanks for the input @laramiel and @jbms!
@laramiel
> Edit: I noticed that you set the concurrency limits pretty low. Was there a reason for this?
That was just me experimenting, trying to rule out the overhead of having background threads. As we're just working through the chunks synchronously, one at a time, there's nothing to gain from doing things in threads, and I was trying this to see if it made any difference. It doesn't.
@jbms
> zarr-python uses multi-threaded blosc under certain conditions
This should all be factored out by measuring user time - true, having threads in blosc could affect the wall time, but it shouldn't make any difference to the overall compute cost. Wall time and user time + sys time are basically the same here, so I don't think threads are having any effect.
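For concreteness, this is the kind of measurement I mean - a rough sketch, not the actual harness in `src/collect_data.py`. `getrusage` sums CPU time over all threads of the process, so work done on blosc or tensorstore worker threads still shows up in user time:

```python
import resource
import time

def measure(func, *args):
    """Return (user, sys, wall) seconds for func(*args)."""
    r0 = resource.getrusage(resource.RUSAGE_SELF)
    t0 = time.perf_counter()
    func(*args)
    wall = time.perf_counter() - t0
    r1 = resource.getrusage(resource.RUSAGE_SELF)
    # ru_utime/ru_stime cover every thread in the process, so multi-threaded
    # decompression is still counted here.
    return r1.ru_utime - r0.ru_utime, r1.ru_stime - r0.ru_stime, wall
```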
Here are my perf reports for Zarr and for Ts:
`__memmove_sse2_unaligned_erms` (only seen under tensorstore) is a larger fraction on the Xeon; that's expected, as ERMS was only introduced with Ivy Bridge and your Xeon is Sandy Bridge. That doesn't explain all the time, though, and it's interesting that zarr doesn't need to use it nearly as much.
Ah yes, interestingly you have `__memmove_avx_unaligned_erms` whereas I have `__memmove_sse2_unaligned_erms` on the old Xeon under Tensorstore.
My guess here is that Zarr is somehow handing Blosc memory that meets the stricter alignment requirements for SSE/AVX while Tensorstore is not, and so we don't see this `__memmove_XXX_unaligned_erms` call in Zarr. The newer generation of CPUs is better able to handle data that's not aligned.
It would be amazing if the penalty for not having aligned memory under SSE2 was this big, but I think it's the best guess right now?
I see there are some calls to `__memset_xxx` under Zarr, so perhaps this is a memset vs memmove thing?
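One quick way to test the alignment guess would be to check the data pointers of the arrays each implementation hands back - a sketch, where `chunk_zarr` and `chunk_ts` are placeholders for arrays read by zarr-python and tensorstore respectively:

```python
import numpy as np

def misalignment(a: np.ndarray, boundary: int = 32) -> int:
    """Byte offset of the array's data pointer modulo boundary (0 means aligned)."""
    return a.ctypes.data % boundary

# chunk_zarr / chunk_ts: placeholder arrays returned by the two implementations.
print("zarr:", misalignment(chunk_zarr), "tensorstore:", misalignment(chunk_ts))
```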
Interesting... When I run perf collecting the full call graph, it seems to be telling me that the memmove is happening when Tensorstore copies data out (in CopyReadChunk). I think there might be some room for improvement there @laramiel @jbms?
What I don't quite understand is how dissimilar your perf-report cycle-count ratio is to your timing ratio: 31s vs 89s, compared with 161,859,771,726 vs 189,617,929,134 cycles. The sample count looks consistent, so tensorstore must be waiting more? Here's the result of running `perf sched record -a --` followed by `perf sched latency`, which shows there isn't much switching between python and the ts_pool_worker:
It also shows the time is equally shared; I'd be interested to see this for the Xeon.
That is puzzling all right - 3X penalty just for unaligned memory seems huge.
I'm not able to run the sched events in perf right now and have to context switch, I'm afraid!
Anecdotally, I noticed the same problem when running on some Azure VMs last week. I ran the Python Tensorstore version and it was way slower than Zarr Python. I didn't have time to characterise it properly, I'm afraid, but it does indicate that this is more than just a problem with crusty old hardware, and likely something that people will encounter in the real world.
I am able to reproduce this locally per your instructions:
python3 src/collect_data.py whole-matrix-compute -t zarr -t ts_py scaling/data/chr21_10_4.ts
scaling/data/chr21_10_4.ts n=10000, m=863998
num_samples num_sites tool user_time sys_time wall_time storage
0 10000 863998 zarr 10.43 0.09 10.524919 hdd
num_samples num_sites tool user_time sys_time wall_time storage
1 10000 863998 ts_py 14.75 0.17 14.962039 hdd
0 10000 863998 zarr 10.43 0.09 10.524919 hdd
Running both the tensorstore and zarr-python benchmarks with google-perftools separately, the additional time with tensorstore seems to be as follows:
(a) blosc_decompress: 11.12 seconds (tensorstore) vs 8.15 seconds (zarr-python). This accounts for most of the difference (in blosc_internal_bshuf_shuffle_bit_eightelem_scal specifically). More investigation will be needed to determine the source of the discrepancy here. The two likely explanations that come to mind are:
(b) An additional ~1 second is spent by tensorstore on an unnecessary memset of the output buffer used by blosc. This is pure overhead and we should be able to fix this relatively easily.
(c) An additional ~1 second is spent by tensorstore on a memcpy to the destination array from the chunk cache. Currently tensorstore always decodes first to the chunk cache, but we do have some work in progress to bypass the chunk cache in cases like this where the chunk cache has a total_bytes_limit of 0 and the read covers the entire chunk (see the sketch below for where this setting lives).
The additional memset and memcpy have a particularly strong impact on this benchmark because of the very high compression ratio --- 100 to 1.
We will definitely look into these 3 causes, and try to get to the bottom of the blosc shuffle performance discrepancy, and remove the unnecessary memset and memcpy.
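For anyone following along, the chunk cache referred to in (c) is tensorstore's `cache_pool` context resource; a minimal sketch of where that setting lives (a `total_bytes_limit` of 0 is also the default, so the benchmark hits exactly this case):

```python
import tensorstore as ts

# cache_pool is the chunk cache mentioned in (c): with total_bytes_limit 0
# (the default) decoded chunks are not retained between reads, but each read
# currently still decodes into the cache and is then copied out.
context = ts.Context({"cache_pool": {"total_bytes_limit": 0}})
arr = ts.open(
    {
        "driver": "zarr",
        "kvstore": {"driver": "file",
                    "path": "scaling/data/chr21_10_4.zarr/call_genotype"},
    },
    context=context,
).result()
```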
I'll note that this benchmark is clearly useful for identifying sources of overhead like these, but because of the very high compression ratio (meaning the I/O is negligible) and the fact that it involves just reading a single complete chunk at a time, it essentially just tests chunk decoding and not any of the indexing or I/O functionality of the zarr implementation. If the benchmark matches the actual use case then that is the right thing to test, though.
Fantastic, thanks for looking into this @jbms! I would love to include the Tensorstore benchmarks in the paper because cross language support and the potential to integrate with existing C++ based tools is a key part of our story. If you can help us with the Tensorstore overhead here we'd be very happy to include you (and anyone else involved) as authors on this paper.
I think the benchmark is a useful one, in that it follows a very common access pattern for the data, and implements the core allele classification operation that will be used a great deal, e.g., in dataset quality control. Certainly the compression ratios are realistic, and what we observe for real data.
The C++ af-dist implementation added in #157 (and notes in #160) is surprisingly slow. In some informal tests it looks like it's about 3X slower than Python Zarr, just retrieving the chunks. There's no good reason for this, so I suspect we must be using Tensorstore badly.
To make it easier to experiment with Tensorstore configuration settings, I think it would be very helpful to have a Python implementation of the afdist operation using the Python tensorstore API. The Python API is properly documented, and the (unbelievably slow!) C++ compile times don't have to be endured.
CPU time should be dominated by decompression, so we really should expect to have performance parity with the Python Zarr approach.
@Will-Tyler - fancy taking this one on?