Closed Joshuaalbert closed 8 months ago
I believe this makes sense. Essentially, in the final case there are no axes over which QuartiCal knows how to parallelize. Have you read the following section of the docs? Essentially, it boils down to QuartiCal having two layers of parallelism. The first is at the chunk level - QuartiCal uses Dask to process chunks in parallel. In the third case, there is only a single chunk, so the Dask-based parallelism cannot function. The second level of parallelism is in the solvers themselves. It allows multiple threads to process solution intervals in parallel. In the third of the above cases there is only a single solution interval so the solvers cannot make use of this parallelism either. Consequently, the last case will effectively be the serial processing time.
One other thing I should point out is that values of `dask.threads` larger than the total number of chunks will not lead to additional parallelism. In general, if you are dealing with a small number of chunks, then setting `solver.threads` to a larger value will usually be better. It should be noted that even that will have diminishing returns on very small chunks, i.e. those containing few unique times. QuartiCal will actually create fake solution intervals in time in order to better utilise the parallelism in the solvers, but at present it will never subdivide a single time, hence the problem you see in the third case.
Please let me know if any of this is unclear - this is probably one of the more complicated/nuanced components of QuartiCal.
Thank you very much for the prompt reply @JSKenyon! While it is a little sad to learn that the timings are correct (it means 0.9s/iter is already with solution-interval parallelism), it is good to have it confirmed that QuartiCal is working as expected.
Out of interest, how many antennas are you testing with? I am well aware that QuartiCal could be faster - much of the design is aimed at keeping memory footprints sane/scaling to distributed systems, sometimes at the cost of speed. The kernels could also be further optimised to make better use of SIMD. I would be interested in hearing your feedback/opinions on the pros and cons of QuartiCal at present.
We're working with 2048 antennas. I'll be writing the entire solver in XLA primitives and compiling it down to see how fast we can make it. Based on my past experience working with XLA, I think it will be faster. We'll also be incorporating a prior on the solutions to reduce the number of iterations required. I would also like to ask you a few more questions about your experience experimenting while writing the solver for QuartiCal, to help guide me. Maybe we could set up a call? I'm in the same timezone as you, so really any time works.
Also, I must not understand things because I got this timing again:
MS1 has 10 time ints, and 1 channel:
With a solution chunk of 1x1 it did all 10 chunks in 154 seconds, and used on average 16 iterations / chunk --> 15.4s / chunk | 1.0s / iter
MS3 has 1 time int, and 1 channel:
With a solution chunk of 1x1 it does the single chunk in 130 seconds, and used 16 iterations --> 130s / chunk | 8.1s / iter
However I used these settings:

```
input_ms.time_chunk=1  <- 1x1 chunks per dask worker
input_ms.freq_chunk=1  <-/
dask.threads=1         <- only one worker, so only one chunk at a time
solver.threads=32      <- that one worker gets to use 32 threads (there are 32 CPUs on this machine)
solver.terms=["G"]
G.time_interval=1      <- one solution interval per chunk, therefore also per worker
G.freq_interval=1      <-/
```
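For MS1's shape (10 times, 1 channel), a quick back-of-envelope check of what those settings imply (the variable names below are mine, not QuartiCal options):

```python
import math

# MS1's shape and the settings above (names are illustrative).
n_times, n_chans = 10, 1
time_chunk, freq_chunk = 1, 1
time_interval, freq_interval = 1, 1
dask_threads, solver_threads = 1, 32

# Chunking the data on the time and frequency axes.
n_chunks = math.ceil(n_times / time_chunk) * math.ceil(n_chans / freq_chunk)
# Solution intervals inside each chunk.
sol_ints_per_chunk = (math.ceil(time_chunk / time_interval)
                      * math.ceil(freq_chunk / freq_interval))

print(n_chunks)                                  # 10 chunks in total
print(min(dask_threads, n_chunks))               # 1 chunk in flight at a time
print(min(solver_threads, sol_ints_per_chunk))   # 1 usable solver thread
```

Under these assumptions neither level of parallelism has any work units to spread out, so the 32 solver threads go unused.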
But this seems at odds with the timing to do MS1: 154s for 10 chunks @ 16 iterations/chunk, vs MS3: 130s for 1 chunk @ 16 iterations/chunk. Did my parameters not do what I thought they should do?
Is there some constant overhead that both had to do? I.e. is it something like this:

```
total_time = num_chunks * num_iters * time_per_iter + constant_overhead
```

If I assume that then I get:

```
154s = 10 chunks * 16 iters * time_per_iter + constant_overhead
130s =  1 chunk  * 16 iters * time_per_iter + constant_overhead
==>
time_per_iter     = 0.16s / iter
constant_overhead = 127s
```
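That two-point fit can be checked by subtracting the two equations to eliminate the overhead term (pure arithmetic, nothing QuartiCal-specific):

```python
# total_time = n_chunks * n_iters * t_iter + overhead
# MS1: 154 = 10 * 16 * t_iter + overhead = 160 * t_iter + overhead
# MS3: 130 =  1 * 16 * t_iter + overhead =  16 * t_iter + overhead
t_iter = (154 - 130) / (160 - 16)  # subtraction eliminates the overhead
overhead = 130 - 16 * t_iter

print(round(t_iter, 2))    # 0.17 (0.16 above is the truncated value)
print(round(overhead, 1))  # 127.3
```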
Just wanted to check - was there only a single unique time in the MS3 case? If so, the solver threads will not be able to do anything. This seems loosely consistent with what you are seeing. Imagine both cases as having `solver.threads=1`. Then the 10 chunk case took roughly the same amount of time as the 1 chunk case - this makes sense as solving 1 chunk with one thread should be roughly the same as solving ten chunks with ten threads. I think the missing piece is the fact that the solver-level parallelism is over solution interval or, in the limiting case, unique time. In this example, assuming there is only a single unique time per chunk, the solver-level parallelism cannot kick in. I guess I never really considered needing parallelism in the single time, single channel, single solution interval case.
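In other words (a toy model, not QuartiCal's actual logic), the usable solver-level parallelism is capped by the number of unique times in a chunk:

```python
def effective_solver_parallelism(solver_threads, n_unique_times):
    # QuartiCal can subdivide solution intervals down to one per unique
    # time, but never within a single time, so the number of unique
    # times caps how many threads can do useful work.
    return min(solver_threads, n_unique_times)

print(effective_solver_parallelism(32, 10))  # 10 threads usable
print(effective_solver_parallelism(32, 1))   # 1 - effectively serial
```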
Edit: Upon reflection, this may not be strictly true. It is also worth making sure that you are not seeing the numba compilation costs.
I verified just now that there are 10 unique times in MS3
> same as solving ten chunks with ten threads

But since `dask.threads=1`, wouldn't only one chunk be considered at a time?
Yeah, I realised my error. Could you please do one very silly check for me? Please run each experiment twice in a row and let me know if the time changes dramatically between runs. That would point to something related to the caching of compiled code. Unfortunately, the JIT compilation is currently quite slow, and typically does take around 2 minutes (consistent with your estimate above).
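A minimal harness for that double-run check might look like the following sketch. Here `fake_solve` is a stand-in for the real QuartiCal invocation, with a short sleep simulating the one-off JIT compilation cost that a warm cache would skip:

```python
import time

def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stand-in for a solve whose first call pays a one-off compilation
# cost, mimicking an on-disk JIT cache (illustrative only).
_cache = set()
def fake_solve(term):
    if term not in _cache:
        time.sleep(0.2)   # "compilation" - around 2 minutes in the real case
        _cache.add(term)
    time.sleep(0.01)      # the per-run solve cost

_, cold = timed(fake_solve, "G")
_, warm = timed(fake_solve, "G")
print(cold > 5 * warm)  # True: the warm run is much faster
```

If the real experiment shows the same cold/warm asymmetry, the ~127s constant overhead estimated above is compilation rather than solving.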
> Maybe we could set up a call? I'm in same timezone as you, so really anytime works.
Absolutely - when would be convenient for you?
Would today's afternoon work? Say around 14:00 SAT?
Sure! I have two email addresses for you, one ending in .io and one ending in .nl. Which should I be using? It is typically easier to organise meetings via email, with less risk of disclosing personal data.
The .nl one is good for this :)
> Is there some constant overhead that both had to do? I.e. is it something like this:
>
> total_time = num_chunks * num_iters * time_per_iter + constant_overhead
>
> If I assume that then I get:
>
> 154s = 10 chunks * 16 iters * time_per_iter + constant_overhead
> 130s = 1 chunk * 16 iters * time_per_iter + constant_overhead
> ==>
> time_per_iter = 0.16s / iter
> constant_overhead = 127s
@JSKenyon and I called, and we found that the caching of compiled code was causing this overhead. Resolved.
**Describe the problem that the feature should address**

I have been timing runs with different chunk sizes and have observed:
| MS  | time ints | channels | time / chunk | time / iter |
|-----|-----------|----------|--------------|-------------|
| MS1 | 10        | 1        | 14.5s        | 0.9s        |
| MS2 | 20        | 64       | 150s         | 0.9s        |
| MS3 | 1         | 1        | 130s         | 8.1s        |
I was using `dask.threads=32` and `solver.threads=32`, and setting `time_chunk=time_interval` and `freq_chunk=freq_interval` (I assume this means we only solve the chunks sequentially). The question is why do I get 0.9s/iter in the first two cases, and 8.1s/iter in the third case?