ratt-ru / QuartiCal

CubiCal, but with greater power.
MIT License

[Help] Compute time doesn't make sense #306

Closed Joshuaalbert closed 8 months ago

Joshuaalbert commented 8 months ago

I have been timing runs with different chunk sizes and have observed the following:

MS1 has 10 time ints, and 1 channel:

MS2 has 20 time ints, and 64 channels:

MS3 has 1 time int, and 1 channel:

I was using dask.threads=32 and solver.threads=32, and setting time_chunk=time_interval and freq_chunk=freq_interval (I assume this means the chunks are solved sequentially). The question is: why do I get 0.9 s/iter in the first two cases, but 8.1 s/iter in the third?

JSKenyon commented 8 months ago

I believe this makes sense. Essentially, in the final case there are no axes over which QuartiCal knows how to parallelize. Have you read the following section of the docs? Essentially, it boils down to QuartiCal having two layers of parallelism. The first is at the chunk level - QuartiCal uses Dask to process chunks in parallel. In the third case, there is only a single chunk, so the Dask-based parallelism cannot function. The second level of parallelism is in the solvers themselves. It allows multiple threads to process solution intervals in parallel. In the third of the above cases there is only a single solution interval so the solvers cannot make use of this parallelism either. Consequently, the last case will effectively be the serial processing time.

One other thing I should point out is that values of dask.threads larger than the total number of chunks will not lead to additional parallelism. In general, if you are dealing with a small number of chunks, then setting solver.threads to a larger value will usually be better. Note that even that will have diminishing returns on very small chunks, i.e. those containing few unique times. QuartiCal will actually create fake solution intervals in time in order to better utilise the parallelism in the solvers, but at present it will never subdivide a single time, hence the problem you see in the third case.
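To make the two layers concrete, here is a toy back-of-envelope model. This is not QuartiCal's actual scheduler; the thread counts and per-iteration time below are made-up numbers, purely to illustrate how the two layers compose:

```python
import math

def est_wall_time(n_chunks, n_intervals_per_chunk, dask_threads,
                  solver_threads, iters_per_chunk, t_iter):
    """Toy model of the two levels of parallelism (illustration only).

    Layer 1: Dask processes chunks in parallel (up to dask_threads at once).
    Layer 2: within a chunk, solver threads process solution intervals
             in parallel (up to solver_threads at once).
    """
    # Time to solve one chunk: intervals are divided among solver threads.
    per_chunk = math.ceil(n_intervals_per_chunk / solver_threads) \
        * iters_per_chunk * t_iter
    # Chunks are divided among Dask threads.
    return math.ceil(n_chunks / dask_threads) * per_chunk

# 32 chunks across 32 Dask threads: all chunks solved concurrently.
print(est_wall_time(32, 1, 32, 1, 16, 0.5))   # -> 8.0
# Same 32 chunks on 1 Dask thread: fully serial at the chunk level.
print(est_wall_time(32, 1, 1, 1, 16, 0.5))    # -> 256.0
# 1 chunk, 1 solution interval: neither layer can help, no matter
# how many threads are available -> serial time.
print(est_wall_time(1, 1, 32, 32, 16, 0.5))   # -> 8.0
```

The last call shows the third case above: with a single chunk and a single solution interval, extra threads at either level change nothing.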

Please let me know if any of this is unclear - this is probably one of the more complicated/nuanced components of QuartiCal.

Joshuaalbert commented 8 months ago

Thank you very much for the prompt reply @JSKenyon! While it is sad to learn that the timings are correct, because it means 0.9 s/iter is already with solution-interval parallelism, it is good to confirm that QuartiCal is working as expected.

JSKenyon commented 8 months ago

Out of interest, how many antennas are you testing with? I am well aware that QuartiCal could be faster - much of the design is aimed at keeping memory footprints sane and scaling to distributed systems, sometimes at the cost of speed. The kernels could also be optimized to make better use of SIMD. I would be interested in hearing your feedback/opinions on the pros and cons of QuartiCal at present.

Joshuaalbert commented 8 months ago

We're working with 2048 antennas. I'll be writing the entire solver in XLA primitives and compiling it down to see how fast we can make it. I think it'll be faster, based on my past experience working with XLA. We'll also be incorporating a prior on the solutions to reduce the number of iterations required. I'd also like to ask you a few more questions about your experience writing the solver for QuartiCal, to help guide me. Maybe we could set up a call? I'm in the same timezone as you, so really anytime works.

Joshuaalbert commented 8 months ago

Also, I must be misunderstanding something, because I got these timings again:

MS1 has 10 time ints, and 1 channel:

With a solution chunk of 1x1 it did all 10 chunks in 154 seconds, averaging 16 iterations/chunk --> 15.4 s/chunk | 1.0 s/iter

MS3 has 1 time int, and 1 channel:

With a solution chunk of 1x1 it did the single chunk in 130 seconds, using 16 iterations --> 130 s/chunk | 8.1 s/iter

However I used these settings:

input_ms.time_chunk=1 <- 1x1 chunks per dask worker
input_ms.freq_chunk=1 <-/
dask.threads=1 <- only one worker, so only one chunk at a time
solver.threads=32 <- that one worker gets to use 32 threads (there are 32 CPUs on this machine)
solver.terms=["G"]
G.time_interval=1 <- one solution interval per chunk therefore also per worker
G.freq_interval=1 <-/

But this seems at odds with the timings: MS1 took 154 s for 10 chunks @ 16 iterations/chunk, vs MS3 at 130 s for 1 chunk @ 16 iterations/chunk. Did my parameters not do what I thought they would?

Joshuaalbert commented 8 months ago

Is there some constant overhead that both had to do? I.e. is it something like this:

total_time = num_chunks * num_iters * time_per_iter + constant_overhead

If I assume that then I get:

154s = 10 chunks * 16 iters * time_per_iter + constant_overhead
130s = 1 chunk * 16 iters * time_per_iter + constant_overhead

==>
time_per_iter ≈ 0.17 s / iter
constant_overhead ≈ 127 s
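Plugging the two measurements into that model gives a consistent solution. A quick check in Python:

```python
# Two-run timing model from the measurements above:
#   154 s = 10 chunks * 16 iters * t + c
#   130 s =  1 chunk  * 16 iters * t + c
# Subtracting the second equation from the first eliminates c.
t = (154 - 130) / ((10 - 1) * 16)  # time per iteration
c = 130 - 1 * 16 * t               # constant overhead
print(round(t, 2))  # -> 0.17
print(round(c, 1))  # -> 127.3
```
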
JSKenyon commented 8 months ago

But this seems at odds with the timing to do MS1: 154s for 10 chunks @ 16 iterations/chunk, vs MS3: 130s for 1 chunk @ 16 iterations/chunk. Did my parameters not do what I thought they should do?

Just wanted to check - was there only a single unique time in the MS3 case? If so, the solver threads will not be able to do anything. This seems loosely consistent with what you are seeing. Imagine both cases as having solver.threads=1. Then the 10 chunk case took roughly the same amount of time as the 1 chunk case - this makes sense, as solving 1 chunk with one thread should take roughly the same time as solving ten chunks with ten threads. I think the missing piece is the fact that the solver-level parallelism is over solution intervals or, in the limiting case, unique times. In this example, assuming there is only a single unique time per chunk, the solver-level parallelism cannot kick in. I guess I never really considered needing parallelism in the single-time, single-channel, single-solution-interval case.

Edit: Upon reflection, this may not be strictly true. It is also worth making sure that you are not seeing the numba compilation costs.

Joshuaalbert commented 8 months ago

I verified just now that there are 10 unique times in MS3.

Joshuaalbert commented 8 months ago

same as solving ten chunks with ten threads

But since dask.threads=1, wouldn't only one chunk be considered at a time?

JSKenyon commented 8 months ago

But since dask.threads=1 wouldn't only one chunk be considered at a time?

Yeah, I realised my error. Could you please do one very silly check for me? Please run each experiment twice in a row and let me know if the time changes dramatically between runs. That would point to something related to the caching of compiled code. Unfortunately, the JIT compilation is currently quite slow, and typically does take around 2 minutes (consistent with your estimate above).
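For anyone else hitting this: the run-it-twice check works because JIT compilation is paid once and then served from a cache on subsequent runs. A minimal stand-in sketch in plain Python, with lru_cache imitating a compilation cache (the function names and timings are hypothetical, not QuartiCal's):

```python
import time
from functools import lru_cache

# Stand-in for JIT compilation: an expensive step whose result is
# cached, so only the first call pays the cost. Illustration only.
@lru_cache(maxsize=None)
def compile_kernel(signature):
    time.sleep(0.2)          # pretend compilation takes a while
    return lambda x: x * 2   # the "compiled" kernel

def run_solver(signature, data):
    kernel = compile_kernel(signature)  # cache hit after the first call
    return kernel(data)

t0 = time.perf_counter()
run_solver("complex64", 21)
first = time.perf_counter() - t0

t0 = time.perf_counter()
run_solver("complex64", 21)
second = time.perf_counter() - t0

print(f"first run:  {first:.3f} s (includes 'compilation')")
print(f"second run: {second:.3f} s (cache hit)")
```

If the second real run is dramatically faster, the constant overhead was compilation rather than solving.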

Maybe we could set up a call? I'm in same timezone as you, so really anytime works.

Absolutely - when would be convenient for you?

Joshuaalbert commented 8 months ago

Would this afternoon work? Say around 14:00 SAST?

JSKenyon commented 8 months ago

Would this afternoon work? Say around 14:00 SAST?

Sure! I have two email addresses for you, one ending in .io and one ending in .nl. Which should I be using? It's typically easier to organize meetings via email, with less risk of disclosing personal data.

Joshuaalbert commented 8 months ago

The .nl one is good for this :)

Joshuaalbert commented 8 months ago

Is there some constant overhead that both had to do? I.e. is it something like this:

total_time = num_chunks * num_iters * time_per_iter + constant_overhead

If I assume that then I get:

154s = 10 chunks * 16 iters * time_per_iter + constant_overhead
130s = 1 chunk * 16 iters * time_per_iter + constant_overhead

==>
time_per_iter ≈ 0.17 s / iter
constant_overhead ≈ 127 s

@JSKenyon and I had a call, and we found that the constant overhead was JIT compilation, which disappears once the compiled code is cached. Resolved.