ratt-ru / CubiCal

A fast radio interferometric calibration suite.
GNU General Public License v2.0

madmax raises error in numba using cubical v1.5.0 #365

Closed. SpheMakh closed this issue 4 years ago

SpheMakh commented 4 years ago
# ERROR      - 17:17:49 - solver             [x02] [0.2/0.2 1.5/1.5 0.7Gb] Solver for tile 0 chunk D0T1F0 failed with exception: Failed in nopython mode pipeline (step: convert to parfors)
# 'NoneType' object is not iterable
# ERROR      - 17:17:49 - solver             [x01] [0.2/0.2 1.5/1.5 0.7Gb] Solver for tile 0 chunk D0T0F0 failed with exception: Failed in nopython mode pipeline (step: convert to parfors)
# 'NoneType' object is not iterable
# INFO       - 17:17:49 - solver             [x01] [0.2/0.2 1.5/1.5 0.7Gb] Traceback (most recent call last):
#   File "/usr/local/lib/python3.6/dist-packages/cubical/solver.py", line 857, in run_solver
#     corr_vis = solver_machine.run()
#   File "/usr/local/lib/python3.6/dist-packages/cubical/solver.py", line 680, in run
#     SolveOnly.run(self)
#   File "/usr/local/lib/python3.6/dist-packages/cubical/solver.py", line 664, in run
#     self.sol_opts, label=self.label)
#   File "/usr/local/lib/python3.6/dist-packages/cubical/solver.py", line 275, in _solve_gains
#     "{} iter {} ({})".format(label, num_iter, gm.jones_label)):
#   File "/usr/local/lib/python3.6/dist-packages/cubical/madmax/flagger.py", line 212, in beyond_thunderdome
#     mad, goodies = madmax.compute_mad_per_corr(absres, flags_arr, diag=self.mad_estimate_diag, offdiag=self.mad_estimate_offdiag)
#   File "/usr/local/lib/python3.6/dist-packages/cubical/kernels/madmax.py", line 155, in compute_mad_per_corr
#     mad_arr, mad_arr_fl, valid_arr = compute_mad_per_corr_internals(absres, flags, diag, offdiag)
#   File "/usr/local/lib/python3.6/dist-packages/numba/dispatcher.py", line 420, in _compile_for_args
#     raise e
#   File "/usr/local/lib/python3.6/dist-packages/numba/dispatcher.py", line 353, in _compile_for_args
#     return self.compile(tuple(argtypes))
#   File "/usr/local/lib/python3.6/dist-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
#     return func(*args, **kwargs)

Find the full log here

SpheMakh commented 4 years ago

@o-smirnov, the test is here:

https://github.com/ratt-ru/Stimela/blob/4d4c3ea306f65614ee4adbca3a2dd9a638f33ec3/stimela/tests/acceptance_tests/stimela-test-kat7.py#L423-452

bennahugo commented 4 years ago

Is madmax strictly necessary for a release though?

JSKenyon commented 4 years ago

This is a bit confusing @SpheMakh. In the test you linked madmax is not enabled, but in the log you sent it is. What is the intention?

JSKenyon commented 4 years ago

Ok, I can reproduce this locally. @o-smirnov https://github.com/ratt-ru/CubiCal/blob/ba23a6395196fa0bc750d4d07cc23bfd81644f5c/cubical/data_handler/ms_tile.py#L774

The link above is merely incidental; I found it while trying to reproduce the error. It is an incomplete call, and I suspect that the axis argument is incorrect.

The real issue is numba and the prange here: https://github.com/ratt-ru/CubiCal/blob/ba23a6395196fa0bc750d4d07cc23bfd81644f5c/cubical/kernels/madmax.py#L205

Looking at the function, I don't actually think it is thread-safe. If I cut out some of the dangerous-looking operations, it compiles again. I suggest we ditch the prange here for now.
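For readers following along, here is a rough sketch of what "ditching the prange" means in practice. It is not the CubiCal kernel: the function names, the 2-D array shapes and the validity test are all hypothetical, and the import fallback exists only so the snippet also runs where Numba is not installed. The parallel and serial versions share the same loop body; only the outer loop construct differs.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fallback so the sketch still runs as plain Python
    prange = range

    def njit(*args, **kwargs):
        # Support both the bare @njit and the @njit(...) forms.
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit(parallel=True)
def count_valid_parallel(absres, flags):
    """Hypothetical kernel: count unflagged, nonzero residuals per baseline."""
    n_bl = absres.shape[0]
    counts = np.zeros(n_bl, dtype=np.int64)
    for bl in prange(n_bl):  # iterations may run on different threads
        n_valid_vals = 0  # assigned inside the body, so local to the iteration
        for i in range(absres.shape[1]):
            if flags[bl, i] == 0 and absres[bl, i] != 0:
                n_valid_vals += 1
        counts[bl] = n_valid_vals  # each iteration writes only its own slot
    return counts


@njit
def count_valid_serial(absres, flags):
    """Same body with a plain range: the form used when prange is dropped."""
    n_bl = absres.shape[0]
    counts = np.zeros(n_bl, dtype=np.int64)
    for bl in range(n_bl):
        n_valid_vals = 0
        for i in range(absres.shape[1]):
            if flags[bl, i] == 0 and absres[bl, i] != 0:
                n_valid_vals += 1
        counts[bl] = n_valid_vals
    return counts
```

Both functions should return identical counts for the same input, so swapping one for the other changes only how the outer loop is scheduled, which is why the change is easy to revert.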

JSKenyon commented 4 years ago

#366 is a PR with possible fixes @o-smirnov.

JSKenyon commented 4 years ago

Anyone who is blocked by these errors: just avoid using --dist-nthread and rely on multiprocessing instead.

bennahugo commented 4 years ago

Can't really do that @JSKenyon. You essentially need to run single-process with MK 64 data.

JSKenyon commented 4 years ago

That should only be true in cases where you have absolutely monstrous chunk sizes. Bear in mind that the memory growth problem has been resolved. My advice above is really only in the interim, while I get a PR ready.

o-smirnov commented 4 years ago

@JSKenyon why do you think it's not thread-safe? Unless I'm missing something, the body of the loop is strictly per-baseline, so why should it not do a prange() over baselines?

(And how in the world did it compile before?)

> This is a bit confusing @SpheMakh. In the test you linked madmax is not enabled, but in the log you sent it is. What is the intention?

Sphe disabled it to let the test run through, the error arises when it's enabled.

> Is madmax strictly necessary for a release though?

Much like radio astronomy in general, it is not strictly necessary, just awfully nice to have.

JSKenyon commented 4 years ago

@o-smirnov There has just been a massive numba release - there are no guarantees that it hasn't gotten smarter/stricter/broken in some way. Removing the prange is an easy thing to revert once we pin the problem down (there also appear to be some issues in numba's parfor implementation at the moment, so we might just need to wait for a numba minor version).

JSKenyon commented 4 years ago

Although, as you say, on further inspection it should be safe. Hmmmmm. I am willing to chalk this up to a regression/teething issue with the new numba. Will try to isolate it a little further, though if it goes into numba internals this might be one which fixes itself in a few weeks.

JSKenyon commented 4 years ago

I have opened a bug report on the Numba repo. We will see what they say.

o-smirnov commented 4 years ago

Is your MWR the same? You've got n_valid_vals set up outside the prange -- whereas in the madmax code it's inside the prange.

o-smirnov commented 4 years ago

So here's a thought, what if we move the body of the loop out into a separate function (thus breaking scope w.r.t. all the other variables)? Maybe that will be enough to convince Numba?

JSKenyon commented 4 years ago

Yeah - that makes no difference, as it should become thread-local anyway (I just moved it out to try to simplify things as much as possible). The trigger is the if statement which increments n_valid_vals. That somehow causes the problem.

I have found a way around the problem by replacing n_valid_vals with a single-element array, but I do not like that solution. I would like to see what the Numba devs say before I commit to dancing around the issue.
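A minimal sketch of the kind of workaround described here, for illustration only. The function name, array shapes and validity test are hypothetical, and whether the actual workaround allocated the array inside or outside the loop is not stated in the thread; this version assumes it is allocated inside. The import fallback only keeps the snippet runnable without Numba.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fallback so the sketch still runs as plain Python
    prange = range

    def njit(*args, **kwargs):
        # Support both the bare @njit and the @njit(...) forms.
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit(parallel=True)
def count_valid_array_counter(absres, flags):
    """Hypothetical kernel using a one-element array as the counter."""
    n_bl = absres.shape[0]
    counts = np.zeros(n_bl, dtype=np.int64)
    for bl in prange(n_bl):
        # One-element array instead of a scalar: the conditional increment
        # becomes an in-place array update rather than a scalar rebinding.
        n_valid_vals = np.zeros(1, dtype=np.int64)
        for i in range(absres.shape[1]):
            if flags[bl, i] == 0 and absres[bl, i] != 0:
                n_valid_vals[0] += 1
        counts[bl] = n_valid_vals[0]
    return counts
```

The result is the same as with a scalar counter; the cost is an extra small allocation per iteration and less readable code, which is presumably part of why the workaround is unappealing.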

JSKenyon commented 4 years ago

Regarding your suggestion @o-smirnov, I am not sure that would resolve it but I can give it a go with my MWR.

o-smirnov commented 4 years ago

Well it's clearly a bug in Numba, so my instinct is to rearrange the deck chairs to see if it goes away...

JSKenyon commented 4 years ago

I have tried moving the internals to a sub-function, but that exposes even weirder behaviour. It works - exactly once. And then never runs again. So I am happy to wait for a response on the Numba issue.
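The refactor tried here has roughly the following shape. This is a sketch under assumptions, not the code from the thread: the names, the 2-D shapes and the validity test are hypothetical, and the import fallback only keeps the snippet runnable without Numba. On a Numba release without the bug, the helper and the prange loop should compile and run repeatedly.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fallback so the sketch still runs as plain Python
    prange = range

    def njit(*args, **kwargs):
        # Support both the bare @njit and the @njit(...) forms.
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit
def count_valid_one_baseline(absres_bl, flags_bl):
    """Loop body moved into its own function, so it shares no scope with the caller."""
    n_valid_vals = 0
    for i in range(absres_bl.shape[0]):
        if flags_bl[i] == 0 and absres_bl[i] != 0:
            n_valid_vals += 1
    return n_valid_vals


@njit(parallel=True)
def count_valid_subfunction(absres, flags):
    """The prange loop now only dispatches to the per-baseline helper."""
    n_bl = absres.shape[0]
    counts = np.zeros(n_bl, dtype=np.int64)
    for bl in prange(n_bl):
        counts[bl] = count_valid_one_baseline(absres[bl], flags[bl])
    return counts
```

Calling `count_valid_subfunction` more than once with the same inputs is a quick way to check for the "works exactly once" behaviour on a given Numba version.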

JSKenyon commented 4 years ago

It has been marked as a bug in the Numba thread. Will restore prange in the madmax functions once it is resolved.

JSKenyon commented 4 years ago

This is fixed on Numba master, so we should be able to revert this change after the next release.

JSKenyon commented 4 years ago

@SpheMakh @o-smirnov I have reverted my patch in #406 - this is working again in the latest numba.