Closed. SpheMakh closed this issue 4 years ago.
Is madmax strictly necessary for a release though?
This is a bit confusing @SpheMakh. In the test you linked madmax is not enabled, but in the log you sent it is. What is the intention?
Ok, I can reproduce this locally. @o-smirnov https://github.com/ratt-ru/CubiCal/blob/ba23a6395196fa0bc750d4d07cc23bfd81644f5c/cubical/data_handler/ms_tile.py#L774
This one is merely incidental; I found it while trying to reproduce. It's an incomplete call, and I suspect the axis argument is incorrect.
The real issue is Numba and the prange here:
https://github.com/ratt-ru/CubiCal/blob/ba23a6395196fa0bc750d4d07cc23bfd81644f5c/cubical/kernels/madmax.py#L205
Looking at the function, I don't actually think it is thread-safe. If I cut out some of the dangerous-looking operations, it compiles again. I suggest we ditch the prange here for now.
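For anyone following along, the shape of the kernel in question is a parallel loop over baselines in which each iteration is meant to touch only its own slice of the output. Here's a plain-Python analogue of that pattern (hypothetical names, not the actual madmax kernel, and threads standing in for Numba's prange) showing why a strictly per-baseline body should be safe to parallelise:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def process_baselines(data):
    """Toy analogue of a per-baseline kernel. Each baseline's result is
    written to a disjoint slot of `out`, so iterations share no state;
    this is the property a prange over baselines relies on."""
    n_bl = data.shape[0]
    out = np.zeros(n_bl)

    def body(bl):
        # Reads only data[bl] and writes only out[bl]: no shared writes.
        out[bl] = np.abs(data[bl]).max()

    with ThreadPoolExecutor() as pool:
        # Consume the iterator to force all iterations to run.
        list(pool.map(body, range(n_bl)))
    return out
```

The point is that, absent any cross-baseline accumulation, the loop body really is embarrassingly parallel; the complication in the real kernel comes from the scalar counter discussed below in the thread.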
Anyone who is being blocked by these errors: just avoid using --dist-nthread and rely on multiprocessing instead.
Can't really do that, @JSKenyon. You essentially need to run single-process with MeerKAT-64 data.
That should only be true in cases where you have absolutely monstrous chunk sizes. Bear in mind that the memory growth problem has been resolved. My advice above is really only in the interim, while I get a PR ready.
@JSKenyon why do you think it's not thread-safe? Unless I'm missing something, the body of the loop is strictly per-baseline, so why should it not do a prange() over baselines?
(And how in the world did it compile before?)
> This is a bit confusing @SpheMakh. In the test you linked madmax is not enabled, but in the log you sent it is. What is the intention?
Sphe disabled it to let the test run through, the error arises when it's enabled.
> Is madmax strictly necessary for a release though?
Much like radio astronomy in general, it is not strictly necessary, just awfully nice to have.
@o-smirnov There has just been a massive numba release - there are no guarantees that it hasn't gotten smarter/stricter/broken in some way. Removing the prange is an easy thing to revert when we pin it down (there also appear to be some issues in numba's parfor implementation at the moment, so we might just need to wait for a numba minor version.)
Although, as you say, on further inspection it should be safe. Hmmmmm. I am willing to chalk this up to a regression/teething issue with the new Numba. Will try to isolate it a little further, though if it goes into Numba internals this might be one that fixes itself in a few weeks.
I have opened a bug report on the Numba repo. We will see what they say.
Is your MWR the same? You've got n_valid_vals set up outside the prange -- whereas in the madmax code it's inside the prange.
So here's a thought, what if we move the body of the loop out into a separate function (thus breaking scope w.r.t. all the other variables)? Maybe that will be enough to convince Numba?
Yeah - that makes no difference, as it should become thread-local anyway (I just moved it out to simplify things as much as possible). The trigger is the if statement which increments n_valid_vals. That somehow causes the problem.
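To make the trigger concrete, the troublesome shape is a scalar counter that is conditionally incremented inside the loop. A minimal sketch (plain Python with made-up names; in the real kernel this sits inside an @njit(parallel=True) function with the loop as a prange, where Numba has to recognise n_valid_vals as a reduction variable):

```python
import numpy as np

def count_valid(values, threshold):
    """Minimal sketch of the pattern that fails to compile: a scalar
    that is incremented only when a condition holds. Under Numba's
    parallel compilation, n_valid_vals must be handled as a reduction
    across prange iterations; the report is that the new release
    chokes on exactly this conditional-increment shape."""
    n_valid_vals = 0
    for i in range(values.shape[0]):  # prange(...) in the real kernel
        if values[i] < threshold:
            n_valid_vals += 1
    return n_valid_vals
```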
I have found a way around the problem by replacing n_valid_vals with a single-element array, but I do not like that solution. I would like to see what the Numba devs say before I commit to dancing around the issue.
Regarding your suggestion @o-smirnov, I am not sure that would resolve it but I can give it a go with my MWR.
Well it's clearly a bug in Numba, so my instinct is to rearrange the deck chairs to see if it goes away...
I have tried moving the internals out into a sub-function, but that exposes even weirder behaviour: it works exactly once, and then never runs again. So I am happy to wait for a response on the Numba issue.
It has been marked as a bug in the Numba thread. Will restore prange in the madmax functions once it is resolved.
This is fixed on Numba master, so we should be able to revert this change after the next release.
@SpheMakh @o-smirnov I have reverted my patch in #406 - this is working again in the latest numba.
Find the full log here