lofar-astron / factor

Facet calibration for LOFAR
http://www.astron.nl/citt/facet-doc
GNU General Public License v2.0

Merge selfcal parmdbs fails. #223

Closed tikk3r closed 5 years ago

tikk3r commented 5 years ago

Not really a Factor issue I think, but now I get a crash at the convert_merged_selfcal_parmdbs step. I've investigated a bit, and when trying to run the script outside the pipeline I found that the getValuesGrid function fails on the parmdb (I tested on <name>.merge_phase_parmdbs) with this error:

import lofar.parmdb as lp

pdb = lp.parmdb('<long name>_chunk11_127A48492t_0g.merge_phase_parmdbs')
pdb.getValuesGrid('CommonScalarPhase:CS001HBA1')  # for example

RuntimeError: Assertion: itsUpper[i-1] <= itsLower[i] || casa::near(itsUpper[i-1],itsLower[i])
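
For reference, the assertion is about the ordering of the solution grid intervals: assuming each solution interval spans times[i] ± timewidths[i]/2, the upper edge of one interval must not pass the lower edge of the next. A minimal numpy sketch of that same check (an illustration only, not the actual parmdb code):

import numpy as np

def intervals_overlap(times, timewidths):
    # Mirror the itsUpper[i-1] <= itsLower[i] assertion quoted above;
    # np.isclose stands in for casa::near.
    times = np.asarray(times, dtype=float)
    widths = np.asarray(timewidths, dtype=float)
    lower = times - widths / 2.0
    upper = times + widths / 2.0
    bad = (upper[:-1] > lower[1:]) & ~np.isclose(upper[:-1], lower[1:])
    return bool(bad.any())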

I saw another ticket from last year with the same error message, but didn't see a solution: https://github.com/revoltek/losoto/issues/21

@rvweeren @tammojan do you maybe remember/know what's causing this or how to solve it?

tammojan commented 5 years ago

This is because a solution appears multiple times in a parmdb. Was there an old parmdb before you re-ran?
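
A quick way to look for that (a rough sketch assuming the lofar.parmdb Python bindings; the path is a placeholder, and it only works on a parmdb that getValuesGrid can still read, e.g. the individual input parmdbs) is to check for repeated time stamps within a single parameter:

import numpy as np
import lofar.parmdb as lp

pdb = lp.parmdb('<some input parmdb>')  # placeholder path
for name in pdb.getNames('CommonScalarPhase:*'):
    times = pdb.getValuesGrid(name)[name]['times']
    if np.unique(times).size != len(times):
        print('solution stored more than once for', name)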

tikk3r commented 5 years ago

Not that I'm aware of. I'll try removing them all and restarting, then.

tikk3r commented 5 years ago

The error is still there. I've also tried a different operation (outlierpeel), but that gives the same error, even if I remove all parmdbs in the results directory and start again.

darafferty commented 5 years ago

Can you send me (or point me to) the *.merge_phase_parmdbs and *.smooth_amp2 parmdbs? I will try to debug the conversion script.

tikk3r commented 5 years ago

Here they are: https://surfdrive.surf.nl/files/index.php/s/ROHM24MYeG9znID

Parmdbplot also fails with the same error on the *.merge_phase_parmdbs. I can plot the *.smooth_amp2 parmdb just fine, so it looks like something's up with the former.

darafferty commented 5 years ago

As you say, reading the solutions from *.merge_phase_parmdbs is where the problem occurs. Strangely, this parmdb was used in the previous step (merge_selfcal_parmdbs)! So maybe something went wrong there. In fact, this step uses shutil.rmtree(), which gave you problems before, so I wonder if it could be causing problems here too. Can you try changing the shutil.rmtree() calls in scripts/merge_parmdbs_selfcal.py to os.system() calls (as in issue #222) and rerunning?
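
To be concrete, the swap meant here is just replacing each shutil.rmtree(path) call with a shell 'rm -rf' (sketch below; the helper function is made up, the actual scripts would call it inline):

import os

def remove_tree(path):
    # equivalent of shutil.rmtree(path), using the issue #222 workaround
    os.system('rm -rf {0}'.format(path))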

tikk3r commented 5 years ago

I have deleted all parmdbs in the directory and replaced all occurrences of shutil.rmtree() with their os.system() equivalent. Then I restarted it just after the shift_cal step. Unfortunately the problem is still there.

tikk3r commented 5 years ago

After testing manually with all the *.mssort_into_Groups/instrument parmdbs going in and the merge_parmdbs_in_time.py script, I found that some parmdbs work and others don't. I can plot each one separately just fine, without any errors. Of the 12 chunks, chunks 0, 3, 4, 5, 8, 10 and 11 can be combined; including any of chunks 1, 2, 6, 7, or 9 gives the error.

Checking the lengths (e.g. len(pd.getValuesGrid('*')['CommonScalarPhase:CS004HBA1']['values'])), they are not all of equal length, but neither are those of the chunks that do work; chunk 11 in particular is almost twice as long as the others. Could it be some unlucky split in time that makes them overlap, or something like that? If you are interested in checking, I uploaded the parmdbs from each chunk here.
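
One way to test that idea (a sketch assuming the lofar.parmdb bindings; the glob pattern and parameter name are just examples to adapt) is to print the time range each chunk's instrument parmdb covers and compare neighbouring chunks:

import glob
import lofar.parmdb as lp

name = 'CommonScalarPhase:CS004HBA1'  # example parameter
for path in sorted(glob.glob('*chunk*/instrument')):  # adapt to the actual chunk parmdb layout
    grid = lp.parmdb(path).getValuesGrid(name)[name]
    t, w = grid['times'], grid['timewidths']
    # start of first interval, end of last interval, number of solutions
    print(path, t[0] - w[0] / 2.0, t[-1] + w[-1] / 2.0, len(t))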

darafferty commented 5 years ago

Thanks for investigating this -- I was just about to ask for these parmdbs. It's a bit of a mystery to me why some work and some don't, as I haven't seen any problems with the chunking before.

tikk3r commented 5 years ago

I have just tried imaging a different facet first by rearranging the order in the factor_directions.txt file, and that one is now at the prepare_imaging_data step without any signs of errors during merging... Could it be something odd with the facet itself in the crashing case?

darafferty commented 5 years ago

Hmm -- could be. The chunks are the same for every direction, but the solution intervals can differ. I am still investigating...

darafferty commented 5 years ago

It seems that the widths (in time) of some of the solutions did not match the spacing between solutions. I'm not sure how this happened (or why it didn't happen for all directions), but I put in a check for such a case that will hopefully prevent the above error. Can you get the latest master and try again?
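
The kind of guard described above might look roughly like this (an illustrative sketch, not the actual commit): clip each solution's time width so it can never exceed the spacing to the next solution.

import numpy as np

def clip_timewidths(times, timewidths):
    # A width larger than the gap to the next solution makes intervals overlap,
    # which is exactly what the failing assertion rejects.
    times = np.asarray(times, dtype=float)
    widths = np.asarray(timewidths, dtype=float).copy()
    widths[:-1] = np.minimum(widths[:-1], np.diff(times))
    return widths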

tikk3r commented 5 years ago

This has fixed the problem. It now continues past this step. Thanks for your help!