ratt-ru / xova

dask-ms/codex-africanus MS averaging tool
Other
0 stars 2 forks source link

BDA: process two columns at once #17

Open o-smirnov opened 4 years ago

o-smirnov commented 4 years ago

@JSKenyon asked me for a reasonable MODEL_DATA column. I can generate a BDAd model in a separate MS with a separate run of xova bda, but then he has to copy that across into the original BDAd data MS somehow.

Is there an easy fix to run the averaging on more than one column at a time?

o-smirnov commented 4 years ago

@bennahugo wrong issue, wrong repo... wrong organization, even! ;)

bennahugo commented 4 years ago

Woops sorry about that this is why you should not look at github issues via mail at that point in the morning!

o-smirnov commented 4 years ago

Valuable advice though! (Including the misdirected comment...)

sjperkins commented 4 years ago

@JSKenyon asked me for a reasonable MODEL_DATA column. I can generate a BDAd model in a separate MS with a separate run of xova bda, but then he has to copy that across into the original BDAd data MS somehow.

Is there an easy fix to run the averaging on more than one column at a time?

@o-smirnov The short of it is the codex dask/numba layer currently averages a single visibility array. I have considered averaging multiple vis arrays -- it'll require some extra numba + dask magic (support a tuple of vis arrays as an arg, vs a single vis array).

JSKenyon commented 4 years ago

At least that is one mystery solved - I was very confused when I read your reply @bennahugo!

sjperkins commented 4 years ago

I'd say it's not a quick quick fix -- mediumish, but something that has been on my radar (along with lexicographical sort by MSv2 ID columns)

o-smirnov commented 4 years ago

It's less than critical functionality so let's look for easy ways. How about you add an option to insert a new output column into an existing MS instead, would that be easier?

Or is it easier to just write a bit of code that copies odd-shaped columns across?..

JSKenyon commented 4 years ago

I am a little confused though - you already average multiple columns @sjperkins. At least I assume so, as you produce an MS with multiple columns. The MODEL_DATA column will be averaged identically to the DATA column (at least in the sane world where they have the same dimensions).

JSKenyon commented 4 years ago

It's less than critical functionality so let's look for easy ways. How about you add an option to insert a new output column into an existing MS instead, would that be easier?

Or is it easier to just write a bit of code that copies odd-shaped columns across?..

Do you have a Tigger sky model for the data? I could maybe try predict on the fly. No idea if it will work but worth trying.

JSKenyon commented 4 years ago

Also worth testing - at some vague point in the future - the repercussions of using an averaged model vs. repredicting at the BDA coordinates.

sjperkins commented 4 years ago

Also worth testing - at some vague point in the future - the repercussions of using an averaged model vs. repredicting at the BDA coordinates.

Yes, very interested in this, /cc @paoloserra

o-smirnov commented 4 years ago

No, I only have WSCLEAN images at the moment, no LSMs. I'd like to give you the best possible model for now (obtained through selfcal without BDA), so that you really test the quality of calibration first, and not the model...

The issue is that WSCLEAN can't work with a BDAd MS for now, so I must predict this image at full res, then average. The averaging is now in progress. I'll give you a separate MS with the BDAd model, you'll have to copy the column across.

Then of course there is also the model that we get from DDFacet. We could predict that to a BDAd MS. Now that really starts testing the whole BDA+DDF+selfcal loop. Which is too many variables to start with, IMHO.

JSKenyon commented 4 years ago

Yeah - that sounds like too many points of failure. I will be happy with just producing reasonable solutions. As I mentioned, still have some changes to make but I think that internals of the solver are "correct". Put it this way - if I set both the model to be the DATA column, I get identity gains out. Not very meaningful, but at least suggests that something is going right.

sjperkins commented 4 years ago

@o-smirnov @JSKenyon Can you hold off till end of day? I might be able to do something with a day of "meetings" ;-)

JSKenyon commented 4 years ago

Oh no worries - copying columns over etc. is fine for now anyway. Think it is just a feature that should exist eventually but it doesn't have to be immediate.

bennahugo commented 4 years ago

I also think there are too many points of failure in re-predicting. We still have to verify the astrometry of the whole thing here, ie. is TIME_CENTROID correct enough to get good uvw coordinates via fixvis first. Once we established that the widefield astrometry is correct one can start thinking about getting the forward predict going for such a dataset. I'm busy trying to establish this.

sjperkins commented 4 years ago

I am a little confused though - you already average multiple columns @sjperkins. At least I assume so, as you produce an MS with multiple columns. The MODEL_DATA column will be averaged identically to the DATA column (at least in the sane world where they have the same dimensions).

I do? Then I've managed to surprise myself. It should only average one column. Maybe the others are zeroed....

JSKenyon commented 4 years ago

I guess I didn't check the contents. :-) But I guess I also just use conventional columns which you may handle differently.

sjperkins commented 4 years ago

Also worth testing - at some vague point in the future - the repercussions of using an averaged model vs. repredicting at the BDA coordinates.

Yes, very interested in this, /cc @paoloserra

Although I guess averaging may be less useful in the spectral line scenario.

sjperkins commented 4 years ago

Currently, xova averages the column specified in the --data--column option (default CORRECTED_DATA) into DATA in the new MS.

I've added functionality in codex to handle averaging mulltple columns at once: https://github.com/ska-sa/codex-africanus/pull/173/commits/f2ddbcd9a8c0622b743852020cb131e3495c7797

I want to propose that the --data-column changes to "DATA, MODEL_DATA, CORRECTED_DATA" and that each column listed is averaged directly into it's counterpart in the output.

Are there any objections to the above change?

JSKenyon commented 4 years ago

I have no objections. Are you using argparse? If so you can use nargs=+ which will make the input a list. That shouldn't break existing code, but will allow for --data-column DATA MODEL_DATA CORRECTED_DATA to be specified on the command line (as a space separated list in this case).

o-smirnov commented 4 years ago

May I suggest an --output-column option too, to override the default mapping of a column to its counterpart.

Reason being, it is a common and convenient use case to take a 1GC MS, and average it down, with CORRECTED_DATA becoming DATA in the output MS -- which can then go straight to selfcal.

o-smirnov commented 4 years ago

...or, perhaps more succinctly, allow for --data-column DATA:CORRECTED_DATA. The colon and the second name being optional.

sjperkins commented 4 years ago

...or, perhaps more succinctly, allow for --data-column DATA:CORRECTED_DATA. The colon and the second name being optional.

I like this version. Is our command line mini-language or conventions formalised anywhere? It would be nice to standardise it across applications. I'd imagine shadems and quartical are currently the main drivers but it's now being requested in applications like xova and tricolour.

o-smirnov commented 4 years ago

I don't think this is really formalized, but colon is the obvious and sensible choice -- it's not arithmetic, and it's not a list separator, and it's Pythonic in the sense that it has the same meaning as in a dict definition...