Closed AntoniaR closed 9 years ago
A summary of our discussion on HipChat (please correct me if I have misunderstood), some comments and a proposed route to resolve this issue in short and long term...
Summary: A first cut on sources for association using a radius equal to bmaj is far too small. Causing major issues in a known dataset but also leads to more general statistical and extended source problems.
Cause of issue: In the source association code, we first select a region with a radius equal to the restoring beam FWHM. Then sources within this region are associated. This significantly speeds up processing as should reduce the number of many-to-many. Without sub-selecting a region, source association would be extremely slow and hence we do need to do this. This code has worked fine for high quality LOFAR datasets and we are likely missing the few examples where it is failing.
Specific problem encountered: I am now processing a significantly larger dataset which has known ionospheric issues. The systematic position offset is deliberately designed to help counter position uncertainty issues. So I have been using an offset which is consistent with the ionospheric effects. However, the known ionospheric and calibration shifts can be up to 100" for some observations and sources (typically much less but I need it to work all the time). In the example I posted, the two sources were separated by 6 beam radii. Additionally, I had deliberately cranked up the systematic position offset to attempt to force source associations. This was a deliberately extreme example as these are easiest to spot and diagnose when problems occur in a massive dataset. As stated above, the upper range of the known offsets can be of the order ~100" which is equivalent to ~2 beam FWHM (bmaj = bmin = 0.015 deg or 54" or ~1 pixel). Hence, I typically use a systematic position uncertainty of ~3 pixels or 168". Offsets of >54" are rare but certainly not uncommon, which means that when you are doing ~20 million source associations you will exceed the bmaj cut reasonably frequently. This means that there are clear source association problems throughout the dataset as this is causing the vast majority of >3000 new sources identified and is giving 'drop-outs' in known stable source light curves where TraP thinks the source has disappeared and does a force fit. The 'drop-outs' are causing major issues in variability searches while having >3000 new sources is playing major havoc with the transient searches (as they are high sigma detections).
Statistics concerns: When we are using the de Ruiter radius, we specify a threshold which directly corresponds to the number of missed associations. I have chosen a de Ruiter radius of 5.68, corresponding to a missed association rate of 10^-7. This means that in my ~20 million source associations, statistically I want to only miss ~1. However, as the position errors can often be in excess of 1 x bmaj (especially for resolved or extended sources - see next concern), sources with a de Ruiter radius < 5.68 are still not going to be associated and hence we cannot say that we are only missing 10^-7 associations - we will be missing far more and we are only doing a very basic source association procedure that is almost making the, much smarter, de Ruiter radius method redundant.
Extended source concerns: By their very nature, extended sources have sizes > bmaj. This is fine even when force fitting as the additional uncertainty regarding the shape is folded into the positional uncertainties. If the extended source is resolved at a later date, we want the resolved components to be associated with the original extended source. Using only the de Ruiter radius, these sources would be correctly associated via a 1-to-many association - taking into account the larger positional uncertainties for an extended source. However, we are not doing this. In the current code we will only associate sources within bmaj, so the resolved components are not associated with the original source. Instead, the known extended source is 'undetected' and the resolved components are labelled as completely new, unassociated sources.
Proposed Solution - Short term: Implement a bug fix which increases the first cut to a fixed and hardcoded integer multiple of bmaj and allow the de Ruiter radius to more finely control source association. The only disadvantage I can see with this option is that source association will take longer (though this is not the section slowing TraP down). This would require an extremely minor change to existing TraP code (here: https://github.com/transientskp/tkp/blob/release2/tkp/db/associations.py#L551 ). I would suggest using 5x bmaj, but am open to suggestions about the value of this integer. I would like to see this hard coded as a matter of urgency as I have a dataset that needs processing within the next month and it takes ~2 weeks to process... If the team do not want to have this as a hardcoded number in the Release 2.0 branch, a further option is to have it hardcoded in a separate branch built on Struis (as with the AARTFAAC branch) that can be removed once the long term solution is in place.
Proposed Solution - Long term: I fully appreciate that the bigger you make the first radius cut, the longer that source association will take. This is likely to cause issues if people want to process data in real time scenarios where perhaps they are happy to be more lax on association. But, if people don't mind how long TraP takes or really want to be confident that sources are properly associated, they may prefer to allow the de Ruiter radius to take full control of source association. So I propose a new option in the job_params.cfg where users specify their own integer multiple for the number of bmaj used for the initial radius cut.
If we agree we can set the search area to 5*smaj, I can send the pull request (tests pass).
On Hipchat, Gijs and I have agreed to put this change into a branch and make a separate build with this change for testing purposes on Struis - rather than pushing a change straight into Release 2 before it's checked that it behaves as expected (Tim had quite rightly raised this concern). Today, I planned to make this code change but don't have the ability to run the tests yet - so it was a bit of a quick fix. But it sounds like you have already done this and run the unit tests? If yes, I think it makes sense that your tested version is the one that Gijs puts into a special build on Struis.
I pushed the changes to branch prob421 in my fork: https://github.com/bartscheers/tkp/tree/prob421, but did not send a pull request. You can pull it from there to where you want. Let me know if I need to do anything more.
I've copied this branch to the transientkp github tree, otherwise I can't tag it and build it on struis. Bart, can you switch to this branch?
There appears to be a bug in source association in Release 2.0. Sources do not appear to be always associated correctly by Release 2.0 and this is causing major issues with TraP use on large datasets.
I have identified an example of this in action in one of my datasets. The three sources in this issue should all be the same source (using de Ruiter radius). There is one relatively constant runcat source (multiple detections, id 33589) with two "new sources". The two new sources listed below are only detected once each and hence the two runcat positions remain constant after detection. One of the two new sources is detected when the runcat source 33589 is not blindly detected in the image - so should have been associated. The other, 35391, should be a 1-to-many source association type but this does not seem to be the case in the catalogue.
The database details and Banana URLs for the relevant sources have been pasted into HipChat. The de Ruiter radius was set to ~5.6 in the job parameters file.
There are 2 faint sources in this region of the image; 1 easily detected - source 33589 - and a probably confusing source a small distance away, likely causing the sources 34809 and 35391. However, the de Ruiter radius calculations show that these sources should be associated (as it doesn't care about the actual image properties, just the source positions and errors).
Why were these sources not associated? Caused by the runcat position of 33589 at the time of source association step (I think unlikely due to the big systematic error inserted in the job parameters to force the association) or a bug in the code?