transientskp / tkp

A transients-discovery pipeline for astronomical image-based surveys
http://docs.transientskp.org/
BSD 2-Clause "Simplified" License

ForeignKeyViolation when running multiple frequencies #615

Closed dentalfloss1 closed 3 weeks ago

dentalfloss1 commented 1 year ago

I was running LWA data through trap with multiple frequencies and at 269/363 images I ran into the following error:

2023-06-13 16:22:54 ERROR tkp.db.database: Query failed: (psycopg2.errors.ForeignKeyViolation) insert or update on table "extractedsource" violates foreign key constraint "extractedsource_ff_runcat_fkey"
DETAIL:  Key (ff_runcat)=(11666) is not present in table "runningcatalog".

This is the first time I have seen this error since switching to the latest TraP version (6, is that right?), so it seems to be related to having multiple images on the same timestep. Attached are the logs.

trap_debug_log.txt traplog.txt

HannoSpreeuw commented 1 year ago

Very interesting that you report this and thanks for the logs.

Is this from the master branch?

Does the problem persist when running from the Fix_SAWarning_copying_to_the_same_column_by_two_relationships branch?

dentalfloss1 commented 1 year ago

This is from the r6.0c tag, which I believe is essentially the same as the master branch, right? I am trying that branch now.



dentalfloss1 commented 1 year ago

It appears to produce the same error. trapdebuglog.txt traplog.txt

HannoSpreeuw commented 1 year ago

Thanks for checking this.

HannoSpreeuw commented 1 year ago

Trying to find out whether this has anything to do with the Python 2 to 3 conversion of TraP and/or the simultaneous upgrade of SQLAlchemy...

Did you perhaps also run a Python 2 version of TraP on this or a similar dataset? If yes, was this successful?

If no, I might have to get my hands on a multi-frequency dataset myself.

dentalfloss1 commented 1 year ago

Edit: it would be easier for me to send you the data. Just let me know the easiest way to do that.

HannoSpreeuw commented 1 year ago

Could you perhaps send me a link to a Dropbox folder?

dentalfloss1 commented 1 year ago

https://unmm-my.sharepoint.com/:f:/g/personal/sarahchastain1_unm_edu/ErkdYQoMajxKnLz5kJrYHyYBpHZ11CSVY0DXpdmdrpVodg?e=TkDEFy



HannoSpreeuw commented 1 year ago

Thanks. I was able to download these 2178 FITS files.

HannoSpreeuw commented 1 year ago

It completed, with lots of warnings of this type:

```
../../tkp/sourcefinder/stats.py:15: RuntimeWarning: divide by zero encountered in divide
  help1 = clip_limit/(sigma*numpy.sqrt(2))
../../tkp/sourcefinder/stats.py:17: RuntimeWarning: invalid value encountered in multiply
  return sigma**2*(help2-2*numpy.sqrt(2)*help1*numpy.exp(-help1**2))-clipped_std**2*help2
../../lib/python3.10/site-packages/scipy/optimize/_minpack_py.py:175: RuntimeWarning: The iteration is not making good progress, as measured by the
  improvement from the last ten iterations.
  warnings.warn(msg, RuntimeWarning)
```

So there are problems with the kappa-sigma clipping. These can be caused by images that are not well calibrated, like this one: oims_25.100MHz_2023-05-30T08-23-05_stokesI.fits.
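For context, kappa-sigma clipping iteratively rejects outlier pixels to estimate the background noise. The sketch below is a minimal illustration of the idea, not TraP's actual implementation (tkp/sourcefinder/stats.py additionally corrects the standard deviation for the bias introduced by clipping, which is where the `help1`/`help2` expressions in the warnings above come from); the function name is made up.

```python
import numpy as np

def kappa_sigma_clip(data, kappa=4.0, max_iter=10):
    """Iteratively reject pixels further than kappa * std from the mean.

    A minimal sketch of kappa-sigma clipping; returns the clipped
    mean and standard deviation.
    """
    clipped = np.asarray(data, dtype=float).ravel()
    for _ in range(max_iter):
        mean, std = clipped.mean(), clipped.std()
        if std == 0.0:
            # A flat (e.g. badly calibrated) image: this is exactly the
            # situation where divide-by-zero trouble starts downstream.
            break
        keep = np.abs(clipped - mean) < kappa * std
        if keep.all():  # converged: nothing more to reject
            break
        clipped = clipped[keep]
    return clipped.mean(), clipped.std()

# Pure Gaussian noise survives clipping essentially unchanged:
noise = np.random.default_rng(42).normal(0.0, 1.0, 10_000)
mean, std = kappa_sigma_clip(noise)
```

On a well-behaved noise image the clipped statistics stay close to the true ones; on a flat or garbage image the standard deviation collapses to zero, which matches the `RuntimeWarning`s quoted above.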

Perhaps you could also provide me with your pipeline.cfg, job_params.cfg and inject.cfg files, since I was not yet able to reproduce your error.

dentalfloss1 commented 1 year ago

injectcfg.txt job_paramscfg.txt pipeline_cfg.txt

Yeah, some of them, perhaps even a large number, are garbage. I was hoping to use the quality control to get rid of those automatically.

HannoSpreeuw commented 1 year ago

I can now produce a similar error. Or actually the opposite error, `DETAIL: Key (runcat)=(96) already exists`:

scipy/optimize/_minpack_py.py:175: RuntimeWarning: The iteration is not making good progress, as measured by the 
  improvement from the last ten iterations.
  warnings.warn(msg, RuntimeWarning)
WARNING:root:position errors extend outside image
WARNING:root:position errors extend outside image
WARNING:root:position errors extend outside image
09:44:24 INFO tkp.main: performed 136 forced fits in 6 images
09:44:24 INFO tkp.main: calculating variability metrics
09:44:24 ERROR tkp.main: timestep raised <class 'sqlalchemy.exc.IntegrityError'> exception: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "ix_varmetric_runcat"
DETAIL:  Key (runcat)=(96) already exists.

[SQL: INSERT INTO varmetric (runcat, v_int, eta_int, band, newsource, sigma_rms_max, sigma_rms_min, lightcurve_max, lightcurve_avg, lightcurve_median) SELECT runcat, v_int, eta_int, band, newsource, sigma_rms_max, sigma_rms_min, lightcurve_max, lightcurve_avg, lightcurve_median 
FROM (SELECT r.id AS runcat, r.wm_ra AS ra, r.wm_decl AS decl, r.wm_uncertainty_ew AS wm_uncertainty_ew, r.wm_uncertainty_ns AS wm_uncertainty_ns, r.xtrsrc AS xtrsrc, r.dataset AS dataset_id, r.datapoints AS datapoints, match_assoc.v_int AS v_int, match_assoc.eta_int AS eta_int, match_img.band AS band, newsrc_trigger.id AS newsource, newsrc_trigger.sigma_rms_max AS sigma_rms_max, newsrc_trigger.sigma_rms_min AS sigma_rms_min, max(agg_ex.f_int) AS lightcurve_max, avg(agg_ex.f_int) AS lightcurve_avg, median(agg_ex.f_int) AS lightcurve_median 
FROM (SELECT a_lt.runcat AS runcat_id, max(e_lt.f_int) AS max_flux 
FROM (SELECT a_laids.id AS assoc_id, last_assoc_timestamps.max_time AS max_time, last_assoc_timestamps.band AS band, last_assoc_timestamps.runcat AS runcat 
FROM (SELECT r_timestamps.id AS runcat, max(i_timestamps.taustart_ts) AS max_time, i_timestamps.band AS band 
FROM runningcatalog AS r_timestamps JOIN assocxtrsource AS a_timestamps ON r_timestamps.id = a_timestamps.runcat JOIN extractedsource AS e_timestamps ON a_timestamps.xtrsrc = e_timestamps.id JOIN image AS i_timestamps ON i_timestamps.id = e_timestamps.image 
WHERE %(param_1)s = i_timestamps.dataset GROUP BY r_timestamps.id, i_timestamps.band) AS last_assoc_timestamps JOIN assocxtrsource AS a_laids ON a_laids.runcat = last_assoc_timestamps.runcat JOIN extractedsource AS e_laids ON a_laids.xtrsrc = e_laids.id JOIN image AS i_laids ON i_laids.id = e_laids.image AND i_laids.taustart_ts = last_assoc_timestamps.max_time) AS last_assoc_per_band JOIN assocxtrsource AS a_lt ON a_lt.id = last_assoc_per_band.assoc_id JOIN extractedsource AS e_lt ON a_lt.xtrsrc = e_lt.id GROUP BY a_lt.runcat) AS last_ts_fmax JOIN assocxtrsource AS match_assoc ON match_assoc.runcat = last_ts_fmax.runcat_id JOIN extractedsource AS match_ex ON match_assoc.xtrsrc = match_ex.id AND match_ex.f_int = last_ts_fmax.max_flux JOIN runningcatalog AS r ON r.id = last_ts_fmax.runcat_id JOIN image AS match_img ON match_ex.image = match_img.id LEFT OUTER JOIN (SELECT n_ntr.id AS id, n_ntr.runcat AS rc_id, e_ntr.f_int / i_ntr.rms_min AS sigma_rms_min, e_ntr.f_int / i_ntr.rms_max AS sigma_rms_max 
FROM newsource AS n_ntr JOIN extractedsource AS e_ntr ON e_ntr.id = n_ntr.trigger_xtrsrc JOIN image AS i_ntr ON i_ntr.id = n_ntr.previous_limits_image 
WHERE %(param_2)s = i_ntr.dataset) AS newsrc_trigger ON newsrc_trigger.rc_id = r.id JOIN assocxtrsource AS agg_assoc ON r.id = agg_assoc.runcat JOIN extractedsource AS agg_ex ON agg_assoc.xtrsrc = agg_ex.id JOIN image AS agg_img ON agg_ex.image = agg_img.id AND agg_img.band = match_img.band 
WHERE %(param_3)s = r.dataset GROUP BY r.id, r.wm_ra, r.wm_decl, r.wm_uncertainty_ew, r.wm_uncertainty_ns, r.xtrsrc, r.dataset, r.datapoints, match_assoc.v_int, match_assoc.eta_int, match_img.band, newsrc_trigger.id, newsrc_trigger.sigma_rms_max, newsrc_trigger.sigma_rms_min) AS anon_1]
[parameters: {'param_1': 1, 'param_2': 1, 'param_3': 1}]
(Background on this error at: http://sqlalche.me/e/14/gkpj)

This happens directly after the problems with kappa-sigma clipping, so the two may be related.

HannoSpreeuw commented 1 year ago

Using method = "serial" instead of "multiproc" does not seem to make a difference.

HannoSpreeuw commented 1 year ago

I will try again with the latest Python 2 release of TraP.

HannoSpreeuw commented 1 year ago

Same error

14:59:26 ERROR tkp.main: timestep raised <class 'sqlalchemy.exc.IntegrityError'> exception: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "ix_varmetric_runcat"
DETAIL:  Key (runcat)=(96) already exists.

using TraP r5.0 with Python 2.

Good to know that this does not come from the Python 2 --> 3 conversion with its simultaneous SQLAlchemy upgrade.

HannoSpreeuw commented 1 year ago

> this is the first time I have seen this error appear so it seems to be something with having multiple images on the same timestep.

Before spending more time on this, I want to check with @AntoniaR and @jdswinbank if TraP should provide for this.

AntoniaR commented 1 year ago

Hi both, just had chance to catch up on this issue.

I think this is an issue I have seen before: it is basically saying that the source it is trying to update has been removed from the database. My guess is that the following is happening:

This is the section of code I am referring to: https://github.com/transientskp/tkp/blob/7a1a175b7205aac13d559de0afd1a02637f83a05/tkp/main.py#L362-L377

The behaviour we wanted was for TraP not to include forced-fit measurements for newly detected sources in that specific timestep, precisely to prevent issues like this.

Does this help track down the error?

@dentalfloss1 in the meantime, this error may be triggered by your current settings. I would strongly recommend doing some image-plane quality control before running TraP; then you can either give it max and min rms values or a list of good images to process. See this code for an example: https://github.com/transientskp/TraP_tools/tree/master/PreTraPimageQC/QC.
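The rms-based quality control suggested above can be sketched very simply: measure the background rms of each image beforehand and only feed TraP the ones that fall inside an accepted range. Everything here (function name, file names, thresholds) is illustrative; the linked TraP_tools PreTraPimageQC code is the real reference.

```python
def select_good_images(image_rms, rms_min, rms_max):
    """Keep only images whose background rms lies in [rms_min, rms_max].

    ``image_rms`` maps an image name to its measured rms noise, e.g.
    from a kappa-sigma-clipped estimate made outside TraP.
    """
    return [name for name, rms in image_rms.items()
            if rms_min <= rms <= rms_max]

# Hypothetical rms measurements for four images:
rms_per_image = {
    "img_a.fits": 0.8,
    "img_b.fits": 1.1,
    "img_c.fits": 25.0,  # badly calibrated outlier
    "img_d.fits": 0.0,   # flat/empty image
}
good = select_good_images(rms_per_image, rms_min=0.1, rms_max=5.0)
```

The resulting list of good images can then be passed to TraP, or the same thresholds supplied as its max/min rms quality-control values.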

Then I would also look at the source finding and association settings as I expect these to be somewhat different for LWA data. From the job_params.txt you sent earlier, I would change:

ew_sys_err = 10              
; Systematic errors on ra & decl (units in arcsec)
ns_sys_err = 10

The appropriate value is likely much higher for LWA images, as they have much lower resolution than MeerKAT's. I'd recommend testing what the typical position offset is outside of TraP and then using that value. I would expect it to be of order arcminutes rather than arcseconds.

Also

beamwidths_limit =  1.0

I tend to set this to 3.0, as it gives more reliable source associations, especially if sources are shifting slightly due to ionospheric effects (which are stronger at LWA frequencies).

You may also want to investigate the back_size values, as these seem quite small, though it does depend on your images.

I would also strongly recommend using a higher detection threshold than 5 sigma until you are confident with your other settings.
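Taken together, the suggested job_params.cfg changes might look like the fragment below. The values are illustrative only (especially the systematic position errors, which should be calibrated against your own images), and parameter names should be checked against the job_params.cfg template of your TraP version.

```ini
; Illustrative values only, following the advice above
ew_sys_err = 60          ; systematic position error (arcsec); of order arcmin for LWA
ns_sys_err = 60
beamwidths_limit = 3.0   ; wider association radius for ionosphere-shifted sources
detection_threshold = 8  ; stay well above 5 sigma while tuning other settings
```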

dentalfloss1 commented 1 year ago

I did a little digging on this, and it turns out to be quite interesting. There is a runningcatalog source that is returned for one of the images in the timestep by tkp.db.nulldetections.get_nulldetections(), but this source doesn't show up in any of the blind extractions from tkp.main.store_extractions(). Furthermore, when you issue the same SQL command after TraP quits, the source no longer shows up.
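The failure mechanism described in the original report can be reproduced in miniature: if the runningcatalog row a forced fit points at has been removed by the time the extractedsource row is inserted, the insert fails with a foreign-key violation. The stand-in below uses sqlite3 instead of Postgres and pares both tables down to the relevant columns, so it is a sketch of the mechanism, not TraP's schema.

```python
import sqlite3

# Minimal reproduction of the ForeignKeyViolation mechanism: insert an
# extractedsource row whose ff_runcat references a runningcatalog id
# that has already been deleted.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite enforces FKs only when asked
conn.execute("CREATE TABLE runningcatalog (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE extractedsource ("
    " id INTEGER PRIMARY KEY,"
    " ff_runcat INTEGER REFERENCES runningcatalog(id))"
)
conn.execute("INSERT INTO runningcatalog (id) VALUES (11666)")
conn.execute("DELETE FROM runningcatalog WHERE id = 11666")  # source removed

try:
    conn.execute("INSERT INTO extractedsource (id, ff_runcat) VALUES (1, 11666)")
    failed = False
except sqlite3.IntegrityError:
    failed = True  # analogous to psycopg2.errors.ForeignKeyViolation
```

This matches the observation above: get_nulldetections() still hands out a source whose runningcatalog row is gone (or not visible) by insert time, which points at a transaction-ordering or visibility problem rather than bad data.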

HannoSpreeuw commented 7 months ago

@dentalfloss1 Sorry for my late response. This sounds like a bug in tkp.db.nulldetections.get_nulldetections. Would you draw the same conclusion?

dentalfloss1 commented 7 months ago

Maybe? I don't really recall at this point. I remember wondering whether the problem had something to do with SQLAlchemy running SQL commands out of order or concurrently, but I never reached a satisfying conclusion. For my case I ended up turning off the forced fits and doing only blind extractions, which is not really a solution.

AntoniaR commented 3 weeks ago

We should check whether this is still an issue in the R7 TraP.

dentalfloss1 commented 3 weeks ago

Let me know if you need any info from me or want me to try anything.

AntoniaR commented 3 weeks ago

@dentalfloss1 do you have a small dataset that can reproduce the error? Then we can test it in the new TraP. Thanks!

dentalfloss1 commented 3 weeks ago

Annoyingly, I can't seem to replicate this issue anymore. I guess you can close it?

dentalfloss1 commented 3 weeks ago

I've even tried the Singularity images of TraP that I used previously.

AntoniaR commented 3 weeks ago

Ok, let's close this issue for now but do please reopen or file a new issue if you see it again. Thanks!