thespacedoctor / sherlock

The QUB Transient Classifier
GNU General Public License v3.0

Adding PanSTARRS DR1 into Sherlock #70

Closed thespacedoctor closed 4 years ago

thespacedoctor commented 6 years ago

Here are my thoughts on what PanSTARRS data needs to be extracted from MAST:

http://astronotes.co.uk/blog/2018/06/06/a-query-to-export-panstarrs-dr1-data-for-use-with-sherlock.html

See the query and sample dataset at the end. If everyone's happy I'll send the query off to MAST at the end of the day.

thespacedoctor commented 6 years ago

This page mentions a 'basic s/g separation parameter' to be found in the StackObjectAttributes table:

https://outerspace.stsci.edu/display/PANSTARRS/PS1+Source+extraction+and+catalogs

The columns this comment relates to are the XExtNSigma columns (X = grizy). From the definition:

An extendedness measure for the g filter stack detection based on the deviation between PSF and Kron (1980) magnitudes, normalized by the PSF magnitude uncertainty.

it seems this metric is to be used with NM's first recipe here. I'm not sure how the normalisation by the PSF mag error works, but I think we can assume anything < 0 is a star and anything > 0 is a galaxy.

It's unclear from the ZTF schema whether this is what they're using - I think not (unless there's further normalisation going on):

sgscore1: Star/Galaxy score of closest source from PS1 catalog; if exists within 30 arcsec: 0 <= sgscore <= 1 where closer to 1 implies higher likelihood of being a star
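
Taking the XExtNSigma definition quoted above at face value, one plausible reading can be sketched as follows. The formula, its sign convention and the function name are assumptions drawn from the quoted wording, not taken from the PS1 pipeline, so treat this as a guess at how the normalisation might work rather than the actual recipe.

```python
def ext_nsigma(psf_mag, kron_mag, psf_mag_err):
    """Deviation between PSF and Kron magnitudes in units of the PSF magnitude error.

    Assumed form: (PSF - Kron) / sigma_PSF. An extended source has more flux in
    its Kron aperture than in the PSF fit, so its Kron magnitude is brighter
    (numerically smaller) and the value comes out positive.
    """
    if psf_mag_err <= 0:
        raise ValueError("psf_mag_err must be positive")
    return (psf_mag - kron_mag) / psf_mag_err
```

Under this reading the value is unbounded in both directions, unlike sgscore1, which is confined to the range 0 to 1.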

thespacedoctor commented 6 years ago

request sent to A.T.

thespacedoctor commented 6 years ago

Unfortunately AT can't help us - he had no access to PS-DR1 catalogues. I suggest we either put a request in with Armin Rest or Edinburgh. Most likely Edinburgh will be quicker to respond.

thespacedoctor commented 6 years ago

Just reread the ZTF ATel:

we employ a machine-learning star-galaxy separator, based on PS1 data

thespacedoctor commented 6 years ago

Download of PanSTARRS DR1 started from MAST -- let's see how long it takes to get banned!

http://astronotes.co.uk/blog/2018/06/20/downloading-panstarrs-dr1-catalogue-data.html

thespacedoctor commented 5 years ago

2,345,500,000 rows downloaded so far. I'm going to start getting these rows into the sherlock database. Do we have enough space on psdb3, Ken?

genghisken commented 5 years ago

The simple answer is "no"! However, this is very timely. Robert has set up a new pair of machines (db0 and db1) which have a 16TB RAID0 SSD (NVME) installed. I'm currently installing MySQL on these machines, and when done we should move the crossmatch_catalogues database to one of them. Note that RAID0 means that if any of the SSDs fail, we lose the database entirely. But if I get replication and backups running, this shouldn't be too large a risk.

thespacedoctor commented 4 years ago

The PS1 DR1 catalogue with Tachibana & Miller point-source scores is now loaded into the Sherlock database in one large table. We now need to make some decisions on how to build the catalogue into our search algorithm. I'm going to post some notes (please correct me if you think I've misunderstood any of the machine learning jargon) and then pose some questions.

Executive Summary of Tachibana & Miller Paper - Morphological Classification Model to Define Unresolved PanSTARRS1 Sources

Note unresolved is taken to mean point-like sources (i.e. not a cloud or disc). Asteroids, QSOs, stars and distant/small galaxies will be unresolved. Resolved objects will include most nearby/larger galaxies.

Training Set

The model was trained using ~50,000 HST COSMOS morphology classifications and the performance tested against SDSS spectra and Gaia sources (with high-confidence stellar identifications).

Features

Photometry measurements for the model are those taken from the PS1 StackObjectThin and StackObjectAttributes tables; i.e. as measured from the PS1 stacked images and not individual single exposures. The shape measurements from the StackObjectAttributes table are used to identify unresolved sources in the PS1 DR1 catalogue.

The PS1 DR1 catalogues provided 3 measurements of flux in 5 filters:

  1. Aperture photometry
  2. Kron photometry
  3. PSF photometry

To combat the issue of missing data, a set of 'white flux' features are added to the PS1 metrics that involve combining flux measurements across all filters in which a PS1 source is detected (often not in all 5 filters).
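
As a rough illustration of how a 'white flux' feature sidesteps missing filters, the sketch below combines whatever per-filter fluxes are available into a single value. The inverse-variance weighting, the function name and the dictionary layout are all assumptions made for this example; the paper should be consulted for the weighting actually used.

```python
import math


def white_flux(fluxes, flux_errs):
    """Combine per-filter fluxes into one value, skipping undetected filters.

    fluxes / flux_errs: dicts keyed by filter name ('g', 'r', 'i', 'z', 'y').
    A missing key, None or NaN means "not detected in that filter".
    Weights are inverse variances, which is an assumption of this sketch.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for band, flux in fluxes.items():
        err = flux_errs.get(band)
        if flux is None or err is None:
            continue
        if math.isnan(flux) or math.isnan(err) or err <= 0:
            continue
        weight = 1.0 / (err * err)
        weighted_sum += weight * flux
        weight_total += weight
    # None signals that no filter had a usable measurement
    return weighted_sum / weight_total if weight_total > 0 else None
```

A source detected only in g and r still gets a value, which is the point of the feature: the classifier never has to handle a per-filter gap directly.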

Very Basic Star-Galaxy Separation

The PS1 documentation states that for sources with $i < 21$ mag those sources with

$$i_{\mathrm{PSF}} - i_{\mathrm{Kron}} > 0.05 \text{ mag}$$

can be considered as galaxies. The obvious difference between stellar and galaxy populations in the bright regime is clearly presented in Figure 3 of the paper.
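
For reference, the basic cut usually quoted from the PS1 documentation is iPSFMag - iKronMag > 0.05 mag for galaxies, valid for sources brighter than i = 21. A minimal sketch of that rule follows; the function name, the "unknown" label and the choice to test the PSF magnitude against the faint limit are assumptions for illustration.

```python
def simple_ps1_class(i_psf_mag, i_kron_mag, cut=0.05, faint_limit=21.0):
    """Very basic star/galaxy separation from i-band PSF and Kron magnitudes."""
    if i_psf_mag is None or i_kron_mag is None:
        return "unknown"
    # The cut is only quoted as reliable for i < 21; which i magnitude is
    # compared with the limit is an assumption here (PSF magnitude).
    if i_psf_mag >= faint_limit:
        return "unknown"
    return "galaxy" if (i_psf_mag - i_kron_mag) > cut else "star"
```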

Figure of Merit

The Figure of Merit (FoM) is defined in this model as the True Positive Rate (TPR) that corresponds to a False Positive Rate (FPR) of 0.005.

From Figure 4 in the paper we see that a FPR threshold of 0.005 gives a FoM TPR of ~0.71. So we will be able to flag and remove 71% of the stellar 'contaminants' at the expense of also removing the 0.5% of galaxies that are misclassified as stars. We are of course free to define our own FoM value to suit our needs.
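
To make the FoM definition concrete, here is a small sketch that computes the TPR at a fixed FPR from a list of scores and true labels. It is a generic implementation written for this note, not code from the paper; the label convention (1 = star, 0 = galaxy) and the function name are assumptions.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.005):
    """Return (tpr, cutoff): the true positive rate with the FPR held at or below max_fpr.

    labels: 1 for point sources (stars), 0 for extended sources (galaxies).
    A source is called a star when its score is strictly above the cutoff.
    """
    if not 0.0 <= max_fpr < 1.0:
        raise ValueError("max_fpr must be in [0, 1)")
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not positives or not negatives:
        raise ValueError("need at least one star and one galaxy")
    # Number of galaxies we may wrongly call stars
    allowed_false_positives = int(max_fpr * len(negatives))
    # Scoring strictly above this galaxy keeps the false positives within budget
    cutoff = negatives[allowed_false_positives]
    true_positives = sum(1 for s in positives if s > cutoff)
    return true_positives / len(positives), cutoff
```

Raising `max_fpr` lowers the cutoff, which recovers more stars at the cost of more galaxies, and that is the trade-off the questions further down are about.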

Table 3 contains the information we need to make our decision on cuts:

Figure 7 reveals the accuracy of the model for individual stellar and extended sources. As can be seen the accuracy remains decent even down in the faintest regime. Here accuracy is a measure of the stellar/galaxy separation if it is assumed galaxies have ps1_psc < 0.5 and stellar sources ps1_psc > 0.5.

Note most of the ambiguous sources (0.2 < ps1_psc < 0.8) are to be found in the galactic plane where blended stars are hard to define.

thespacedoctor commented 4 years ago

Non-Identified Sources

About half of the sources in the entire 2.9 billion row PS1 DR1 table do not have a point-source score. This is due to the way Tachibana & Miller select the sample to run their classifier against. Not much info is given in the paper about the selection but I think the answer is in this notebook:

https://github.com/adamamiller/PS1_star_galaxy/blob/master/PS1casjobs/PS1features.query

A simplified version of the query just showing the cuts could be written:

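-- Count the sources that pass the cuts and appear exactly once
SELECT
    COUNT(*)
FROM
    (SELECT
        objid
    FROM
        StackObjectView
    WHERE
        primaryDetection = 1 AND nDetections > 2
    GROUP BY objid
    HAVING COUNT(objid) = 1) AS unique_sources;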

Running this query on MAST gets me very close to the ~1.5 billion source count found in the Tachibana & Miller sample.

From my understanding, the vast majority of the sources that do not make the cut are sources detected in the stacked images that are not detected in the individual warp images.

thespacedoctor commented 4 years ago

Questions/Discussion

  1. What's the maximum fraction of galaxies we're willing to sacrifice to remove stars from the transient stream?
  2. Do we want to stagger this fraction by magnitude? By moving from a 0.5% FPR to a 1% FPR for sources < 21 mag we remove 96% of the stars instead of 80% (see Table 3).
  3. How do we use the ~1.4 billion fainter sources not in the Tachibana & Miller catalogue?

DRY answers

  1. I think we start with 0.5% FPR and if we are still finding too many stellar contaminants we can adjust the threshold.
  2. My gut feeling here is no. Not sure there's much to be gained for the risk of throwing out real transients. Our main issues are with stars fainter than 21 mag that flare above survey detection limits.
  3. Too risky to make any judgement - could mark these sources as UNCLEAR to make the eyeballer at least aware that there is something there in the PS1 stack images.

thespacedoctor commented 4 years ago

Views created on PS1 catalogue for stars, galaxies and unknown:

-- PS1 Stars
CREATE VIEW `tcs_view_star_ps1_dr1` AS
    SELECT 
        *
    FROM
        tcs_cat_ps1_dr1
    WHERE
        ps_score >= 0.83 AND ps_score IS NOT NULL;

-- PS1 Galaxies
CREATE VIEW `tcs_view_galaxy_ps1_dr1` AS
    SELECT 
        *
    FROM
        tcs_cat_ps1_dr1
    WHERE
        ps_score < 0.83 AND ps_score IS NOT NULL;

-- PS1 Unclear
CREATE VIEW `tcs_view_unclear_ps1_dr1` AS
    SELECT 
        *
    FROM
        tcs_cat_ps1_dr1
    WHERE
        ps_score IS NULL;

genghisken commented 4 years ago

Great. I'll need to remember to apply these in Edinburgh. I'll add a github action for myself in Lasair. How did you arrive at the 0.83 magic number? Is this the 0.5% FPR?

thespacedoctor commented 4 years ago

Yes. Table 3, first row under the 0.005 column. I think I remember this as one of the numbers Frank Masci suggested to use in the cuts when siphoning off the best transients from the ZTF stream for the brokers.

genghisken commented 4 years ago

Great. A float is very unlikely to have an exact score of 0.830000, but we should make one of those inequalities >= or <=.

thespacedoctor commented 4 years ago

very good point. Updated star view, see above ↑ (and in database)
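
For anyone reproducing the split outside the database, the three views with the agreed boundary handling boil down to the following sketch. The function and constant names are made up for illustration; only the 0.83 threshold and the NULL handling come from the views.

```python
STAR_THRESHOLD = 0.83  # Table 3 value at FPR = 0.005, as discussed in this thread


def classify_ps_score(ps_score, threshold=STAR_THRESHOLD):
    """Mirror the star / galaxy / unclear views for a single ps_score value."""
    if ps_score is None:
        # No point-source score: matches the 'unclear' view (ps_score IS NULL)
        return "unclear"
    # The boundary value itself goes to the star view (>=)
    return "star" if ps_score >= threshold else "galaxy"
```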

genghisken commented 4 years ago

Also - I like the idea of tagging objects as UNCLEAR for the no-score PS1 objects. At least it removes the "ORPHAN" classification and indicates that there is something there - even if we can't do much with it. What's the disadvantage?

smarttgit commented 4 years ago

Read the Tachibana & Miller paper again. Answers to Dave's questions :

Yes, agree. 0.83 is probably optimal

No mag dependent variation, just a straight 0.83 cut

The 50% that have no RF score: I guess these are mostly (or exclusively) at the faint end? Are they roughly >21? Then I would use the offset to decide further. If the transient is offset, then likely this is a galaxy and hence the transient is more likely a SN:

if Object_coords < 1.5" from a PSObject which has r >21 or i >21 then classify as UNCLEAR and report offset, magnitude of the PSObject (and name if possible)

elseif 1.5" < Object_coords < 3.0" from a PSObject with r > 21 or i > 21 then classify as SN and report offset, magnitude of the PSObject (and name if possible)
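
The two-branch rule above can be written out as a short sketch. This is only an illustration of the proposal as worded here, not the code that went into Sherlock; the function name and the treatment of a separation of exactly 1.5" (assigned to the SN branch) are assumptions, since the rule leaves that boundary unspecified.

```python
def classify_unscored_match(separation_arcsec, r_mag=None, i_mag=None, faint_limit=21.0):
    """Apply the proposed offset rule to a PS1 object that has no point-source score.

    Returns "UNCLEAR", "SN", or None when the rule does not apply.
    """
    # The rule only covers faint PS1 objects: r > 21 or i > 21
    is_faint = any(m is not None and m > faint_limit for m in (r_mag, i_mag))
    if not is_faint:
        return None
    if separation_arcsec < 1.5:
        return "UNCLEAR"
    if separation_arcsec < 3.0:
        # Offset from the faint object: more likely a host galaxy, so flag as SN
        return "SN"
    return None
```

In both branches the offset and the magnitude of the PS1 object would be reported alongside the classification, as requested above.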

thespacedoctor commented 4 years ago

I have the algorithm set up as above now. Running some tests.

ATLAS20his, originally an ORPHAN, is now a SN! Congratulations ATLAS20his!!

Transient's Predicted Classification: SN
Suggested Associations:
+-------------------+-------+------------+-----------------------+----------------------+------------------------+---------------------------+-------------+--------------+-------------------+--------------------------+------------------+-----------+-------+---------+------------+--------+------------+---------+-----------------------------+--------------+
| association type  | rank  | rankScore  | catalogue table name  | catalogue object id  | catalogue object type  | catalogue object subtype  | raDeg       | decDeg       | separationArcsec  | physical separation kpc  | direct distance  | distance  | z     | photoZ  | photoZErr  | Mag    | MagFilter  | MagErr  | classification reliability  | merged rank  |
+-------------------+-------+------------+-----------------------+----------------------+------------------------+---------------------------+-------------+--------------+-------------------+--------------------------+------------------+-----------+-------+---------+------------+--------+------------+---------+-----------------------------+--------------+
| SN                | 1     | 2005.00    | PS1                   | 94280837680782324    | galaxy                 | multiple                  | 83.7681061  | -11.4316385  | 0.15              |                          |                  |           |       |         |            | 22.56  | r          | 0.01    | association                 |              |
| SN                |       | 2005.00    | PanSTARRS DR1         | 94280837680782324    | galaxy                 |                           | 83.7681061  | -11.4316385  | 0.15              |                          |                  |           |       |         |            | 22.56  | r          | 0.01    | association                 | 1            |
+-------------------+-------+------------+-----------------------+----------------------+------------------------+---------------------------+-------------+--------------+-------------------+--------------------------+------------------+-----------+-------+---------+------------+--------+------------+---------+-----------------------------+--------------+

thespacedoctor commented 4 years ago

Same for ZTF20aasoaeu:

Transient's Predicted Classification: SN
Suggested Associations:
+-------------------+-------+------------+-----------------------+----------------------+------------------------+---------------------------+-------------+--------------+-------------------+--------------------------+------------------+-----------+-------+---------+------------+--------+------------+---------+-----------------------------+--------------+
| association type  | rank  | rankScore  | catalogue table name  | catalogue object id  | catalogue object type  | catalogue object subtype  | raDeg       | decDeg       | separationArcsec  | physical separation kpc  | direct distance  | distance  | z     | photoZ  | photoZErr  | Mag    | MagFilter  | MagErr  | classification reliability  | merged rank  |
+-------------------+-------+------------+-----------------------+----------------------+------------------------+---------------------------+-------------+--------------+-------------------+--------------------------+------------------+-----------+-------+---------+------------+--------+------------+---------+-----------------------------+--------------+
| SN                | 1     | 2005.00    | PS1                   | 87331876323496707    | galaxy                 | multiple                  | 187.632388  | -17.2197069  | 1.85              |                          |                  |           |       |         |            | 20.94  | r          | 0.01    | association                 |              |
| SN                |       | 2005.00    | PanSTARRS DR1         | 87331876323496707    | galaxy                 |                           | 187.632388  | -17.2197069  | 1.85              |                          |                  |           |       |         |            | 20.94  | r          | 0.01    | association                 | 1            |
+-------------------+-------+------------+-----------------------+----------------------+------------------------+---------------------------+-------------+--------------+-------------------+--------------------------+------------------+-----------+-------+---------+------------+--------+------------+---------+-----------------------------+--------------+

thespacedoctor commented 4 years ago

ZTF20aasikyz moves from ORPHAN to UNCLEAR

Transient's Predicted Classification: UNCLEAR
Suggested Associations:
+-------------------+-------+------------+-----------------------+----------------------+------------------------+---------------------------+-------------+--------------+-------------------+--------------------------+------------------+-----------+-------+---------+------------+--------+------------+---------+-----------------------------+--------------+
| association type  | rank  | rankScore  | catalogue table name  | catalogue object id  | catalogue object type  | catalogue object subtype  | raDeg       | decDeg       | separationArcsec  | physical separation kpc  | direct distance  | distance  | z     | photoZ  | photoZErr  | Mag    | MagFilter  | MagErr  | classification reliability  | merged rank  |
+-------------------+-------+------------+-----------------------+----------------------+------------------------+---------------------------+-------------+--------------+-------------------+--------------------------+------------------+-----------+-------+---------+------------+--------+------------+---------+-----------------------------+--------------+
| UNCLEAR           | 1     | 1010.55    | PS1                   | 92372318529638110    | uncertain              | multiple                  | 231.852898  | -13.0185416  | 0.55              |                          |                  |           |       |         |            | 22.46  | r          | 0.01    | synonym                     |              |
| UNCLEAR           |       | 1010.55    | PanSTARRS DR1         | 92372318529638110    | uncertain              |                           | 231.852898  | -13.0185416  | 0.55              |                          |                  |           |       |         |            | 22.46  | r          | 0.01    | synonym                     | 1            |
+-------------------+-------+------------+-----------------------+----------------------+------------------------+---------------------------+-------------+--------------+-------------------+--------------------------+------------------+-----------+-------+---------+------------+--------+------------+---------+-----------------------------+--------------+

thespacedoctor commented 4 years ago

I've performed a lot of tests and adjusted the search algorithm slightly here and there. I'm now happy with the results I'm getting. I'll send @genghisken the new algorithm.

smarttgit commented 4 years ago

go for it !

genghisken commented 4 years ago

For ATLAS and Pan-STARRS I'll switch the algorithms late tonight (2020-03-09) or tomorrow morning.

The catalogues are copied to Lasair (live - lasair-node0) but not yet to Lasair-dev (lasair-dev-node0). Even on Lasair (live) I need to move the PS1 catalogue in place and run the views (Gaia DR2 is already in place). The catalogues are physically in Edinburgh so it should be easy to copy them from lasair-node0 to lasair-dev-node0. Writing all this down so I remember what I need to do! Also raised as a Lasair github issue. See https://github.com/lsst-uk/lasair/issues/64.

Goodbye ORPHANs!

thespacedoctor commented 4 years ago

Remember I had to add a few indexes to the Gaia table to speed up crossmatch results. You might have to re-export to ROE. Also still cleaning up ~1% of the PS1 data I'm struggling to assign HTMIds to. Hope to have that issue resolved tomorrow.

Finally, you will need to update the helper tables that do the column matching: tcs_helper_catalogue_views_info and tcs_helper_catalogue_tables_info.

genghisken commented 4 years ago

I forgot about moving the tcs_helper_catalogue_tables_info and tcs_helper_catalogue_views_info tables from QUB to Lasair at ROE as you state above. Done now!

Additionally, I somehow managed to misname the tcs_view_unclear_ps1_dr1 to tcs_view_unknown_ps1_dr1. No idea how that happened. Renamed now. Sherlock is definitely working.

genghisken commented 4 years ago

There's still an ongoing issue with Sherlock, so I've reopened this issue. We are getting the following error.

File "/data/anaconda/envs/sherlock/lib/python2.7/site-packages/sherlock/transient_classifier.py", line 1935, in generate_match_annotation
   annotation = "The transient is %(classificationReliability)s with <em>%(objectId)s</em>; %(best_mag_filter)s%(best_mag)smag %(objectType)s found in the %(catalogueString)s. It's located %(location)s.%(absMag)s" % locals()
KeyError: 'classificationReliability'

The helper tables have been propagated, so it might just be a missing view or the way the view was created has changed (or not been propagated correctly).
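
For context on the error itself: `%`-formatting against a mapping raises KeyError as soon as the template names a key the mapping does not contain, which is what the traceback shows for `locals()`. The sketch below reproduces that behaviour with a cut-down template and shows one defensive pattern. The helper names and the shortened template are hypothetical and are not part of sherlock.

```python
TEMPLATE = "The transient is %(classificationReliability)s with %(objectId)s."
REQUIRED_KEYS = ("classificationReliability", "objectId")


def format_raises_keyerror(values, template=TEMPLATE):
    """Return True if plain %-formatting fails because a key is absent."""
    try:
        template % values
    except KeyError:
        return True
    return False


def render_annotation(values, template=TEMPLATE, required=REQUIRED_KEYS):
    """Check for absent or None values first, so the failure names the bad fields."""
    missing = [key for key in required if values.get(key) is None]
    if missing:
        raise ValueError("cannot annotate, missing: " + ", ".join(missing))
    return template % values
```

Checking up front turns an opaque KeyError deep in the annotation step into a message that points at the offending field.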

thespacedoctor commented 4 years ago

Thanks Ken ...

Tracked the issue down to 622 rows inserted into the 'sherlock_crossmatches' table on 2020-01-06 at 05:17:25 where classificationReliability was set to null. This should never be the case. My guess is that the database table was left in a bad state after a sherlock process was killed. From email logs I know there was a table rebuild done at some stage on 6th Jan, but not at this early hour. Might be due to the same issue (issue number 2) as pointed out in issue #53 ... but also may not!

To fix the issue I removed the sherlock classifications related to these cross matched rows:

UPDATE objects 
SET 
    sherlock_classification = NULL
WHERE
    primaryId IN (SELECT 
            transient_object_id
        FROM
            sherlock_crossmatches
        WHERE
            classificationReliability IS NULL);

and then reran sherlock. Everything is now running fine and sherlock completes.
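
The effect of that UPDATE can be checked on a toy database before running it against the real one. The sketch below uses an in-memory SQLite database purely for illustration; the two-column schemas and the sample rows are invented, and only the table names, column names and the UPDATE statement come from the fix above.

```python
import sqlite3


def build_demo_db():
    """Create a tiny stand-in for the objects and sherlock_crossmatches tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE objects (
            primaryId INTEGER PRIMARY KEY,
            sherlock_classification TEXT
        );
        CREATE TABLE sherlock_crossmatches (
            transient_object_id INTEGER,
            classificationReliability TEXT
        );
        INSERT INTO objects VALUES (1, 'SN'), (2, 'UNCLEAR'), (3, 'SN');
        INSERT INTO sherlock_crossmatches VALUES
            (1, 'association'), (2, NULL), (3, 'synonym');
        """
    )
    return conn


def reset_bad_classifications(conn):
    """Null the classification of objects with a NULL-reliability crossmatch."""
    cursor = conn.execute(
        """
        UPDATE objects
        SET sherlock_classification = NULL
        WHERE primaryId IN (
            SELECT transient_object_id
            FROM sherlock_crossmatches
            WHERE classificationReliability IS NULL
        )
        """
    )
    conn.commit()
    # Number of object rows that will be reclassified on the next run
    return cursor.rowcount
```

Only the object tied to the NULL crossmatch row is reset; the others keep their classification and are left alone by the rerun.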