ml4astro / galaxy2galaxy

Library of models, datasets, and utilities to build generative models for astronomical images.
MIT License

Create a standardized HSC dataset of multiband galaxy cutouts #4

Open EiffL opened 5 years ago

EiffL commented 5 years ago

I'm opening this PR to hopefully merge into G2G a version of the dataset that Kate has been working with (https://github.com/kstoreyf/anomalies-GAN-HSC). But I also understand that Connor is using HSC data as well, maybe also Dezso.

What we need is:

We can use this thread to discuss this question

EiffL commented 5 years ago

I'm going to transcribe here some of the feedback provided by @kstoreyf:

- It would be more interesting to have a brighter sample, but a broader one might be more widely applicable (and larger).
- The data volume will get big quickly; we might think about also having a smaller set, like how people can download MNIST- and CIFAR-size datasets on their laptops.
- We probably couldn't have "labels" for everything, but could link to a catalog with all of the metadata.
- If we make a cut like I did, we will end up with a lot of junk, so we need to think about having a messy sample vs. a cleaner, possibly biased sample with more aggressive cuts. Even with these, it would be hard to eliminate all junk.
- Finally, depending on the sample we decide on, it could take a while to download the data via the public streams. I wonder if HSC would help us out (just to get the public data faster).

EiffL commented 5 years ago

@mhuertascompany thoughts on interesting sample selection criteria ?

EiffL commented 5 years ago

It has taken a while, but preliminary code is finally there on master that can be used to generate a multiband HSC dataset. The way it works is that you provide a .sql file; the code then downloads that catalog, queries all the corresponding cutouts, and aggregates them into a TFRecord dataset for super easy use.

Here is the command line to generate a test HSC dataset:

$ g2g-datagen --problem=img2img_hsc --data_dir=data/hsc_problem --tmp_dir=data/temp/hsc_problem

Once this is done, you can use this dataset from tensorflow super easily this way:

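import tensorflow as tf
from galaxy2galaxy import problems

# Dataset splits are selected with the standard TensorFlow estimator mode keys
Modes = tf.estimator.ModeKeys

hsc = problems.problem('img2img_hsc')
# This creates a tf.data.Dataset that you can directly use to feed data to a network
dset = hsc.dataset(Modes.TRAIN, data_dir='data/hsc_problem')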

Check out an example of using this dataset here: https://github.com/ml4astro/galaxy2galaxy/blob/%234/notebooks/AccessingHSCProblems.ipynb

Now, the question of interest here is, how do you define an HSC problem? Here is an example: https://github.com/ml4astro/galaxy2galaxy/blob/517b2d1410c427fe1e1f32c010ab0f400211677c/galaxy2galaxy/data_generators/hsc.py#L101

Essentially, to create a new problem based on a different sample, you just need to add the corresponding sql file in galaxy2galaxy/data_generators/hsc_utils and add a subclass of HSCProblem in hsc.py, with the p.sql_file set to the corresponding sql file name.
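To make the recipe above concrete, here is a rough sketch of what such a subclass could look like. Everything in it is an assumption for illustration: the `HSCProblem` class below is a simplified stand-in so the snippet runs on its own (the real one lives in hsc.py and may have a different `hparams` signature), and `Img2imgHSCBright` and `hsc_pdr2_wide_bright.sql` are hypothetical names. Check the linked example in hsc.py for the actual pattern, including how problems get registered.

```python
from types import SimpleNamespace


class HSCProblem:
    """Simplified stand-in for the HSCProblem base class in hsc.py (assumption)."""

    def hparams(self, defaults, model_hparams=None):
        # The real base class sets more fields; this stand-in sets none.
        return defaults


class Img2imgHSCBright(HSCProblem):
    """Hypothetical problem built on a different sample selection."""

    def hparams(self, defaults, model_hparams=None):
        p = defaults
        # Hypothetical file name; the file would sit in
        # galaxy2galaxy/data_generators/hsc_utils
        p.sql_file = "hsc_pdr2_wide_bright.sql"
        return p


# Show which query file the new problem would point to
params = Img2imgHSCBright().hparams(SimpleNamespace())
print(params.sql_file)
```

The only thing that changes between samples in this sketch is the sql file name, which matches the idea that a new sample is defined entirely by its query.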

EiffL commented 5 years ago

@kstoreyf could you have a look at this SQL query and tell me if it matches yours: https://github.com/ml4astro/galaxy2galaxy/blob/master/galaxy2galaxy/data_generators/hsc_utils/hsc_pdr2_wide_anomaly.sql If it does, I can generate the corresponding dataset to make it easily accessible. Then you can train the G2G GANs on it.

kstoreyf commented 5 years ago

If you want to match my sample without having to do any post-processing, instead of just grabbing the interpolatedcenter and crcenter flags, cut on them by adding under WHERE:

AND NOT f1.g_pixelflags_interpolatedcenter AND NOT f1.r_pixelflags_interpolatedcenter AND NOT f1.i_pixelflags_interpolatedcenter AND NOT f1.z_pixelflags_interpolatedcenter AND NOT f1.y_pixelflags_interpolatedcenter

AND NOT f1.g_pixelflags_crcenter AND NOT f1.r_pixelflags_crcenter AND NOT f1.i_pixelflags_crcenter AND NOT f1.z_pixelflags_crcenter AND NOT f1.y_pixelflags_crcenter
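Since the ten cuts above follow one pattern (five bands times two flags), they could also be generated rather than typed by hand. A minimal Python sketch, assuming the `f1` table alias and column naming shown above; `pixelflag_cuts` is a hypothetical helper, not something that exists in galaxy2galaxy:

```python
BANDS = ["g", "r", "i", "z", "y"]
FLAGS = ["interpolatedcenter", "crcenter"]


def pixelflag_cuts(alias="f1", bands=BANDS, flags=FLAGS):
    """Build one 'AND NOT ...' clause per (flag, band) pair, one clause per line."""
    clauses = [
        f"AND NOT {alias}.{band}_pixelflags_{flag}"
        for flag in flags
        for band in bands
    ]
    return "\n".join(clauses)


# Print the clauses, ready to paste under WHERE
print(pixelflag_cuts())
```

This keeps the band list in one place, so a cut on a different flag only needs a new entry in `FLAGS`.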

also i did a second query and merged them to get more metadata, would be good if you just add to the main query under SELECT:

m.g_blendedness_abs_flux, m.g_blendedness_flag,
m.r_blendedness_abs_flux, m.r_blendedness_flag,
m.i_blendedness_abs_flux, m.i_blendedness_flag, 
m.z_blendedness_abs_flux, m.z_blendedness_flag, 
m.y_blendedness_abs_flux, m.y_blendedness_flag, 

    -- Shape of the CModel model
m.i_cmodel_exp_ellipse_11, m.i_cmodel_exp_ellipse_22, m.i_cmodel_exp_ellipse_12,
m.i_cmodel_dev_ellipse_11, m.i_cmodel_dev_ellipse_22, m.i_cmodel_dev_ellipse_12,
m.i_cmodel_ellipse_11, m.i_cmodel_ellipse_22, m.i_cmodel_ellipse_12,
m.r_cmodel_exp_ellipse_11, m.r_cmodel_exp_ellipse_22, m.r_cmodel_exp_ellipse_12,
m.r_cmodel_dev_ellipse_11, m.r_cmodel_dev_ellipse_22, m.r_cmodel_dev_ellipse_12,
m.r_cmodel_ellipse_11, m.r_cmodel_ellipse_22, m.r_cmodel_ellipse_12

and under FROM: LEFT JOIN pdr2_wide.meas AS m USING (object_id)

...after writing all that i decided i should add a complete query anyway, here it is you can just copy this: https://github.com/kstoreyf/anomalies-GAN-HSC/blob/master/prepdata/catalog_clean.sql

EiffL commented 5 years ago

Hehe, perfect, thanks so much!