EiffL opened 5 years ago
I'm going to transcribe here some of the feedback provided by @kstoreyf:
- It would be more interesting to have a brighter sample, but a broader one might be more widely applicable (and larger).
- The data volume will get big quickly; we might think about also having a smaller set, the way people can download MNIST- and CIFAR-sized datasets on their laptops.
- We probably couldn't have "labels" for everything, but could link to a catalog with all of the metadata.
- If we make a cut like I did, we will end up with a lot of junk, so we need to think about having a messy sample vs. a cleaner, possibly biased sample with more aggressive cuts; even with those, it would be hard to eliminate all junk.
- Finally, depending on the sample we decide on, it could take a while to download the data via the public streams; I wonder if HSC would help us out (just to get the public data faster).
To address the question of data volume: regenerating all the data from the HSC cutout server is of course a slow process, but we can aim for a dataset of a given maximum size, maybe ~20-30 GB at most, which we can host on Zenodo and which is still reasonable to download. To stay within this constraint, we can simply subsample the catalog at random.
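As a rough illustration, a size-capped random subsample could be drawn like the minimal sketch below; the file name and the per-cutout size are made-up placeholders, not anything from G2G:

```python
import pandas as pd

# Illustrative numbers only: cap the dataset at ~25 GB given an assumed
# average on-disk size per multiband cutout.
MAX_BYTES = 25 * 1024**3
BYTES_PER_CUTOUT = 350 * 1024  # hypothetical average size of one cutout

catalog = pd.read_csv('hsc_catalog.csv')  # hypothetical catalog dump
n_keep = min(len(catalog), MAX_BYTES // BYTES_PER_CUTOUT)
subsample = catalog.sample(n=n_keep, random_state=42)  # reproducible random cut
subsample.to_csv('hsc_catalog_subsampled.csv', index=False)
```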
Concerning the sample, one thing that I did a couple of years ago was to extract postage stamps corresponding to galaxies with specz redshifts in the HSC data, you can use that to train a flux+morph photo-z estimator ;-)
@mhuertascompany thoughts on interesting sample selection criteria ?
It has taken a while, but preliminary code is finally there on master that can be used to generate a multiband HSC dataset. The way it works is simple: you provide a `.sql` file, then the code downloads that catalog, queries all the corresponding cutouts, and aggregates them into a TFRecord dataset for super easy use.
Here is the command line to generate a test HSC dataset:
```
$ g2g-datagen --problem=img2img_hsc --data_dir=data/hsc_problem --tmp_dir=data/temp/hsc_problem
```
Once this is done, you can use the dataset from TensorFlow super easily, like this:
```python
import tensorflow as tf
from galaxy2galaxy import problems

Modes = tf.estimator.ModeKeys  # tensor2tensor-style shorthand for train/eval/predict modes

hsc = problems.problem('img2img_hsc')
# This creates a tf.data.Dataset that you can directly use to feed data to a network
dset = hsc.dataset(Modes.TRAIN, data_dir='data/hsc_problem')
```
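Continuing from the snippet above, a quick sanity check might look like the following sketch; it assumes TF 1.x graph mode, and the `inputs` feature key follows the usual tensor2tensor img2img convention (see the linked notebook for the real usage):

```python
# Pull a single example from the tf.data.Dataset in a TF 1.x session.
it = dset.make_one_shot_iterator()
features = it.get_next()
with tf.Session() as sess:
    ex = sess.run(features)
    # 'inputs' should hold a multiband postage stamp; the exact shape is set
    # by the problem definition (one channel per HSC band).
    print(ex['inputs'].shape)
```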
Check out an example of using this dataset here: https://github.com/ml4astro/galaxy2galaxy/blob/%234/notebooks/AccessingHSCProblems.ipynb
Now, the question of interest here is: how do you define an HSC problem? Here is an example: https://github.com/ml4astro/galaxy2galaxy/blob/517b2d1410c427fe1e1f32c010ab0f400211677c/galaxy2galaxy/data_generators/hsc.py#L101
Essentially, to create a new problem based on a different sample, you just need to add the corresponding SQL file in galaxy2galaxy/data_generators/hsc_utils and add a subclass of `HSCProblem` in hsc.py, with `p.sql_file` set to the corresponding SQL file name.
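For concreteness, a new problem might look something like the sketch below. The class name and SQL file name are hypothetical, and the exact hook where `p.sql_file` gets set should be checked against the existing subclasses in hsc.py; the registry decorator follows the tensor2tensor convention that galaxy2galaxy builds on:

```python
from tensor2tensor.utils import registry  # galaxy2galaxy builds on tensor2tensor

@registry.register_problem
class Img2imgHscMySample(HSCProblem):  # hypothetical subclass, added inside hsc.py
    """HSC cutouts selected by a custom SQL query."""

    def hparams(self, defaults, model_hparams):
        super().hparams(defaults, model_hparams)
        p = defaults
        # Points to a new query file in galaxy2galaxy/data_generators/hsc_utils/
        p.sql_file = 'my_sample.sql'  # hypothetical file name
```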
@kstoreyf could you have a look at this SQL query and tell me if it matches yours? https://github.com/ml4astro/galaxy2galaxy/blob/master/galaxy2galaxy/data_generators/hsc_utils/hsc_pdr2_wide_anomaly.sql If it does, I can generate the corresponding dataset to make it easily accessible, and then you can train the G2G GANs on it.
If you want to match my sample without having to do any post-processing: instead of just grabbing the interpolatedcenter and crcenter flags, cut on them by adding the following under WHERE:

```sql
AND NOT f1.g_pixelflags_interpolatedcenter
AND NOT f1.r_pixelflags_interpolatedcenter
AND NOT f1.i_pixelflags_interpolatedcenter
AND NOT f1.z_pixelflags_interpolatedcenter
AND NOT f1.y_pixelflags_interpolatedcenter
AND NOT f1.g_pixelflags_crcenter
AND NOT f1.r_pixelflags_crcenter
AND NOT f1.i_pixelflags_crcenter
AND NOT f1.z_pixelflags_crcenter
AND NOT f1.y_pixelflags_crcenter
```
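(For anyone who already downloaded a catalog without these cuts, an equivalent post-hoc filter might look like the sketch below, assuming the catalog is loaded as a pandas DataFrame with the column names used in the query; the file name is a placeholder:)

```python
import pandas as pd

cat = pd.read_csv('hsc_catalog.csv')  # hypothetical catalog dump
bands = ['g', 'r', 'i', 'z', 'y']
flag_cols = ([f'{b}_pixelflags_interpolatedcenter' for b in bands] +
             [f'{b}_pixelflags_crcenter' for b in bands])
# Keep only objects where none of the interpolated/cosmic-ray center flags fire.
clean = cat[~cat[flag_cols].any(axis=1)]
```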
Also, I did a second query and merged them to get more metadata; it would be good if you just add these to the main query under SELECT:
```sql
m.g_blendedness_abs_flux, m.g_blendedness_flag,
m.r_blendedness_abs_flux, m.r_blendedness_flag,
m.i_blendedness_abs_flux, m.i_blendedness_flag,
m.z_blendedness_abs_flux, m.z_blendedness_flag,
m.y_blendedness_abs_flux, m.y_blendedness_flag,
-- Shape of the CModel model
m.i_cmodel_exp_ellipse_11, m.i_cmodel_exp_ellipse_22, m.i_cmodel_exp_ellipse_12,
m.i_cmodel_dev_ellipse_11, m.i_cmodel_dev_ellipse_22, m.i_cmodel_dev_ellipse_12,
m.i_cmodel_ellipse_11, m.i_cmodel_ellipse_22, m.i_cmodel_ellipse_12,
m.r_cmodel_exp_ellipse_11, m.r_cmodel_exp_ellipse_22, m.r_cmodel_exp_ellipse_12,
m.r_cmodel_dev_ellipse_11, m.r_cmodel_dev_ellipse_22, m.r_cmodel_dev_ellipse_12,
m.r_cmodel_ellipse_11, m.r_cmodel_ellipse_22, m.r_cmodel_ellipse_12
```

and the following under FROM:

```sql
LEFT JOIN pdr2_wide.meas AS m USING (object_id)
```
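(If you do end up with two separate catalogs instead, the merge described above corresponds to something like this sketch, assuming both are pandas DataFrames sharing an object_id column; the file names and column subset are just examples:)

```python
import pandas as pd

cat = pd.read_csv('hsc_catalog.csv')        # hypothetical main catalog
meas = pd.read_csv('hsc_meas_catalog.csv')  # hypothetical pdr2_wide.meas dump
meta_cols = ['object_id',
             'i_blendedness_abs_flux', 'i_blendedness_flag',
             'i_cmodel_ellipse_11', 'i_cmodel_ellipse_22', 'i_cmodel_ellipse_12']
# Same semantics as the SQL LEFT JOIN ... USING (object_id) above.
merged = cat.merge(meas[meta_cols], on='object_id', how='left')
```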
...After writing all that, I decided I should add a complete query anyway. Here it is; you can just copy this: https://github.com/kstoreyf/anomalies-GAN-HSC/blob/master/prepdata/catalog_clean.sql
Hehe, perfect, thanks so much!
I'm opening this PR to hopefully merge into G2G a version of the dataset that Kate has been working with (https://github.com/kstoreyf/anomalies-GAN-HSC). But I also understand that Connor is using HSC data as well, and maybe also Dezso.
What we need is:
We can use this thread to discuss this question.