EiffL opened 5 years ago
I'm going to transcribe here some of the feedback provided by @kstoreyf:
- It would be more interesting to have a brighter sample, but a broader one might be more widely applicable (and larger).
- The data volume will get big quickly; we might think about also having a smaller set, the way people can download MNIST- and CIFAR-sized datasets on their laptops.
- We probably couldn't have "labels" for everything, but could link to a catalog with all of the metadata.
- If we make a cut like I did, we will end up with a lot of junk, so we need to think about having a messy sample vs. a cleaner, possibly biased sample with more aggressive cuts; even with those, it would be hard to eliminate all junk.
- Finally, depending on the sample we decide on, it could take a while to download the data via the public streams; I wonder if HSC would help us out (just to get the public data faster).
To address the question of data volume: regenerating all the data from the HSC cutout server is of course a slow process, but we can aim for a dataset of a given maximum size, maybe ~20-30 GB at most, which we can host on Zenodo and which is still reasonable to download. To stay within this constraint, we can simply subsample the catalog at random.
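As a rough illustration, a size-capped random subsample could be drawn like the minimal sketch below; the file name and the per-cutout size are made-up placeholders, not anything from G2G:

```python
import pandas as pd

# Illustrative numbers only: cap the dataset at ~25 GB given an assumed
# average on-disk size per multiband cutout.
MAX_BYTES = 25 * 1024**3
BYTES_PER_CUTOUT = 350 * 1024  # hypothetical average size of one cutout

catalog = pd.read_csv('hsc_catalog.csv')  # hypothetical catalog dump
n_keep = min(len(catalog), MAX_BYTES // BYTES_PER_CUTOUT)
subsample = catalog.sample(n=n_keep, random_state=42)  # reproducible random cut
subsample.to_csv('hsc_catalog_subsampled.csv', index=False)
```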
Concerning the sample, one thing that I did a couple of years ago was to extract postage stamps corresponding to galaxies with specz redshifts in the HSC data, you can use that to train a flux+morph photo-z estimator ;-)
@mhuertascompany thoughts on interesting sample selection criteria ?
It has taken a while, but preliminary code is finally there on master that can be used to generate a multiband HSC dataset. The way it works is simple: you provide a `.sql` file, then the code downloads that catalog, queries all the corresponding cutouts, and aggregates them into a TFRecord dataset for super easy use.
Here is the command line to generate a test HSC dataset:
```
$ g2g-datagen --problem=img2img_hsc --data_dir=data/hsc_problem --tmp_dir=data/temp/hsc_problem
```
Once this is done, you can use the dataset from TensorFlow super easily, like this:
```python
import tensorflow as tf
from galaxy2galaxy import problems

Modes = tf.estimator.ModeKeys  # tensor2tensor-style shorthand for train/eval/predict modes

hsc = problems.problem('img2img_hsc')
# This creates a tf.data.Dataset that you can directly use to feed data to a network
dset = hsc.dataset(Modes.TRAIN, data_dir='data/hsc_problem')
```
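Continuing from the snippet above, a quick sanity check might look like the following sketch; it assumes TF 1.x graph mode, and the `inputs` feature key follows the usual tensor2tensor img2img convention (see the linked notebook for the real usage):

```python
# Pull a single example from the tf.data.Dataset in a TF 1.x session.
it = dset.make_one_shot_iterator()
features = it.get_next()
with tf.Session() as sess:
    ex = sess.run(features)
    # 'inputs' should hold a multiband postage stamp; the exact shape is set
    # by the problem definition (one channel per HSC band).
    print(ex['inputs'].shape)
```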
Check out an example of using this dataset here: https://github.com/ml4astro/galaxy2galaxy/blob/%234/notebooks/AccessingHSCProblems.ipynb
Now, the question of interest here is: how do you define an HSC problem? Here is an example: https://github.com/ml4astro/galaxy2galaxy/blob/517b2d1410c427fe1e1f32c010ab0f400211677c/galaxy2galaxy/data_generators/hsc.py#L101
Essentially, to create a new problem based on a different sample, you just need to add the corresponding SQL file in galaxy2galaxy/data_generators/hsc_utils and add a subclass of `HSCProblem` in hsc.py, with `p.sql_file` set to the corresponding SQL file name.
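For concreteness, a new problem might look something like the sketch below. The class name and SQL file name are hypothetical, and the exact hook where `p.sql_file` gets set should be checked against the existing subclasses in hsc.py; the registry decorator follows the tensor2tensor convention that galaxy2galaxy builds on:

```python
from tensor2tensor.utils import registry  # galaxy2galaxy builds on tensor2tensor

@registry.register_problem
class Img2imgHscMySample(HSCProblem):  # hypothetical subclass, added inside hsc.py
    """HSC cutouts selected by a custom SQL query."""

    def hparams(self, defaults, model_hparams):
        super().hparams(defaults, model_hparams)
        p = defaults
        # Points to a new query file in galaxy2galaxy/data_generators/hsc_utils/
        p.sql_file = 'my_sample.sql'  # hypothetical file name
```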
@kstoreyf could you have a look at this SQL query and tell me if it matches yours? https://github.com/ml4astro/galaxy2galaxy/blob/master/galaxy2galaxy/data_generators/hsc_utils/hsc_pdr2_wide_anomaly.sql If it does, I can generate the corresponding dataset to make it easily accessible, and then you can train the G2G GANs on it.
If you want to match my sample without having to do any post-processing: instead of just grabbing the interpolatedcenter and crcenter flags, cut on them by adding the following under WHERE:

```sql
AND NOT f1.g_pixelflags_interpolatedcenter
AND NOT f1.r_pixelflags_interpolatedcenter
AND NOT f1.i_pixelflags_interpolatedcenter
AND NOT f1.z_pixelflags_interpolatedcenter
AND NOT f1.y_pixelflags_interpolatedcenter
AND NOT f1.g_pixelflags_crcenter
AND NOT f1.r_pixelflags_crcenter
AND NOT f1.i_pixelflags_crcenter
AND NOT f1.z_pixelflags_crcenter
AND NOT f1.y_pixelflags_crcenter
```
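(For anyone who already downloaded a catalog without these cuts, an equivalent post-hoc filter might look like the sketch below, assuming the catalog is loaded as a pandas DataFrame with the column names used in the query; the file name is a placeholder:)

```python
import pandas as pd

cat = pd.read_csv('hsc_catalog.csv')  # hypothetical catalog dump
bands = ['g', 'r', 'i', 'z', 'y']
flag_cols = ([f'{b}_pixelflags_interpolatedcenter' for b in bands] +
             [f'{b}_pixelflags_crcenter' for b in bands])
# Keep only objects where none of the interpolated/cosmic-ray center flags fire.
clean = cat[~cat[flag_cols].any(axis=1)]
```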
Also, I did a second query and merged them to get more metadata; it would be good if you just add these to the main query under SELECT:
```sql
m.g_blendedness_abs_flux, m.g_blendedness_flag,
m.r_blendedness_abs_flux, m.r_blendedness_flag,
m.i_blendedness_abs_flux, m.i_blendedness_flag,
m.z_blendedness_abs_flux, m.z_blendedness_flag,
m.y_blendedness_abs_flux, m.y_blendedness_flag,
-- Shape of the CModel model
m.i_cmodel_exp_ellipse_11, m.i_cmodel_exp_ellipse_22, m.i_cmodel_exp_ellipse_12,
m.i_cmodel_dev_ellipse_11, m.i_cmodel_dev_ellipse_22, m.i_cmodel_dev_ellipse_12,
m.i_cmodel_ellipse_11, m.i_cmodel_ellipse_22, m.i_cmodel_ellipse_12,
m.r_cmodel_exp_ellipse_11, m.r_cmodel_exp_ellipse_22, m.r_cmodel_exp_ellipse_12,
m.r_cmodel_dev_ellipse_11, m.r_cmodel_dev_ellipse_22, m.r_cmodel_dev_ellipse_12,
m.r_cmodel_ellipse_11, m.r_cmodel_ellipse_22, m.r_cmodel_ellipse_12
```

and the following under FROM:

```sql
LEFT JOIN pdr2_wide.meas AS m USING (object_id)
```
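(If you do end up with two separate catalogs instead, the merge described above corresponds to something like this sketch, assuming both are pandas DataFrames sharing an object_id column; the file names and column subset are just examples:)

```python
import pandas as pd

cat = pd.read_csv('hsc_catalog.csv')        # hypothetical main catalog
meas = pd.read_csv('hsc_meas_catalog.csv')  # hypothetical pdr2_wide.meas dump
meta_cols = ['object_id',
             'i_blendedness_abs_flux', 'i_blendedness_flag',
             'i_cmodel_ellipse_11', 'i_cmodel_ellipse_22', 'i_cmodel_ellipse_12']
# Same semantics as the SQL LEFT JOIN ... USING (object_id) above.
merged = cat.merge(meas[meta_cols], on='object_id', how='left')
```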
...After writing all that, I decided I should add a complete query anyway. Here it is; you can just copy this: https://github.com/kstoreyf/anomalies-GAN-HSC/blob/master/prepdata/catalog_clean.sql
Hehe, perfect, thanks so much!
I'm opening this PR to hopefully merge into G2G a version of the dataset that Kate has been working with (https://github.com/kstoreyf/anomalies-GAN-HSC). But I also understand that Connor is using HSC data as well, and maybe also Dezso.
What we need is:
We can use this thread to discuss this question.