jeff-regier / Celeste.jl

Scalable inference for a generative model of astronomical images
MIT License

fetch subsets of SDSS #82

Closed jeff-regier closed 8 years ago

jeff-regier commented 9 years ago

To run Celeste.jl on the whole SDSS dataset, we need a function that each task can call to fetch just the relevant parts of the dataset. As input, this function takes a task id and, I think, the total number of tasks. It returns:

  1. a subset of the astronomical objects (stars and galaxies) in existing catalogs for this task to optimize,
  2. all the "tiles" (a.k.a. "super-pixels": subimages that are 10 pixels wide and 10 pixels high) that are "near" any astronomical objects in the subset, and
  3. any additional astronomical objects that are near any of these tiles. (The task isn't responsible for optimizing these astronomical objects, but it may use them to help explain the data, i.e. the tiles.)
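As a rough illustration of the three return values listed above, here is a minimal Python sketch. Everything in it is an assumption made for the example, not Celeste.jl's API: the names `CatalogObject`, `tiles_near`, and `fetch_task_data` are hypothetical, objects are placed in pixel coordinates of a single image, tasks split the catalog round-robin, and "near" means "within a square of half-width `radius` pixels".

```python
from dataclasses import dataclass

TILE_SIZE = 10  # tiles are 10 pixels wide and 10 pixels high, as described above


@dataclass(frozen=True)
class CatalogObject:
    obj_id: int
    x: float  # pixel column in a single image (simplifying assumption)
    y: float  # pixel row in a single image (simplifying assumption)


def tiles_near(obj, radius):
    """Return the (tx, ty) indices of every tile that overlaps the square
    of half-width `radius` pixels centered on `obj`."""
    x_lo = int((obj.x - radius) // TILE_SIZE)
    x_hi = int((obj.x + radius) // TILE_SIZE)
    y_lo = int((obj.y - radius) // TILE_SIZE)
    y_hi = int((obj.y + radius) // TILE_SIZE)
    return {(tx, ty)
            for tx in range(x_lo, x_hi + 1)
            for ty in range(y_lo, y_hi + 1)}


def fetch_task_data(catalog, task_id, num_tasks, radius=5.0):
    """Hypothetical fetch function: returns (primary, tiles, neighbors)."""
    # 1. Objects this task is responsible for optimizing.
    #    A round-robin split is an arbitrary choice for this sketch.
    primary = [o for i, o in enumerate(catalog) if i % num_tasks == task_id]

    # 2. Every tile that is near any primary object.
    tiles = set()
    for o in primary:
        tiles |= tiles_near(o, radius)

    # 3. Other objects near any of those tiles. The task does not optimize
    #    them, but they may help explain the pixels in the tiles.
    primary_ids = {o.obj_id for o in primary}
    neighbors = [o for o in catalog
                 if o.obj_id not in primary_ids and tiles_near(o, radius) & tiles]
    return primary, tiles, neighbors
```

A real version would work in sky coordinates across many images and would likely split the catalog spatially rather than round-robin, so that each task's tiles overlap as little as possible with other tasks'.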

rgiordan commented 9 years ago

This is a pretty good description.

Concretely, we need to transform the FITS files into Celeste-readable data structures. I think it makes sense to do this as a pre-processing step since it's a little time-intensive and we should avoid doing it redundantly.

Specifically, before optimizing the model, we currently need to run bin/preprocess_image.jl on FITS files that have been downloaded for a given (run, camcol, field) tuple with the script bin/download_fits_files.py. Right now I save the output in a JLD file on disk, but it would be better to have it in some kind of database.

Note that some objects (perhaps about 10% of them?) will require looking at multiple (run, camcol, field) tuples for optimization, so the mapping from objects to (run, camcol, field) tuples is not necessarily many-to-one. Constructing this mapping will also have to be part of the pre-processing step. That shouldn't be too hard, but I'm still working out how to do it.
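One way such a mapping could be built is sketched below in Python. This is only an illustration under stated assumptions: each field is approximated by an axis-aligned box in (RA, Dec) degrees, RA wraparound and the cos(Dec) scaling are ignored, and the function name `build_object_to_fields` and its input layout are hypothetical rather than anything in the existing pre-processing code.

```python
from collections import defaultdict


def build_object_to_fields(objects, fields, margin_deg):
    """Map each object to every (run, camcol, field) tuple it may need.

    objects: {obj_id: (ra, dec)} in degrees
    fields:  {(run, camcol, field): (ra_min, ra_max, dec_min, dec_max)} in degrees
    margin_deg: how far outside a field's box an object may sit and still
                need that field (hypothetical parameter)
    """
    mapping = defaultdict(list)
    for obj_id, (ra, dec) in objects.items():
        for key, (ra_min, ra_max, dec_min, dec_max) in fields.items():
            in_ra = ra_min - margin_deg <= ra <= ra_max + margin_deg
            in_dec = dec_min - margin_deg <= dec <= dec_max + margin_deg
            if in_ra and in_dec:
                # An object close to a field boundary lands in several lists,
                # which is why the mapping is not many-to-one.
                mapping[obj_id].append(key)
    return dict(mapping)
```

The double loop is quadratic, which is fine for a sketch but would need a spatial index or a sort by coordinate at full-survey scale.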

rgiordan commented 9 years ago

I'll also add that right now the whole Polygons.jl stuff is probably an unnecessary complication. I'll assign myself an issue to get rid of it and do the same thing more simply.

jeff-regier commented 9 years ago

On second thought, maybe the function should return a collection of sub-images rather than "tiles", since tiles are kind of a Celeste abstraction. This function should be general.

It seems like SciDB and MongoDB are the leading candidates for a place to store the data. Both support spatial indexing, and I think we want that functionality.
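To make concrete what spatial indexing buys us, here is a small in-memory stand-in written in Python. It is not SciDB's or MongoDB's API. It is a plain grid index, with the hypothetical class name `GridIndex`, that buckets records by (RA, Dec) cell so a box query only touches nearby cells instead of scanning everything. RA wraparound at 360 degrees is ignored.

```python
import math
from collections import defaultdict


class GridIndex:
    """Toy spatial index: records are bucketed into square (RA, Dec) cells."""

    def __init__(self, cell_deg):
        self.cell_deg = cell_deg
        self.cells = defaultdict(list)

    def _cell(self, ra, dec):
        # Integer cell coordinates containing the point (ra, dec).
        return (math.floor(ra / self.cell_deg), math.floor(dec / self.cell_deg))

    def insert(self, ra, dec, payload):
        self.cells[self._cell(ra, dec)].append((ra, dec, payload))

    def query_box(self, ra_min, ra_max, dec_min, dec_max):
        """Return payloads of all records inside the closed box."""
        cx_lo, cy_lo = self._cell(ra_min, dec_min)
        cx_hi, cy_hi = self._cell(ra_max, dec_max)
        hits = []
        for cx in range(cx_lo, cx_hi + 1):
            for cy in range(cy_lo, cy_hi + 1):
                # .get avoids creating empty buckets for cells never inserted into.
                for ra, dec, payload in self.cells.get((cx, cy), ()):
                    if ra_min <= ra <= ra_max and dec_min <= dec <= dec_max:
                        hits.append(payload)
        return hits
```

The payload could be a catalog entry or a (run, camcol, field) key pointing at pre-processed image data. A database with a built-in spatial index would replace this class entirely.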

rgiordan commented 9 years ago

Cool, let's plan to look into those database options. You definitely need more than the Celeste ImageTile object. The preprocess_image.jl file is probably the best reference for exactly what you do need. Indexing the DB by (RA, DEC) would be a good idea, though indexing by (RUN, CAMCOL, FIELD) will also do, since we have the modestly sized window_flist.fits, which maps between the two.