Closed jeff-regier closed 8 years ago
This is a pretty good description.
Concretely, we need to transform the FITS files into Celeste-readable data structures. I think it makes sense to do this as a pre-processing step since it's a little time-intensive and we should avoid doing it redundantly.
Concretely, before optimizing the model, we currently need to run bin/preprocess_image.jl on FITS files that have been downloaded for a given (run, field, camcol) tuple with the script bin/download_fits_files.py. I now save the output in a JLD file on disk, but it would be better to have it in some kind of database.
Note that some objects (about 10% of them perhaps?) will require looking at multiple (run, camcol, field) tuples for optimization, so there is not necessarily a many-to-one mapping from objects to (run, camcol, field) tuples. Constructing this mapping will also have to be part of the pre-processing step. That shouldn't be too hard, but I'm still working on how to do that.
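To make the mapping idea concrete, here is a rough sketch of how an object-to-fields mapping could be built. It is not what Celeste does today. It assumes each field can be described by a rectangular RA/Dec bounding box and each object by a position plus a radius, it ignores RA wraparound at 360 degrees, and the names Field, SkyObject, fields_for_object, and build_object_field_map are made up for illustration.

```python
from collections import namedtuple

# Hypothetical record types; the attribute names are assumptions for this sketch.
Field = namedtuple("Field", "run camcol field ra_min ra_max dec_min dec_max")
SkyObject = namedtuple("SkyObject", "obj_id ra dec radius")


def fields_for_object(obj, fields):
    """Return the (run, camcol, field) tuples whose bounding box overlaps
    the square of half-width obj.radius centered on the object."""
    hits = []
    for f in fields:
        overlaps_ra = obj.ra + obj.radius >= f.ra_min and obj.ra - obj.radius <= f.ra_max
        overlaps_dec = obj.dec + obj.radius >= f.dec_min and obj.dec - obj.radius <= f.dec_max
        if overlaps_ra and overlaps_dec:
            hits.append((f.run, f.camcol, f.field))
    return hits


def build_object_field_map(objects, fields):
    """Map each object id to every field tuple it touches (one or more)."""
    return {obj.obj_id: fields_for_object(obj, fields) for obj in objects}


# Made-up values, not real SDSS runs or coordinates.
fields = [
    Field(1000, 1, 10, 0.0, 0.2, 0.0, 0.2),
    Field(1000, 1, 11, 0.2, 0.4, 0.0, 0.2),
]
objects = [
    SkyObject("a", 0.10, 0.10, 0.01),   # well inside the first field
    SkyObject("b", 0.195, 0.10, 0.01),  # straddles the shared edge
]
mapping = build_object_field_map(objects, fields)
```

Object "b" ends up mapped to both fields, which is the case that breaks the many-to-one assumption. The linear scan over all fields is only for clarity; a real pre-processing step would presumably use some spatial index.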
I'll also add that right now the whole Polygons.jl stuff is probably an unnecessary complication. I'll assign myself an issue to get rid of it and do it more simply.
On second thought, maybe the function should return a collection of sub-images rather than "tiles"---tiles are kind of a Celeste abstraction. This function should be general.
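As a sketch of what a general version might look like, the snippet below has the function take a task id and a task count and return plain sub-image records rather than Celeste tiles. Everything here is hypothetical: the SubImage record, the 0-based task ids, the round-robin assignment of (run, camcol, field) keys, and the load_pixels callback that stands in for whatever storage backend is chosen.

```python
from dataclasses import dataclass


@dataclass
class SubImage:
    """Generic sub-image record; deliberately not tied to any Celeste type."""
    key: tuple    # (run, camcol, field)
    pixels: list  # 2D list of pixel values, row-major


def keys_for_task(task_id, num_tasks, all_keys):
    """Assign keys round-robin: task t gets sorted keys t, t + num_tasks, ..."""
    if not 0 <= task_id < num_tasks:
        raise ValueError("task_id must satisfy 0 <= task_id < num_tasks")
    return sorted(all_keys)[task_id::num_tasks]


def fetch_subimages(task_id, num_tasks, all_keys, load_pixels):
    """Load only the sub-images assigned to this task."""
    return [
        SubImage(key, load_pixels(key))
        for key in keys_for_task(task_id, num_tasks, all_keys)
    ]


def fake_loader(key):
    """Stand-in for a real reader; returns a 2x2 image filled with the field number."""
    run, camcol, field = key
    return [[field, field], [field, field]]


# Made-up keys, not real SDSS runs.
all_keys = [(1000, 1, f) for f in range(10, 15)]
subimages = fetch_subimages(1, 2, all_keys, fake_loader)
```

Round-robin assignment ignores spatial locality, so neighboring fields land on different tasks. That is fine for a sketch but probably not what we want once objects span multiple fields.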
Seems like SciDB and MongoDB are the leading candidates for a place to store the data. Both support spatial indexing. I think we want that functionality.
Cool, let's plan to look into those database options. You definitely need more than the Celeste ImageTile object. The preprocess_image.jl file is probably the best reference for exactly what you do need. Indexing the DB by (RA, DEC) would be a good idea, though indexing by (RUN, CAMCOL, FIELD) will also do, since we have the modestly sized window_flist.fits which maps between these two.
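For illustration, here is one way the translation from (RA, DEC) to (RUN, CAMCOL, FIELD) could work once the field table is in memory. This is only a sketch: the rows below are made up rather than read from window_flist.fits, the bounding-box columns are an assumption about what such a table could provide, and the coarse grid is a simple stand-in for the spatial index a real database would supply.

```python
import math
from collections import defaultdict

CELL_DEG = 0.5  # grid cell size in degrees; an arbitrary choice for this sketch


def cell_of(ra, dec):
    """Grid cell containing the point (ra, dec)."""
    return (math.floor(ra / CELL_DEG), math.floor(dec / CELL_DEG))


def build_grid_index(field_rows):
    """field_rows: iterable of
    (run, camcol, field, ra_min, ra_max, dec_min, dec_max) tuples."""
    index = defaultdict(list)
    for row in field_rows:
        run, camcol, field, ra_min, ra_max, dec_min, dec_max = row
        i_lo, j_lo = cell_of(ra_min, dec_min)
        i_hi, j_hi = cell_of(ra_max, dec_max)
        # Register the field in every grid cell its bounding box touches.
        for i in range(i_lo, i_hi + 1):
            for j in range(j_lo, j_hi + 1):
                index[(i, j)].append(row)
    return index


def fields_at(index, ra, dec):
    """Return the (run, camcol, field) tuples whose box contains (ra, dec)."""
    return [
        (run, camcol, field)
        for run, camcol, field, ra_min, ra_max, dec_min, dec_max
        in index.get(cell_of(ra, dec), [])
        if ra_min <= ra <= ra_max and dec_min <= dec <= dec_max
    ]


# Made-up rows standing in for the real field list.
rows = [
    (1000, 1, 10, 10.0, 10.3, 0.0, 0.2),
    (1000, 1, 11, 10.3, 10.6, 0.0, 0.2),
    (2000, 3, 50, 10.2, 10.5, 0.1, 0.3),
]
index = build_grid_index(rows)
hits = fields_at(index, 10.25, 0.15)
```

With a lookup like this, the database itself could stay keyed by (RUN, CAMCOL, FIELD), and a query by sky position would first resolve to field keys and then fetch by key.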
To run Celeste.jl on the whole SDSS dataset, we need a function that each task can call to fetch just the relevant parts of the dataset. As input, this function takes a task id and I guess the total number of tasks. It returns