Closed: gijzelaerr closed this 8 years ago
Mostly looks good to me - my comments above are pretty much all minor stylistic things. Nice job tidying up the logic in main.py.
However, I do have a pretty big concern - in batch (regular, non-AARTFAAC) mode, get_accessors loads every single image file into RAM before continuing. That seems like a recipe for disaster! If each image is 100 MB, then just 20 images will max out a low-end laptop - fewer if we end up copying the arrays around during sourcefinding etc. Thinking about it, that's why we passed around 'urls' before: we do a full scan of the data at the persistence stage, and then we don't need the images again until sourcefinding. On the other hand, it would be nice to find a middle path that works for both AARTFAAC image sets and large offline batch sets.
How about this: you could make the batch version of get_accessors a generator? It would look like a list, but would only actually fetch one image at a time. Of course you'd need to refresh it each time you cycle through the images, but I think that gives you the best chance of using a single set of syntax for both use cases.
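Very roughly, something like this (just a sketch; `open_accessor` is a stand-in for whatever the real per-file factory is actually called):

```python
def open_accessor(path):
    # Stand-in for the real accessor factory; the idea is that it reads
    # lazily rather than pulling the pixel data into RAM up front.
    ...

def get_accessors(paths):
    """Yield accessors one at a time instead of loading every image at once."""
    for path in paths:
        yield open_accessor(path)

# The caller has to re-create the generator for every pass over the data:
for accessor in get_accessors(paths):
    ...
```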
Hm, good point about the memory issue. This requires some smarter thinking. The problem is that we need to open all files to get the timestamp, so that we know in what order to process them. Maybe we should process the data in chunks per timestep: first scan all files and extract the metadata, then decide in which order to process them, then split them up into chunks per timestep and process those. Going to look at this on Monday.
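Something like this, maybe (sketch only; `extract_metadata` is a placeholder for a lightweight header read that doesn't keep the pixel data in memory, and the `"timestamp"` key is assumed):

```python
from collections import defaultdict

def extract_metadata(path):
    # Placeholder: read only the header (timestamp etc.), not the pixels.
    ...

def chunks_per_timestep(paths):
    """Scan all files first, then yield (timestamp, paths) chunks in order."""
    by_timestep = defaultdict(list)
    for path in paths:
        metadata = extract_metadata(path)
        by_timestep[metadata["timestamp"]].append(path)
    for timestamp in sorted(by_timestep):
        yield timestamp, by_timestep[timestamp]
```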
Sounds good - metadata extraction and the quality check should probably also be combined into one function per image; that would reduce the number of file accesses.
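E.g. something like this, reusing the hypothetical names from the sketches above (`quality_check` is also just an assumed name):

```python
def scan_image(path):
    """Single pass per image: open once, pull the metadata and run the
    quality check in the same go, so the file is only touched once."""
    accessor = open_accessor(path)
    metadata = extract_metadata(accessor)
    accepted, reason = quality_check(accessor, metadata)
    return metadata, accepted, reason
```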
Yes, but that is another issue. I think we should make the whole main logic a bit smarter; there is room for more optimization, but the current home-made implementation of mappers and runners is a bit dodgy (yes, I made it) and doesn't make it easy to see the optimization points.
OK, I've rewritten this a bit so that not all images are opened in one go. It doesn't combine the quality-control step with the start-time extraction though; that would require many more changes, since to be able to reject an image we need a database entry, and we only have that later on. We might want to do that in the future, but for now this doesn't degrade performance compared to how it was. Also, I don't use generators (yet), but I might change that when I integrate the eventual streaming mode (working on that now).
I think this is good for reevaluation!
I think it is good to have this one already up for review, otherwise it gets too big.
This PR consists of:
- replacing pywcs with astropy.wcs