radio-astro-tools / spectral-cube

Library for reading and analyzing astrophysical spectral data cubes
http://spectral-cube.rtfd.org
BSD 3-Clause "New" or "Revised" License

Dask reproject #845

Open keflavich opened 2 years ago

keflavich commented 2 years ago

WIP: trying to get dask reprojection working.

keflavich commented 2 years ago

This is mostly just copy-pasted at a barely-past-the-idea stage. @astrofrog, any chance you (or the aperio folks) could find some time to work on this?

Partly, I'm wondering: should we push the "don't put everything in memory" part into reproject, or leave it here in spectral-cube?

astrofrog commented 2 years ago

FYI I am currently working on dask support in reproject - more soon!

keflavich commented 1 year ago

@astrofrog can you help me out here - do we need this in spectral-cube or can spectral-cube just call reproject and not worry about it?

astrofrog commented 1 year ago

@keflavich - right now I've added the ability in reproject to use dask behind the scenes to operate over data chunks, but I haven't yet made it possible to pass in/get out dask arrays. It shouldn't be too difficult to do that though, so if you are willing to wait a little longer, then it will be possible to just pass the dask array to reproject and get a dask array back with the reprojected cube. I can try and prioritize that for the next couple of weeks.

astrofrog commented 1 year ago

Just curious, would you actually want the reprojection to be lazily done? That is, creating the new cube would be 'instantaneous' and reprojection would only happen as parts of the cube are accessed. Or would you want the reprojection to actually be computed in full?
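The lazy-vs-eager distinction being asked about can be sketched with plain dask (a toy stand-in for per-chunk reprojection, not the actual reproject code):

```python
import dask.array as da

# Toy cube: 100 spectral channels of 50x50 spatial pixels, chunked by channel.
cube = da.ones((100, 50, 50), chunks=(10, 50, 50))

# Stand-in for a per-chunk reprojection step: building this graph is
# essentially instantaneous; no actual computation happens here.
reprojected = cube.map_blocks(lambda block: block * 2.0)

# Work only runs when a piece of the cube is accessed; here only the
# chunk containing channel 0 is computed.
channel0 = reprojected[0].compute()
```

With the lazy approach, creating the output cube costs almost nothing and the reprojection cost is paid incrementally on access; computing in full pays it all up front.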

keflavich commented 1 year ago

I'm not entirely sure yet - I'm working this out as I go.

The active use case is:

https://github.com/radio-astro-tools/spectral-cube/pull/868 is my attempt at making this at least partially possible.

Ideally, under the hood, we'd reproject to the minimal overlap region... where should that step be done?

astrofrog commented 1 year ago

@keflavich - what do you mean by minimal overlap region? The mosaic WCS, or the subset of the mosaic WCS that overlaps with each cube?

keflavich commented 1 year ago

I mean the corner-finding that is done in coadd.py.

I'm refactoring to use coadd.py with some additions: https://github.com/astropy/reproject/pull/351

astrofrog commented 1 year ago

> we have cubes covering this whole zone with >1000 pixels/cube, so totaling ~1 TB/cube

Do you mean they have >1000 pixels on each side? Do you really mean 1 TB/cube or 1 GB/cube?

keflavich commented 1 year ago

output cube target size is 27800 x 10200 x 1000. Probably will limit 3rd dimension to ~200 pixels in some cases.
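For scale, at 4 bytes per float32 voxel that target grid works out to roughly a terabyte:

```python
# Back-of-envelope size of the target cube (float32, 4 bytes/voxel).
nx, ny, nchan = 27800, 10200, 1000
size_bytes = nx * ny * nchan * 4
size_tib = size_bytes / 1024**4
print(f"{size_tib:.2f} TiB")  # full 1000-channel cube: ~1.03 TiB
# Limiting the 3rd dimension to ~200 channels cuts this to ~0.21 TiB.
```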

keflavich commented 1 year ago

I revised the original definition of the cube sizes above in reply to https://github.com/radio-astro-tools/spectral-cube/pull/845#issuecomment-1467030626

astrofrog commented 1 year ago

@keflavich ok thanks! I have started work on supporting dask inputs to reproject functions, will open a draft PR soon.

astrofrog commented 1 year ago

However, feel free to also try on your side if you want, I will keep an eye on your PR :)

keflavich commented 1 year ago

I think we need both of these together, no? I'm adding 3D capability to reproject_and_coadd, you're making it all daskable, right?

keflavich commented 1 year ago

Ah, no - you're talking about _this_ PR. Yeah, I'm not going to work on this while you are - I'm focused on the other support infrastructure for now.

astrofrog commented 1 year ago

I did mean the reproject PR not this one - yes our work should end up being complementary! (I will just focus on the individual reproject functions not coadd)

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage is 4.54% of modified lines.

:exclamation: Current head 672245e differs from pull request most recent head df32208. Consider uploading reports for the commit df32208 to get more accurate results

| Files Changed | Coverage |
| --- | --- |
| spectral_cube/dask_spectral_cube.py | 4.54% |


keflavich commented 1 year ago

@astrofrog I'm trying to test some of the upstream dask changes, but I need this PR as a frontend to those. I've run into a problem - the reproject_interp wrapper:

        def reproject_interp_wrapper(img_slice, **kwargs):
            # What exactly is the wrapper getting here?
            # I think it is given a _cube_ that is a cutout?
            # No, it is getting dask arrays (at least sometimes)
            if filled:
                data = img_slice.filled_data[:]
            else:
                data = img_slice._data
            return reproject_interp((data, img_slice.header),

is wrong - img_slice is (in my implementation above) assumed to be a cube subset. But instead it is an array.

So I think this PR is taking the wrong general approach - instead, perhaps we should just call reproject.reproject_interp directly rather than going through apply_parallel_spatial?

astrofrog commented 1 year ago

Yes, I think the intermediate wrappers predate some of the reproject improvements. We should be able to just call reproject with the full dask array and the target WCS.

astrofrog commented 1 year ago

@keflavich - I've rebased and pushed an additional commit that I think makes the DaskSpectralCube.reproject method work the way it should: it doesn't actually perform the reprojection, and instead passes the resulting dask array on to the new DaskSpectralCube (and I think a user can then choose to dump to tmp?). However, this currently fails with the stable version of reproject due to the tmp_dir bug that we discussed before, so I'm looking into that now.

astrofrog commented 1 year ago

Ok so with https://github.com/astropy/reproject/pull/390 I can get things to run. However, this is currently not optimal for two reasons:

It might be better to be pragmatic here and actually carry out the reprojection inside the reproject function rather than delaying it, at least by default, because it gives us more control over the points above. I'll try and work on this a little more shortly.

keflavich commented 1 year ago

Yes, I agree - the default should be the one least likely to cause harm on the user's computer (i.e., unlikely to fill up their hard drive with tmpdir junk, and unlikely to just crash); exceptional cases needed for performance or other reasons should be described in the docs.