keflavich opened 2 years ago
this is mostly just copy-pasted at barely-past-the-idea stage. @astrofrog any chance you (or aperio folks) could find some time to work on this?
partly, I'm wondering whether we should push the "don't put everything in memory" part into reproject or leave it here in spectral-cube?
FYI I am currently working on dask support in reproject - more soon!
@astrofrog can you help me out here - do we need this in spectral-cube or can spectral-cube just call reproject and not worry about it?
@keflavich - right now I've added the ability in reproject to use dask behind the scenes to operate over data chunks, but I haven't yet made it possible to pass in/get out dask arrays. It shouldn't be too difficult to do that though, so if you are willing to wait a little longer, then it will be possible to just pass the dask array to reproject and get a dask array back with the reprojected cube. I can try and prioritize that for the next couple of weeks.
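To make that concrete, a rough sketch of what a dask-in/dask-out call could look like; the file names are placeholders, and the pass-through behaviour is exactly the assumption being discussed here, not a released API:

```python
# Hypothetical sketch only: assumes a future reproject release lets a dask
# array (paired with its WCS) pass straight through reproject_interp and
# come back as a lazily evaluated dask array.
import dask.array as da
from astropy.io import fits
from astropy.wcs import WCS
from reproject import reproject_interp

hdu = fits.open("input_cube.fits", memmap=True)[0]    # hypothetical input cube
cube = da.from_array(hdu.data, chunks="auto")         # wrap it as a dask array
wcs_in = WCS(hdu.header)
wcs_out = WCS(fits.getheader("target_grid.fits"))     # hypothetical target grid

# If dask passthrough lands, `reprojected` would itself be a dask array and no
# work would happen until it is sliced, computed, or written out.
# (shape_out would normally follow the target grid; kept equal here for brevity.)
reprojected, footprint = reproject_interp((cube, wcs_in), wcs_out,
                                           shape_out=cube.shape)
```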
Just curious, would you actually want the reprojection to be lazily done? That is, creating the new cube would be 'instantaneous' and reprojection would only happen as parts of the cube are accessed. Or would you want the reprojection to actually be computed in full?
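For concreteness, the distinction is the usual dask one: a lazy reprojection would only build a task graph (cheap to create, computed when data are actually pulled out), whereas a full computation materializes everything up front. A minimal illustration with plain dask, nothing reproject-specific:

```python
import dask.array as da
import numpy as np

data = da.random.random((1000, 1000, 100), chunks=(100, 100, 100))

# "Lazy": this only records the operation in the task graph; it returns
# immediately and allocates nothing of the full result.
shifted = data - data.mean(axis=0)

# Work happens only when a concrete result is requested, e.g. when slicing
# out a channel, calling .compute(), or writing to disk.
channel = np.asarray(shifted[:, :, 0])
```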
I'm not entirely sure yet - I'm working this out as I go.
The active use case is:
https://github.com/radio-astro-tools/spectral-cube/pull/868 is my attempt to make this a bit possible
ideally under the hood we'd reproject to the minimal overlap region... where should that step be done?
@keflavich - what do you mean by minimal overlap region? The mosaic WCS, or the subset of the mosaic WCS that overlaps with each cube?
I mean the corner-finding that is done in coadd.py
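For context, that corner-finding amounts to projecting each input's spatial corners into the output frame and keeping only the bounding box they span. A rough sketch of the idea (this is not how coadd.py implements it; the function and its simplifications are made up for illustration):

```python
import numpy as np

def minimal_overlap_slices(input_wcs, input_shape, output_wcs):
    """Rough sketch: bounding box of an input footprint on the output grid.

    ``input_wcs`` / ``output_wcs`` are celestial astropy.wcs.WCS objects.
    Projects the spatial corners of the input image through world coordinates
    into output pixel coordinates and returns slices covering that region
    (corners only, no edge sampling, no spectral axis).
    """
    ny, nx = input_shape
    corners_x = np.array([0, nx - 1, 0, nx - 1])
    corners_y = np.array([0, 0, ny - 1, ny - 1])
    world = input_wcs.pixel_to_world(corners_x, corners_y)
    out_x, out_y = output_wcs.world_to_pixel(world)
    xmin, xmax = int(np.floor(out_x.min())), int(np.ceil(out_x.max()))
    ymin, ymax = int(np.floor(out_y.min())), int(np.ceil(out_y.max()))
    return slice(ymin, ymax + 1), slice(xmin, xmax + 1)
```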
I'm refactoring to use coadd.py with some additions: https://github.com/astropy/reproject/pull/351
we have cubes covering this whole zone with >1000 pixels/cube, so totaling ~1 TB/cube
Do you mean they have >1000 pixels on each side? Do you really mean 1 TB/cube, or 1 GB/cube?
output cube target size is 27800 x 10200 x 1000. Probably will limit 3rd dimension to ~200 pixels in some cases.
I revised the original definition of the cube sizes above in reply to https://github.com/radio-astro-tools/spectral-cube/pull/845#issuecomment-1467030626
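(For scale, assuming 32-bit floats: 27800 × 10200 × 1000 voxels × 4 bytes ≈ 1.1 TB for the full output cube, or ≈ 230 GB if the spectral axis is limited to ~200 channels.)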
@keflavich ok thanks! I have started work on supporting dask inputs to reproject functions, will open a draft PR soon.
However, feel free to also try on your side if you want, I will keep an eye on your PR :)
I think we need both of these together, no? I'm adding 3D capability to `reproject_and_coadd`; you're making it all daskable, right?
ah, no, you're talking about _this_ PR. Yeah, I'm not going to work on this while you are - I'm focused on the other support infrastructure for now
I did mean the reproject PR not this one - yes our work should end up being complementary! (I will just focus on the individual reproject functions not coadd)
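For reference, once `reproject_and_coadd` can take 3D inputs (the capability the reproject PR above is adding), the coadd path would look roughly like the sketch below; the input files and mosaic header are placeholders, and 3D inputs are not supported in the released version:

```python
from astropy.io import fits
from reproject import reproject_interp
from reproject.mosaicking import reproject_and_coadd

# Hypothetical list of per-field cubes and a target mosaic header.
input_hdus = [fits.open(f)[0] for f in ["field1.fits", "field2.fits"]]
target_header = fits.Header.fromtextfile("mosaic_cube.hdr")

# reproject_and_coadd handles the per-input corner-finding / overlap logic
# internally; with 3D support it would do the same for cubes.
mosaic, footprint = reproject_and_coadd(input_hdus,
                                        target_header,
                                        reproject_function=reproject_interp)
```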
Patch coverage is 4.54% of modified lines.

:exclamation: Current head 672245e differs from pull request most recent head df32208. Consider uploading reports for the commit df32208 to get more accurate results.

| Files Changed | Coverage |
|---|---|
| spectral_cube/dask_spectral_cube.py | 4.54% |
@astrofrog I'm trying to test some of the upstream dask changes, but I need this PR as a frontend to those. I've run into a problem - the `reproject_interp` wrapper:
```python
def reproject_interp_wrapper(img_slice, **kwargs):
    # What exactly is the wrapper getting here?
    # I think it is given a _cube_ that is a cutout?
    # No, it is getting dask arrays (at least sometimes)
    if filled:
        data = img_slice.filled_data[:]
    else:
        data = img_slice._data
    return reproject_interp((data, img_slice.header),
```
is wrong - `img_slice` is (in my implementation above) assumed to be a cube subset, but instead it is an array. So I think this PR is totally wrong, taking the wrong general approach - instead, perhaps we should just directly call `reproject.reproject_interp` without going through `apply_parallel_spatial`?
Yes, I think the intermediate wrappers pre-date some of the reproject improvements. We should be able to just call reproject with the full dask array and target WCS.
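In other words, `DaskSpectralCube.reproject` could skip the per-slice wrapper entirely and hand the whole array plus target WCS to reproject. A very rough sketch of that shape (the `_new_cube_with` call and the header/metadata handling are simplified assumptions, not the final implementation):

```python
from astropy.wcs import WCS
from reproject import reproject_interp

def reproject(self, header, order='bilinear', filled=True):
    # Sketch: pass the full (dask-backed) cube and target WCS directly to
    # reproject instead of wrapping per-slice calls in apply_parallel_spatial.
    data = self.filled_data[:] if filled else self._data
    shape_out = (header['NAXIS3'], header['NAXIS2'], header['NAXIS1'])
    newdata, footprint = reproject_interp((data, self.header),
                                          header,
                                          shape_out=shape_out,
                                          order=order)
    # Simplified: the real method would rebuild the WCS, mask, and metadata
    # properly rather than this bare construction.
    return self._new_cube_with(data=newdata, wcs=WCS(header))
```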
@keflavich - I've rebased and pushed an additional commit that I think makes the `DaskSpectralCube.reproject` method work how it should, that is, it doesn't actually do the reprojection and instead passes the resulting dask array on to the new `DaskSpectralCube` (and I think a user can then choose to dump to tmp?). However, this currently fails with the stable version of reproject due to the tmp_dir bug that we discussed before, so I'm looking into that now.
Ok so with https://github.com/astropy/reproject/pull/390 I can get things to run. However, this is currently not optimal for two reasons:
It might be better to be pragmatic here and actually carry out the reprojection inside the reproject function rather than delaying it, at least by default, because it gives us more control over the points above. I'll try and work on this a little more shortly.
Yes, I agree - the default should be the option least likely to cause harm on the user's computer (i.e., unlikely to fill up their HDs with tmpdir junk, unlikely to just crash), and exceptional cases needed for performance or other reasons should be described in the docs.
WIP: trying to get dask reprojection working.