radio-astro-tools / spectral-cube

Library for reading and analyzing astrophysical spectral data cubes
http://spectral-cube.rtfd.org
BSD 3-Clause "New" or "Revised" License
97 stars 65 forks source link

Refactor: use data * unit rather than u.Quantity(data, unit) #775

Closed keflavich closed 2 years ago

keflavich commented 2 years ago

It appears that u.Quantity(data, unit) triggers a full file read, which is disastrous as the whole point of spectral-cube is not to trigger unnecessary IO.

My preference for using that construct came from some very old insight gained when performance testing code for pyradex. If you're adding units a lot of time in a loop, data * unit is expensive. But, we're not expecting to add the units many times - that's not a likely bottleneck. Instead, IO is a bottleneck. It makes more sense to have:

data_with_unit = cube._data * u.K

than

data_with_unit = u.Quantity(cube._data, unit=u.K, copy=False)

for our use cases; we often want to stitch units onto dask arrays, which doesn't play well with u.Quantity's constructor.

(caveat: I haven't tested this yet, so it's possible there's some astropy version dependence)

keflavich commented 2 years ago

This PR is pretty wrong because u.Quantity(x, unit) could sometimes mean "Attach unit unit to unitless x" and sometimes mean "Convert quantity x to unit unit"

keflavich commented 2 years ago

The actual bottleneck is here: https://github.com/astropy/astropy/blob/17e281e/astropy/units/quantity.py#L504-L505 i.e., np.array(dask_data)

keflavich commented 2 years ago

https://github.com/astropy/astropy/issues/8227#issuecomment-951431146 is the same issue

dhomeier commented 2 years ago

i.e., np.array(dask_data)

and then https://github.com/dask/dask/blob/ecddc8df5869679cef5280c869674cbf0d9f60d9/dask/array/core.py#L1551-L1552

This PR is pretty wrong because u.Quantity(x, unit) could sometimes mean "Attach unit unit to unitless x" and sometimes mean "Convert quantity x to unit unit"

Apart from this and (data * unit) not having a .unit method, does the workaround fail anyway to avoid the read?

keflavich commented 2 years ago

I don't know! I think it does, but I didn't get to checking.

keflavich commented 2 years ago

Well, OK, I do know that multiplying a dask array by a unit does not trigger the read, so I'm pretty sure this would work within spectral-cube. I tested by loading a spectral-cube and multiplying cube._data (which was a dask array) by u.K - that returns a dask array, no problem. u.Quantity(cube._data, u.K) triggers the read.

dhomeier commented 2 years ago

isintance(data, u.Quantity) would be easy enough to check for separate treatment for a true Quantity, but what if it is already a (dask_arr * unit)...?

keflavich commented 2 years ago

yes exactly. I started going down the wrapper route a few hours ago, but I think we need python's formal wrapping tools for that, and then we're getting dangerously close to trying to implement the thing Martin said was very hard.

keflavich commented 2 years ago

....I don't know how I managed to flub a merge here.... but I tried to revert the original changeset and go with a wrapper approach, but it's totally WIP and nothing works yet, it's just ideas I started putting on paper

dhomeier commented 2 years ago

what if it is already a (dask_arr * unit)...?

Looks like isinstance(data._meta, u.Quantity) would work in that case (and also allow to extract the unit), but that would indeed go some way to writing a full wrapper.

keflavich commented 2 years ago

superceded by #783