Closed: rlwastro closed this 1 week ago
@snbianco @falkben Just tagging you to keep this fix on your radar. I added a bit of text to the description. I only recently realized this is not just an impact on speed. Cutouts returned in memory also apparently return a view into the full (uncompressed) array, meaning the memory usage of the underlying data can be much larger than the cutout. I have an example that extracts 250 cutouts of size 256x256 pixels (total size about 65 MB). The resulting list from the current code was going to take 39 GB (!), which overflowed the 15 GB memory limit on the machine I was using. The new code using `.section` works correctly and avoids the memory issues.

I think we should try to get this merged as soon as possible since it has some serious impacts on performance.
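For anyone reproducing the memory issue, here is a minimal sketch contrasting the two access patterns (the file name and pixel ranges are placeholders, not the actual test data; `.section` on tile-compressed HDUs needs a recent astropy):

```python
from astropy.io import fits

# Hedged sketch of the behavior described above, using placeholder values.
with fits.open("large_compressed_image.fits") as hdulist:
    hdu = hdulist[1]  # assume a tile-compressed image extension

    # hdu.data decompresses the entire image into memory; slicing it returns
    # a NumPy view, so every retained cutout keeps the full decompressed
    # array alive through its .base attribute.
    data_cutout = hdu.data[1000:1256, 2000:2256]
    print(data_cutout.base is not None)   # True: the slice shares the parent buffer
    print(data_cutout.base.nbytes)        # size of the *full* image, not the cutout

    # hdu.section reads only the tiles covering the requested region and
    # returns an array sized like the cutout itself, so holding many cutouts
    # costs only the cutout pixels.
    section_cutout = hdu.section[1000:1256, 2000:2256]
    print(section_cutout.nbytes)          # on the order of the cutout size only
```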
Attention: Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 95.82%. Comparing base (11b0567) to head (667fd83). Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| astrocut/cutouts.py | 87.50% | 1 Missing :warning: |
Very nice. Sorry it took a while for me to take a look at this. I wonder if we need a similar change (`.data` -> `.section`) for this part of the code:
https://github.com/spacetelescope/astrocut/blob/main/astrocut/cutout_processing.py#L463
and maybe that would speed up the moving target cutouts considerably?
@falkben That's a good question. I think the answer is yes, we should make the same change there. If this code that pieces together images from different skycells gets applied to Pan-STARRS images (which is on my list of things to work on), it will be much faster if `.data` is changed to `.section` in that code.

I believe these changes will not make much difference in speed for uncompressed images (e.g., TESS data cubes and HAP-MVM images). But I have not actually done timing tests on uncompressed images in the S3 bucket to see whether it has an effect on those images too. And it is possible that the `.section` approach will improve memory handling even for uncompressed images. I think it can't hurt in any case to make the change you suggest.
Agreed! I'll make this change on the same MR as the updated unit tests and do some benchmarking as well.
@rlwastro @falkben After looking into this a bit more, I don't think that changing line 463 in `cutout_processing.py` will work. `build_default_combine_function` initializes a function to combine an array of cutouts, and the entirety of each cutout has to be loaded into memory to check for NaN values and compute the pixel weights.
@snbianco I have not quite been able to figure out what the parameters are for `build_default_combine_function`. But assuming that they are already image cutouts that have been extracted from the larger images, I think you are correct. And in fact that line of code:

`img_arrs = np.array([hdu.data for hdu in template_hdu_arr])`

... only really makes sense if the `data` attributes are used rather than the `section` attributes. It implies that we are stacking up all the individual cutouts into a big cube.
So yes, I agree with you!
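As a purely hypothetical illustration (not the actual `build_default_combine_function`), a combine step of this kind has to inspect every pixel of every cutout to build its NaN-based weights, which is why lazy `.section` access would not save anything there:

```python
import numpy as np

# Illustrative sketch only; astrocut's real combine function may differ.
def combine_cutouts(img_arrs):
    """img_arrs: array of shape (n_cutouts, ny, nx) stacked from hdu.data."""
    weights = (~np.isnan(img_arrs)).astype(float)  # 0 where a cutout has no data
    norm = weights.sum(axis=0)
    norm[norm == 0] = 1                            # avoid division by zero in empty pixels
    return np.nansum(img_arrs, axis=0) / norm      # mean over the valid cutouts per pixel
```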
Accessing the `.data` attribute on a compressed image causes the entire image to be decompressed and read into memory. That defeats the ability of astrocut to reduce the IO requirements by reading only the necessary data.

Another benefit of the change is that it greatly reduces the memory requirements for the returned cutout. The current version, which slices the `.data` array, returns a subset that is a view into the full image array, which winds up dragging the entire array along with it. The `.section` approach does not include the entire array.

This change instead uses the `.section` attribute to access the data, both when reading the pixels and when checking to see which extensions have images. The only remaining use of the `.data` attribute is for access to the cutout pixels. The implementation of the `section` attribute makes it efficient for accessing data in AWS S3 buckets as well. The speedup ranges from a factor of 2 (for large cutouts) to at least a factor of 10 (for small cutouts).
This change also fixes a bug in the application of the `memory_only` flag. When true, it should not write any output files to disk. However, that was ignored when `single_outfile` is false. This fixes that bug. It was already working correctly when `single_outfile` is true.
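To make the flag interaction concrete, here is a hypothetical sketch of the corrected control flow; the function and helper names are placeholders, not astrocut's actual code:

```python
from astropy.io import fits


def combine_into_single_hdulist(cutout_hdulists):
    """Placeholder helper: concatenate cutout extensions behind one primary HDU."""
    hdus = [fits.PrimaryHDU()]
    for hdulist in cutout_hdulists:
        hdus.extend(hdu for hdu in hdulist if not isinstance(hdu, fits.PrimaryHDU))
    return fits.HDUList(hdus)


def finalize_cutouts(cutout_hdulists, output_paths, single_outfile, memory_only):
    """Hypothetical sketch of the corrected memory_only handling (not the real code)."""
    if single_outfile:
        combined = combine_into_single_hdulist(cutout_hdulists)
        if memory_only:
            return [combined]                 # this branch already behaved correctly
        combined.writeto(output_paths[0], overwrite=True)
        return output_paths[:1]

    if memory_only:                           # previously ignored: files were written anyway
        return cutout_hdulists
    for hdulist, path in zip(cutout_hdulists, output_paths):
        hdulist.writeto(path, overwrite=True)
    return output_paths
```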