spacetelescope / astrocut

Tools for making image cutouts from sets of TESS full frame images
https://astrocut.readthedocs.io
25 stars 12 forks source link

Clear cache for gc #91

Closed falkben closed 1 year ago

falkben commented 1 year ago

The code:

cube._file._file.fs.clear_instance_cache()

Effectively is calling s3fs.core.S3FileSystem.clear_instance_cache. This clears the cache on the s3fs.core.S3FileSystem class. If this is not done, the cache continues to grow, and repeated calls to cube_cut results in effectively a memory leak.

This problem was made obvious with TESSCut, where the service performs cutouts over the lifetime of the server process. Even with cutouts of the exact same location, the cache continued to grow and would grow to multiple gigabytes.

We are not testing with other types of cloud storage, but this should work for any fsspec compatible file system, since they all inherit from the base class:

https://github.com/fsspec/filesystem_spec/blob/45a6aec7da1407243f9767c6ab0cff40efee72eb/fsspec/spec.py#L1412-L1425

When making repeated cutouts, to the same location, the cache is useful to have, at it seems to retrieve data from the cache first before making the external requests.

It's not clear to me what extent cutouts made by an individual user would benefit from this cache, though, since most cutouts are unique.

Within tesscut, with multiple users, it's possible that cutouts could be made to the same locations, but probably not very likely. Additionally, the cutouts would also have to be made inside the same process.

We could consider making the call to s3fs.core.S3FileSystem.clear_instance_cache() from within TESSCut itself instead of in the cube_cut() function, if we didn't believe that it would useful for individual users to clear their cache. If we did this, then TESSCut could choose how often to clear it's own cache.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (fe56d85) 94.45% compared to head (b265250) 94.46%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #91 +/- ## ======================================= Coverage 94.45% 94.46% ======================================= Files 8 8 Lines 1443 1445 +2 ======================================= + Hits 1363 1365 +2 Misses 80 80 ``` | [Files Changed](https://app.codecov.io/gh/spacetelescope/astrocut/pull/91?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spacetelescope) | Coverage Δ | | |---|---|---| | [astrocut/cube\_cut.py](https://app.codecov.io/gh/spacetelescope/astrocut/pull/91?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spacetelescope#diff-YXN0cm9jdXQvY3ViZV9jdXQucHk=) | `98.97% <100.00%> (+<0.01%)` | :arrow_up: |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

falkben commented 1 year ago

Now that I have this PR sitting here, I'm actually leaning more towards doing the cache clearing inside tesscut and not in astrocut.

I worry about clearing a user's cache unnecessarily, if they were calling cube_cut() directly. They would only experience an "issue" if they made repeated calls to cube_cut() with the same cutout params, so it doesn't seem especially important.

Since tesscut knows the filesystem it's using (s3fs.core.S3FileSystem) we would also avoid the issue of having to reach into "protected" member functions within astropy fits.

falkben commented 1 year ago

Closing in favor of just handling this in tesscut instead