spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/

JP-3706: Set outlier detection buffer size based on total available memory #8756

Closed: emolter closed this 2 weeks ago

emolter commented 3 weeks ago

Resolves JP-3706

Closes #8708

To save memory while computing the median over a large number of input models in outlier_detection, models are loaded in spatial sections using outlier_detection.utils.get_sections(). The size of these sections used to be governed by the buffer_size parameter, which was hard-coded to a default value, expressed in megabytes, that controlled the size of the section loaded per model. This was a clunky way of expressing the limit, and it led to many more model read-write operations (and therefore much longer runtimes than necessary) when the total memory allocation, i.e., buffer size times number of models, was much smaller than the available memory on the machine.

This PR re-purposes the outlier detection step's pre-existing allowed_memory parameter, which specifies the amount of allowed memory usage as a fraction of the total available memory, to specify the size of the sections loaded into memory. The buffer_size parameter in get_sections is now computed as available_memory / len(resampled_models.group_names) / _ONE_MB (it expects a number in megabytes).
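For illustration, a minimal sketch of the computation described above (the psutil call and the 0.5 default are assumptions on my part; only the formula itself comes from the PR description):

```python
import psutil

_ONE_MB = 1 << 20

def compute_buffer_size(resampled_models, allowed_memory=0.5):
    # Total memory the median step may use, as a fraction of what psutil
    # reports as available; the exact call used in the PR may differ.
    available_memory = allowed_memory * psutil.virtual_memory().available
    # Per-model buffer handed to get_sections(), expressed in megabytes.
    return available_memory / len(resampled_models.group_names) / _ONE_MB
```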


emolter commented 3 weeks ago

regression tests started here

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.

Project coverage is 60.84%. Comparing base (e16c058) to head (2504d0f). Report is 46 commits behind head on master.

Files with missing lines Patch % Lines
jwst/outlier_detection/utils.py 91.66% 1 Missing :warning:
Additional details and impacted files

```diff
@@            Coverage Diff            @@
##           master    #8756     +/-   ##
=========================================
  Coverage   60.83%    60.84%
=========================================
  Files         373       373
  Lines       38636     38647      +11
=========================================
+ Hits        23503     23513      +10
- Misses      15133     15134       +1
```


emolter commented 3 weeks ago

The regression test failures all match ones in the nightly runs from either two nights ago or last night; the reason it's not a perfect match to either night is that Tyler was fixing regression tests yesterday.

braingram commented 3 weeks ago

Since allowed_memory defaults to None, does that mean that outlier detection will now use up to half the available memory for the median (where before it was 10 MB per model)? It looks like the "memory" includes swap, so it will likely start swapping before using a smaller buffer. It's also hard to see how this new use and the existing use of allowed_memory (for resample called by outlier detection) will interact.

Are there options to:

- add a new spec parameter to avoid the interaction with resample?
- reduce the default to avoid swapping?

emolter commented 3 weeks ago

Since allowed_memory defaults to None, does that mean that outlier detection will now use up to half the available memory for the median (where before it was 10 MB per model)?

Yes, it will use half the available memory in total (or whatever we set the default to). But note that before, it was using 10 MB per model, so the total memory was not well defined.
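For context, a hedged sketch of how the "available memory" figure could be obtained with psutil; whether swap is counted is exactly the concern raised above:

```python
import psutil

def available_memory_bytes(include_swap=True):
    # psutil reports physical RAM and swap separately; folding swap into the
    # total (as the check apparently does) means the computed buffer can
    # exceed physical RAM, which is what would trigger swapping.
    available = psutil.virtual_memory().available
    if include_swap:
        available += psutil.swap_memory().free
    return available
```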

Are there options to add a new spec parameter to avoid the interaction with resample

We could do this, of course. But I don't know what you mean about interactions between them. The only thing allowed_memory currently does is raise an error if the amount of memory needed to hold the output array from resample is larger than what is allowed. The memory should all be released once the median computation is done, so I wouldn't expect any conflict there, and I can't think of a scenario where you would want to allow the pipeline a different amount of memory while running the median computation vs. resampling.

Are there options to reduce the default to avoid swapping

Yes, this is a good idea. Do you think it would be better to set the default to some fraction of the total without swapping, or just decrease the number? And what do you think is reasonable? TBH I have no sense for this.

braingram commented 3 weeks ago

One scenario is a user sets allowed_memory to say 100MB. Let's say resample (within outlier detection) generates grouped drizzled models of 80MB size. This will pass without an error (since it's below allowed_memory). During the median calculation, these drizzled models will be read in (one at a time, so +80MB) and the buffer will be set to 100MB (+180MB) plus the size of the median (+?MB). Let's say this exceeds the allowed memory for the user, so they reduce allowed_memory to try and stay under their limit. They drop it to 50MB to reduce the buffer size and now resample fails.

The docs also mention the use of the environment variable DMODEL_ALLOWED_MEMORY when allowed_memory is None. Is that supported by this PR? I suggest we not add support for it and instead use a different variable/name that isn't already used by resample.

I'm not sure picking a one-size-fits-all buffer size will ever work. Zarr defaults to 1MB chunks: https://zarr.readthedocs.io/en/v2.1.0/tutorial.html#chunk-size-and-shape but our uses here are encumbered by the datamodels requiring loading substantially more than the chunk.

emolter commented 3 weeks ago

One scenario is a user sets allowed_memory to say 100MB. Let's say resample (within outlier detection) generates grouped drizzled models of 80MB size. This will pass without an error (since it's below allowed_memory). During the median calculation, these drizzled models will be read in (one at a time, so +80MB) and the buffer will be set to 100MB (+180MB) plus the size of the median (+?MB). Let's say this exceeds the allowed memory for the user, so they reduce allowed_memory to try and stay under their limit. They drop it to 50MB to reduce the buffer size and now resample fails.

Okay, sure, but in the scenario you outlined I feel as though the real problem is that when the user set the allowed memory in the first instance, it was exceeded without tripping our custom OutputTooLargeError. Which is to say that, IMO, the current usage of allowed_memory on main may be problematic. I don't know the history of this, i.e., why the custom exception was written in the first place.

The docs also mention the use of the environment variable DMODEL_ALLOWED_MEMORY when allowed_memory is None. Is that supported by this PR? I suggest we not add support for it and instead use a different variable/name that isn't already used by resample.

That is not currently supported by this PR, and I agree that whether it should be supported depends on whether we use an additional/separate parameter for the allowed memory in the median computation.

I'm not sure picking a one-size-fits-all buffer size will ever work

I feel as though having a patchwork of multiple different parameters governing different aspects of the memory usage is confusing for users. If I were running the pipeline, I would want to be able to specify a total memory allocation not to be exceeded. This might be more challenging to implement, but it's not obvious to me why it would never work.

braingram commented 3 weeks ago

I am in favor of exposing control of the "running-median" buffer size to the user and I think a different option is easier to understand (and document) than trying to re-use allowed_memory.

As far as the default goes, I think 1 or 10 MB per model is closer to what we want (ideally) than 50% of remaining RAM and swap (which will almost certainly swap). This is likely inefficient for the current code (which uses datamodels for the temporary storage). This is separate from this PR, but I think we could see (possibly bigger) efficiency gains if we instead used "vanilla" numpy arrays for the temporary storage. For that configuration I don't know for certain, but I have a hunch (based on the zarr default) that a smaller buffer size would be better (or perhaps picking one that most closely matches a page size).

What do you think about changing this PR to:

emolter commented 3 weeks ago

I think we need some additional opinions here. I don't think my questions/concerns from my previous comment are addressed by your reply, so let me try to add some info and context so we can hopefully resolve this.

Overall, I'm of the fairly strong opinion that regardless of whether we end up with one parameter or two, there is no reason why the default buffer size should be any absolute number of megabytes per model. The two numbers that are relevant here are the available memory on the machine and the number of models, and the program can figure out both. It seems to me that the fastest processing would occur with the minimum number of load operations, which would happen when all available memory is being used (with the caveat that we should avoid using swap as you pointed out).

I don't have any problem with separating this parameter from allowed_memory per se, but I don't really understand the point of allowed_memory as it currently exists, and especially not if the total available memory isn't being checked in any way during the median computation. Same question for the environment variable DMODEL_ALLOWED_MEMORY: in what way(s) is it currently used?

I understand your concern that combining the parameters might lead to confusion, but let me give you an example where separating them leads to confusion: say you're processing 100 images with substantial spatial overlap. Say each image is 50 MB. The output resampled model might not be much larger than each individual integration, call it 250 MB. The available_memory parameter comes out to 500 MB, and the user figures that's fine, thinking that it should easily contain their output mosaic. But that's not what happens: the step runs out of memory during compute_median (without raising our OutputTooLargeError) because it attempted to load the default 10 MB per image times 100 images for a total of 1 GB memory.

braingram commented 3 weeks ago

I think we need some additional opinions here. I don't think my questions/concerns from my previous comment are addressed by your reply, so let me try to add some info and context so we can hopefully resolve this.

Thanks for the extra details. I didn't mean to dismiss (or miss) any questions or concerns.

Overall, I'm of the fairly strong opinion that regardless of whether we end up with one parameter or two, there is no reason why the default buffer size should be any absolute number of megabytes per model.

One reason is that the file IO will occur as blocks. Assuming all disk IO uses a 4k block size and the stars align so that one section from one model is 4k, the disk IO for computing the median using 4k from each model would be the same as reading in all models (assuming the OS doesn't over-read the file to try to fill a memory page). If we pick a section that is too small (<4k), then each section will result in reading in a block (4k) and then throwing out the extra memory. Picking a bigger section is less of an issue for IO (but the ideal would still be block aligned: 8k, 12k, etc.).
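To make the block-size argument concrete, a small sketch (the 4k block size is an assumption, not a measured value):

```python
ASSUMED_BLOCK_SIZE = 4096  # bytes; typical filesystem block size

def block_aligned_section(requested_bytes, block=ASSUMED_BLOCK_SIZE):
    # Round the requested per-model section up to a whole number of IO
    # blocks, so a read never pulls in a block only to discard most of it.
    n_blocks = max(1, -(-requested_bytes // block))  # ceiling division
    return n_blocks * block
```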

The two numbers that are relevant here are the available memory on the machine and the number of models, and the program can figure out both. It seems to me that the fastest processing would occur with the minimum number of load operations, which would happen when all available memory is being used (with the caveat that we should avoid using swap as you pointed out).

Thanks for the details here. Extending this description, assuming infinite RAM the best option would be to set in_memory=True.

So for the median calculation, consider the two quantities:

- the "section" (the amount of each model loaded at a time), and
- the "buffer" (the total memory used at once, roughly the section size times the number of models).

If "section" is too small (smaller than a filesystem block) IO will likely limit performance. If "buffer" is too large (larger than available RAM) memory will likely limit performance (as the OS will start swapping).

Are we in agreement that we want to avoid those 2 conditions? If so, I think the question is where do we target in between those two extremes?

I don't have any problem with separating this parameter from allowed_memory per se, but I don't really understand the point of allowed_memory as it currently exists, and especially not if the total available memory isn't being checked in any way during the median computation. Same question for the environment variable DMODEL_ALLOWED_MEMORY: in what way(s) is it currently used?

I agree that allowed_memory and DMODEL_ALLOWED_MEMORY are confusing. Using a different parameter allows this confusion to be a separate issue from this PR.

emolter commented 3 weeks ago

Are we in agreement that we want to avoid those 2 conditions? If so, I think the question is where do we target in between those two extremes?

I totally agree. And I guess it would be good to issue a warning if a user parameter causes the code to run up against either extreme. As far as where to target, I assumed that one would want to use as much memory as possible (without using swap) so that the number of sections per model is minimized (and therefore the models need to be loaded as few times as possible). Do you agree with that philosophy?

braingram commented 3 weeks ago

I totally agree. And I guess it would be good to issue a warning if a user parameter causes the code to run up against either extreme. As far as where to target, I assumed that one would want to use as much memory as possible (without using swap) so that the number of sections per model is minimized (and therefore the models need to be loaded as few times as possible). Do you agree with that philosophy?

I'd say no :)

One example is our regression tests. They're currently run in parallel. If outlier detection decides to use all the RAM, any other parallel test will start swapping when it tries to allocate memory.

Another example is HPC users who might have jobs tuned for processing JWST data. Does psutil take into consideration memory limits for job queueing systems like Slurm? If a user has a job with a defined (often required) memory limit and we make the pipeline use a buffer based on the free RAM (instead of the memory limit), we're likely to pick a value higher than the limit and get the job killed.

emolter commented 3 weeks ago

These are fair points. If I hear what you're saying, there is no "good" default that works in every case, and I think I agree. But isn't that sort of the point of exposing the parameter? The two use cases you brought up are somewhat "custom" in the sense that we probably wouldn't expect most users to be doing that.

In the case that someone is running jobs in parallel, they would be most likely to scale the memory available to each node according to the number of nodes. So I'd assume it would still be desirable to set the total allowed memory by outlier detection, and the parameter should therefore parametrize that total. Similarly for the queuing system. In neither case does it make sense to me to parametrize a per-model memory allocation, unless you were somehow computing the median itself in parallel over the chunks (which we do not support).

I could see a case for no default at all, i.e., if in_memory is set to False then the user MUST specify a memory allocation. But for the reasons I stated above I still think it shouldn't be on a per-model basis.

braingram commented 3 weeks ago

These are fair points. If I hear what you're saying, there is no "good" default that works in every case, and I think I agree. But isn't that sort of the point of exposing the parameter? The two use cases you brought up are somewhat "custom" in the sense that we probably wouldn't expect most users to be doing that.

I'm all for exposing some parameter. I don't think either of these is an uncommon use case (the regtests run every night, and of the few help desk issues I've seen and the GitHub issues from external users, many were from people running on HPCs).

In the case that someone is running jobs in parallel, they would be most likely to scale the memory available to each node according to the number of nodes. So I'd assume it would still be desirable to set the total allowed memory by outlier detection, and the parameter should therefore parametrize that total. Similarly for the queuing system. In neither case does it make sense to me to parametrize a per-model memory allocation, unless you were somehow computing the median itself in parallel over the chunks (which we do not support).

Currently there is no parameter exposed. We agree that exposing something could be useful. I think the options at this point in the conversation are:

- expose the total buffer size (and compute the per-model section size), or
- expose the per-model section size (and compute the total buffer size).

These are similar to the 2 extremes we'd like to avoid mentioned above: a section that is too small (IO limited) and a buffer that is too large (swapping).

If we pick only one, we accept that the other won't be configurable and there are circumstances where that will be problematic. What if we picked both with "sensible" defaults?

As far as I know, no default settings prior to this PR set a limit on the maximum memory used by the pipeline. Is that your understanding as well?

I could see a case for no default at all, i.e., if in_memory is set to False then the user MUST specify a memory allocation. But for the reasons I stated above I still think it shouldn't be on a per-model basis.

I don't think that will work as in_memory is set to False by default.

I think my main concern for this PR is the change in the default behavior: on main, the outlier detection median computation is largely IO bound (and memory efficient), with an option to disable this by setting in_memory=True, while with this PR it's memory hungry even with in_memory=False.

emolter commented 3 weeks ago

Maybe I'm missing something, but in the proposal above, isn't buffer size simply section size * number of models? Don't they depend on one another? What is the point of setting section_size to anything smaller than buffer size / number of models?

As far as I know, no default settings prior to this PR set a limit on the maximum memory used by the pipeline. Is that your understanding as well?

The code currently checks if the size of the output array for resample is larger than the available memory (virtual+swap)*allowed_memory. But that's the only check I know about.

On main, the outlier detection median computation is largely IO bound (and memory efficient), with an option to disable this by setting in_memory=True, while with this PR it's memory hungry even with in_memory=False.

But the code runs much faster when we push it toward the memory-hungry side. See my comment on the JIRA ticket. And FWIW I think the default should be changed to in_memory=True, although that is probably beyond the scope of the PR.

braingram commented 3 weeks ago

Maybe I'm missing something, but in the proposal above, isn't buffer size simply section size * number of models? Don't they depend on one another? What is the point of setting section_size to anything smaller than buffer size / number of models?

Yeah they depend on one another. On main section_size is hard-coded and buffer size is computed. With this PR buffer size has a default value (50% ram and swap), is user configurable and section_size is computed. We have to pick (at least) one and give it a default. I think your question assumes we're exposing only buffer size (and computing section size). If we go with this approach, is there an alternative to using some % of free ram and swap for the default (which as discussed above has issues)? If not, I think that argues that picking a default section size makes more sense as we can default it to a small typical block size.

The code currently checks if the size of the output array for resample is larger than the available memory (virtual+swap)*allowed_memory. But that's the only check I know about.

That's only used if allowed_memory is defined by the user (the default behavior is no limit): https://github.com/spacetelescope/jwst/blob/0e86465f92a6ab397e3f582558fbfaabfd1edc65/jwst/resample/resample.py#L155-L158
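Paraphrasing that check as a sketch (not the actual resample.py source; the names, dtype, and psutil calls are illustrative):

```python
import numpy as np
import psutil

class OutputTooLargeError(MemoryError):
    """Stand-in for the custom error discussed in this thread."""

def check_output_fits(output_shape, allowed_memory, dtype=np.float32):
    # Nothing is enforced when allowed_memory is left at its default (None).
    if allowed_memory is None:
        return
    required = int(np.prod(output_shape)) * np.dtype(dtype).itemsize
    budget = allowed_memory * (psutil.virtual_memory().available
                               + psutil.swap_memory().total)
    if required > budget:
        raise OutputTooLargeError(
            f"output needs {required} bytes; budget is {budget:.0f} bytes")
```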

But the code runs much faster when we push it toward the memory-hungry side. See my comment on the JIRA ticket. And FWIW I think the default should be changed to in_memory=True, although that is probably beyond the scope of the PR.

I'll take a look at the ticket. Did you try using numpy arrays and not datamodels as temporary files?

emolter commented 3 weeks ago

I think your question assumes we're exposing only buffer size (and computing section size)

No, I just got confused about your comment that we should expose both. I cannot picture a scenario where both would be tuned separately.

Did you try using numpy arrays and not datamodels as temporary files?

No, that is beyond the scope of this PR.

is there an alternative to using some % of free ram and swap for the default (which as discussed above has issues)?

I think we still differ in opinion on how serious those issues are. For the regression tests, I would argue that in_memory=True should be used all the time (except when explicitly testing on-disk mode) because the datasets are small by definition. I agree that computing the default may run into issues for job queuing systems, but in that case the parameter would be exposed for the user to tune. I'm still waiting to hear back from Team Coffee about the setup in ops, but if they prefer using environment variables to govern/modify the memory behavior, then once again I feel it would be easier for them to set the amount of total memory, i.e., buffer, so they don't have to know how many image models are being fed into each run.

picking a default section size makes more sense as we can default it to a small typical block size

Repeating my findings from the ticket, there was a factor-of-7 speedup when changing the section size from ~1 MB to ~100 MB. I'm very hesitant to set a default behavior that throws away that speedup.

braingram commented 3 weeks ago

I'm not suggesting that we do this but I'd like to provide a point of comparison with the approach taken in this PR. It's my interpretation that increasing the memory footprint for outlier detection is primarily to work around the poor IO performance for the median calculation. I believe this is primarily because datamodels are used as the temporary storage. Here's a branch that uses zarr for the median calculation: https://github.com/spacetelescope/jwst/compare/master...braingram:jwst:array_library I used zarr to make the chunk math easier (I don't think zarr is necessary and it's currently not a jwst dependency).

Running the example association discussed in the linked JP ticket with this branch takes 76.4 seconds to compute the median when using a 1 MB chunk size (and 76.8 seconds for a 10 MB chunk size). Yes, this type of change is outside the scope of this PR, but I believe it addresses the root cause and is much less sensitive to this parameter (the JP ticket mentions runtimes of ~197 seconds for a 7 MB section size and 159 for a 100 MB section size). It is also closer to the "in memory" performance of 33 seconds as mentioned in the JP ticket.

emolter commented 3 weeks ago

It's my interpretation that increasing the memory footprint for outlier detection is primarily to work around the poor IO performance for the median calculation.

I agree completely.

I believe this is primarily because datamodels are used as the temporary storage.

This makes a lot of sense, because loading and saving datamodels has lots of overhead.

Running the example association discussed in the linked JP ticket with this branch takes 76.4 seconds to compute the median when using a 1 MB chunk size (and 76.8 seconds for a 10 MB chunk size). Yes, this type of change is outside the scope of this PR, but I believe it addresses the root cause and is much less sensitive to this parameter (the JP ticket mentions runtimes of ~197 seconds for a 7 MB section size and 159 for a 100 MB section size). It is also closer to the "in memory" performance of 33 seconds as mentioned in the JP ticket.

I agree this gets around the problem. If we could decrease the sensitivity of the runtime to the chunk size, then I would have absolutely no issues with your proposed solution of leaving the buffer size a per-model quantity. In fact, in that case I might be in favor of not exposing any parameter at all, and leaving everything else as it is on main. The whole reason I wrote the ticket was the strong sensitivity of the runtime to that parameter, but if that went away, I don't see a good reason we couldn't get away with relatively small chunk sizes, as you suggested in the first place. Then we could, for example, add a warning when chunk_size * n_models > RAM and an OutputTooLargeError when chunk_size * n_models > RAM + swap.
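A sketch of those guard rails, under the stated conditions (names are illustrative):

```python
import warnings
import psutil

class OutputTooLargeError(MemoryError):
    """Stand-in for the custom error mentioned above."""

def check_median_memory(chunk_size, n_models):
    # chunk_size: per-model section size in bytes
    total = chunk_size * n_models
    ram = psutil.virtual_memory().total
    swap = psutil.swap_memory().total
    if total > ram + swap:
        raise OutputTooLargeError("chunk_size * n_models exceeds RAM + swap")
    if total > ram:
        warnings.warn("chunk_size * n_models exceeds RAM; swapping is likely")
```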

In conclusion, I think that's a great idea, and it doesn't seem particularly difficult, other than that we would need to do a small amount of wheel-reinventing to rebuild what zarr currently does.

If we are going to do that eventually, I'd say we should do it right away instead of putting a band-aid solution in there - the 10 MB buffer that's the current behavior hasn't broken anything I'm aware of, so even if that doesn't make it into the next build I don't think it's the end of the world.

What do you think of that solution @perrygreenfield and @nden ?

perrygreenfield commented 2 weeks ago

I'm going to add a few general comments on the above discussion, which I have not read in complete detail but I think I get most of the gist of it. Like Brett, I am wary of making a fraction of available memory a default, since we don't know the context that the code will be run under (e.g., parallel processes, or other memory-intensive tasks that will be run after this starts). I'm also wary of focusing on a per-model memory allocation, since that scales with the number of models, and it is the large model case that is causing problems. I think I would prefer the size of all the sections as the allocation limit. Since we now have computers with GBs of memory, allocating any limit that results in 1 MB per model or anything in that range seems way too small as a default. With the current I/O that will only end up with massive rereads of FITS data extensions (correct?). Anything to minimize that would be a good thing.

Figuring out how much memory a process should use is fraught with complications unless one is in control of all the processes that may be running and knows what the other processes need, which is not usually the case with most users (and may not be the case even in well-controlled operations). Having the option to specify it that way is probably a useful thing, but it shouldn't be the default in my opinion.

emolter commented 2 weeks ago

Thanks for the response and suggestions Perry. When you say

having the option to specify it that way is probably a useful thing, but it shouldn't be the default in my opinion.

does that mean you would be in favor of making in_memory=True the default for the step?

perrygreenfield commented 2 weeks ago

It's been a while. Remind me what in_memory=True means in this case.

braingram commented 2 weeks ago

I'm going to add a few general comments on the above discussion, which I have not read in complete detail but I think I get most of the gist of it. Like Brett, I am wary of making a fraction of available memory a default, since we don't know the context that the code will be run under (e.g., parallel processes, or other memory-intensive tasks that will be run after this starts). I'm also wary of focusing on a per-model memory allocation, since that scales with the number of models, and it is the large model case that is causing problems. I think I would prefer the size of all the sections as the allocation limit. Since we now have computers with GBs of memory, allocating any limit that results in 1 MB per model or anything in that range seems way too small as a default. With the current I/O that will only end up with massive rereads of FITS data extensions (correct?). Anything to minimize that would be a good thing.

If we don't use FITS we can make it not a problem :)

The zarr prototype computes the chunk size based on (n_models, n_rows, n_cols), where n_rows and n_cols are either 1 or a larger number keeping the total chunk size < 1 (or 10) MB. This shape was chosen so that the median calculation can occur for each chunk (each chunk is read only once), but this does make writing less efficient, since each model will result in writing all chunks (there are likely ways to improve this since some of the data will be NaN, so we don't need to write those chunks).
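A rough sketch of that kind of chunked layout (illustrative only, assuming the zarr v2 API; not the prototype branch's actual code):

```python
import numpy as np
import zarr

def chunked_nanmedian(model_arrays, target_chunk_mb=1, path="median_scratch.zarr"):
    # model_arrays: list of 2D float32 arrays, all the same shape
    n_models = len(model_arrays)
    n_rows, n_cols = model_arrays[0].shape
    itemsize = np.dtype("f4").itemsize
    # Chunks span all models so each chunk is read exactly once for the
    # median; rows per chunk shrink to keep each chunk under the target size.
    rows = max(1, int(target_chunk_mb * 2**20) // (n_models * n_cols * itemsize))
    store = zarr.open(path, mode="w", shape=(n_models, n_rows, n_cols),
                      chunks=(n_models, rows, n_cols), dtype="f4")
    for i, arr in enumerate(model_arrays):
        store[i] = arr  # each model write touches every chunk, as noted above
    median = np.empty((n_rows, n_cols), dtype="f4")
    for r0 in range(0, n_rows, rows):
        median[r0:r0 + rows] = np.nanmedian(store[:, r0:r0 + rows, :], axis=0)
    return median
```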

Let me try to explain the current idea in my head. I doubt it's complete and help in pointing out issues and improvements is appreciated.

For the current in_memory=False mode the code:

- writes each resampled (drizzled) model to disk as a datamodel, then
- for every spatial section, re-opens each of those models, reads that section, and computes the median of the stacked sections.

The most efficient in_memory=False mode I can think of would be:

- as each drizzled model is produced, split it into sections and append each section to a per-section temporary file, then
- read each temporary file once and compute the median for that section.

This would allow reading and writing temporary files only once (although writing would be incremental through the appending of each section) and the total memory used would be largely based on the median image size. I think a sensible default for the temporary file size would be the same as the input data size but maybe there's a different optimal.

I hope that makes sense, but please let me know if there are any questions. I'm not 100% certain that appending sections to the temporary files will work, but I think it should be doable (I just haven't done it yet, so there might be an array ordering issue).
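A minimal sketch of the append-per-section idea described above, assuming plain binary temporary files (not datamodels) hold each section's stack:

```python
import numpy as np

def append_section(path, section):
    # Append one model's piece of a spatial section to that section's
    # temporary file; each file is written incrementally, once per model.
    with open(path, "ab") as f:
        f.write(np.ascontiguousarray(section, dtype=np.float32).tobytes())

def median_of_section(path, section_shape):
    # Read the section's stack back in one pass and collapse it.
    stack = np.fromfile(path, dtype=np.float32).reshape((-1, *section_shape))
    return np.nanmedian(stack, axis=0)
```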

emolter commented 2 weeks ago

It's been a while. Remind me what in_memory=True means in this case.

When in_memory=True the entire input ModelLibrary, i.e., the list of image models, is already in memory, which means that the median computation is just done using np.nanmedian(np.array(model_list), axis=0). _get_sections() is entirely bypassed and any temp file writing / memory handling is irrelevant - the assumption is that the machine has enough memory to store everything. This mode is the fastest (but obviously also the most memory-intensive) and is the best choice for smaller datasets. I haven't checked, but I would have guessed that most datasets that need to be processed fall into this category since, as you say, computers these days have access to memory on the order of gigabytes.

perrygreenfield commented 2 weeks ago

Regarding in_memory=True: if the projected memory need is less than some specified memory limit (or one computed from it), then it probably should automatically be True if not specified. In other words, it could be specified by the user, but if it isn't, and a memory limit is specified by the user (or the data fall below the default value for the memory limit, 2 GB?), it is set to True if it will all fit in that amount of memory. At least that would relieve the user from worrying about most cases, perhaps.
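A hypothetical sketch of that auto-selection (the 2 GB figure and all names are illustrative):

```python
def choose_in_memory(n_models, model_nbytes, in_memory=None,
                     memory_limit=2 * 1024**3):
    # Respect an explicit user choice; otherwise keep everything in memory
    # only when the projected footprint fits within the limit.
    if in_memory is not None:
        return in_memory
    return n_models * model_nbytes <= memory_limit
```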

perrygreenfield commented 2 weeks ago

Regarding @braingram's comment about the False case, that is essentially what I had in mind, at least as a temporary solution to work around the problems of re-opening the existing format(s).

emolter commented 2 weeks ago

This PR is being withdrawn in light of the fact that the memory handling should be changed so as to make the runtime of the step insensitive to the default chunk size; see https://github.com/spacetelescope/jwst/issues/8774