iiasa / rime

Rapid Impact Model Emulator
GNU General Public License v3.0

How to deal with peaking? #34

Open perrette opened 1 week ago

perrette commented 1 week ago

As far as peaking goes, I am not sure whether any issues that arise (I have not investigated it first-hand, but I do remember it being mentioned during Katja Frieler's work on pattern scaling) are related to the time lag or to other effects (thinking of negative atmosphere-ocean heat flux). It's not a big deal to flag any peaking and create pre- and post-peak warming tables, but of course that would only apply to some scenarios and not others, and then there is the question of how to apply it. So probably you would want two calibration tables (pre- and post-peak), and also diagnose pre- and post-peaking when applying these tables to new temperature pathways. That is doable, but I wonder whether it's the smartest way to deal with the issue.

On the one hand it comes at a cost, because it complicates the code (never a good thing for maintenance, bugs, the need for testing, etc.), and on the other it's not clear to me how large the potential benefit is. I mean, is it really a binary thing (peaking vs. not peaking)? Maybe it depends more on the slope, or on the integral of past warming (heat uptake). And splitting the training data in two, with rather few scenarios providing training data for the post-peaking period, may introduce biases of its own. But I don't know: I'm just seeing the complications, without having been confronted with the problems first-hand, and without a clear feeling that this would be a solution. My suggestion would be to treat peaking as a separate issue, and to discuss it and try out various things when we want to dedicate time to it.

Originally posted by @perrette in https://github.com/iiasa/rime/issues/33#issuecomment-2419576120

perrette commented 1 week ago

@NiklasSchwind @cschleussner

NiklasSchwind commented 1 week ago

Hey Mahé, I think the smartest way to implement peaking within the current method would be to introduce another conditional probability: P(IM | GMT, PEAK), together with P(PEAK). P(IM | GMT, PEAK) could be calculated by recording whether a sample comes from before or after peak warming, and P(PEAK) could be calculated from the MAGICC ensemble.

Then, we could instead derive the quantiles of this distribution:

P(IM) = sum_{PEAK, GMT} P(PEAK) P(GMT) P(IM | GMT, PEAK)

by weighting the samples with P(PEAK) P(GMT) P(IM | GMT, PEAK).
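
For concreteness, a minimal sketch of the weighted-quantile step (hypothetical function and argument names; it assumes P(GMT) is already accounted for by how the samples were drawn, so only the P(PEAK) weighting appears explicitly):

```python
import numpy as np

def mixture_quantiles(samples_pre, samples_post, p_peak, qs=(0.05, 0.5, 0.95)):
    """Quantiles of P(IM) = (1 - P(PEAK)) P(IM | pre) + P(PEAK) P(IM | post)."""
    samples = np.concatenate([samples_pre, samples_post])
    weights = np.concatenate([
        np.full(len(samples_pre), (1.0 - p_peak) / len(samples_pre)),
        np.full(len(samples_post), p_peak / len(samples_post)),
    ])
    order = np.argsort(samples)
    cdf = np.cumsum(weights[order])            # weighted empirical CDF, sums to 1
    return np.interp(qs, cdf, samples[order])  # invert the CDF at the requested quantiles
```

e.g. mixture_quantiles(pre, post, p_peak=0.3) for a timestep at which 30% of the MAGICC ensemble is past peak warming.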

That said, I see all the same issues you raised. I think this approach could improve things, but I am not sure. Also, I fear that it could make our output look very non-smooth at and after the peak (as we lack training data there, but also because we still have the time-lag problem). So I cannot predict whether this way of solving the peaking issue would improve things or make them worse (it simply has to be tried).

That said, I also still have this other idea for dealing with time lags and peaking:

A Monte Carlo method where we first pull a GMT and a GMT change rate from the MAGICC distribution, then go to the bin for this GMT and GMT change rate and pull an indicator change rate from there (to implement this, our record would need to include the GMT change rate and the indicator change rate, and the emulator would need a completely new alternative function). Then we produce 10k+ realizations by adding up all of those randomly selected change rates and take our quantiles from this ensemble. How well this method works depends on the training data as well (we need at least CMIP6, as there is no overshoot scenario in ISIMIP; however, this could actually also help with the time lag). Also, I think we would need to experiment with the GMT and GMT change-rate windows that define a bin.

Mathematically, this method would be based on iteratively convolving the distribution of the indicator at a given timestep with the distribution of the change rate to get the distribution of the indicator at the next timestep. The assumption going into it is that the change rate of an indicator depends on the GMT level and the change rate of GMT; I guess this is a weaker assumption than the indicator value depending on the GMT value (and it allows us to account for time lag and peaking).
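
A minimal sketch of that accumulation step (hypothetical names throughout; binned_rates stands in for whatever the calibration record would actually provide):

```python
import numpy as np

rng = np.random.default_rng(0)

def emulate_change_rates(gmt, binned_rates, gmt_bins, dgmt_bins, n_real=10_000):
    """Accumulate randomly drawn indicator change rates along one GMT pathway."""
    dgmt = np.gradient(gmt)                   # GMT change rate per timestep
    indicator = np.zeros((n_real, gmt.size))  # all realizations start at zero anomaly
    for t in range(1, gmt.size):
        # locate the (GMT, GMT change rate) bin for this timestep
        key = (np.digitize(gmt[t], gmt_bins), np.digitize(dgmt[t], dgmt_bins))
        # pull one indicator change rate per realization and accumulate
        indicator[:, t] = indicator[:, t - 1] + rng.choice(binned_rates[key], size=n_real)
    return np.quantile(indicator, [0.05, 0.5, 0.95], axis=0)
```

Repeating this over the MAGICC ensemble members and pooling the realizations would give the 10k+ ensemble described above.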

Let me know what you think about either of those methods @perrette and @cschleussner :) Happy to have a discussion

perrette commented 1 week ago

Hey Niklas, regarding the first part, aren't you formalizing what we were already saying? I.e. using the training data from before peaking when the MAGICC ensemble member is before peaking, and conversely after peaking. In any case, that's also how I'd do it, with the caveats and tradeoffs discussed above. For the second part, adding more predictors is always an option. Monte Carlo vs. binning as we're doing now only affects the calculation time and the complexity of the code (MC is simpler). The training data already includes the rate of warming, and also accumulated warming, by the way. Note also #35.

perrette commented 1 week ago

W.r.t. code complexity, it's much simpler than I first thought, because it turns out the calibration data is the same (using all years), and it's only a matter of filtering the input data at emulation time. I.e. it can all be contained in a single function, without worrying about where which data lives, folder naming, etc. So it's actually easy to experiment with.
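
To illustrate, a tiny sketch of diagnosing the peak per ensemble member and filtering at emulation time (hypothetical names; "post_peak" would be a boolean recorded alongside the calibration samples):

```python
import numpy as np

def post_peak_mask(gmt, window=21):
    """True at timesteps after peak warming for one ensemble member.

    A running mean avoids flagging interannual wiggles as the peak;
    the window length is an assumption to experiment with.
    """
    smooth = np.convolve(gmt, np.ones(window) / window, mode="same")
    return np.arange(gmt.size) > np.argmax(smooth)

# at emulation time, select the matching rows of the single calibration table:
# mask = post_peak_mask(member_gmt)
# rows_t = table[table["post_peak"] == mask[t]]
```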

NiklasSchwind commented 1 week ago

"Hey Niklas, regarding the first part, aren't you formalizing what we were already saying? I.e. using the training data from before peaking when the magicc ensemble member is before peaking, and mutatis mutandis for after peaking? In any case, that's also how I'd do it, with the caveats and tradeoffs discussed above."

I thought that introducing the P(PEAK) probability would be new, but I might have overlooked something. Anyway, I agree that this would probably be a fairly natural addition to the codebase.
About the additional predictor: note that this is not the only "innovation" in the second proposal. The other one is predicting the indicator change rate instead of the indicator value, and deriving the indicator values by adding up the change rates. I think this would work more smoothly with less training data in each bin.

"W.r.t code complexity, it's much simpler than I first thought, because it turns out the calibration data is the same (using all years), and it's only a matter of filtering the input data at the time of emulation. I.e. it can be all contained in a single function without worrying about where which data is, folder naming etc. So it's actually easy to experiment with it."

I think this also holds for the second proposal. The only thing you would need to add to the existing code is to additionally record the GMT and indicator change rates in the training simulations. The Monte Carlo simulation of adding up the change rates could then be implemented in a second emulator function. So while writing this extra emulator function will probably take quite some time, I don't think it would make the codebase much more complex.

perrette commented 1 week ago

Yes, it's all quite simple to experiment with, because all methods share the same calibration data, which only needs to be filtered. If I understood your P(PEAK) correctly, that's what I meant in the second part of this sentence:

So probably you would want two calibration tables (pre- and post-peak), and also diagnose pre- and post-peaking when applying these tables to new temperature pathways.

I.e. apply the pre-peak data to the MAGICC temperature before the peak, and the post-peak data after the peak. Before and after the peak can be determined on an ensemble-member basis. So we are on the same line here. Or did you have something else in mind?

NiklasSchwind commented 1 week ago

Sounds good. Thanks for clarifying :)