orcestra-campaign / weather

Weather briefings for the ORCESTRA field campaign.

Avoid redundant loading of the forecast and climatological dataset #163

Closed: HF7weatherman closed this issue 5 days ago

HF7weatherman commented 1 week ago

From #160:

Furthermore, the download of the climatological data is rather slow due to the large amount of data. It would therefore make sense to load both the ERA5 data and the forecast data not within each plot, but once before all plots are generated. This has the potential to save a lot of time when generating the briefing, as it would avoid redundant loading of the same datasets. However, this should also be addressed in a separate issue.
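To illustrate the idea, here is a minimal sketch of the "load once, reuse for all plots" pattern; the loader names, file names, and plot function are placeholders, not the actual briefing code:

import xarray as xr


def load_forecast_dataset() -> xr.Dataset:
    return xr.open_dataset("forecast.nc")  # placeholder loader and file name


def load_climatology_dataset() -> xr.Dataset:
    return xr.open_dataset("era5_climatology.nc")  # placeholder loader and file name


def plot_precipitation(forecast: xr.Dataset, climatology: xr.Dataset) -> None:
    ...  # stand-in for one of the briefing plots


# Load both datasets once, up front, then hand them to every plot.
forecast = load_forecast_dataset()
climatology = load_climatology_dataset()
plot_precipitation(forecast, climatology)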

@sortega87 Do you think this would make sense and is worth the effort? I think it would change the current infrastructure of the briefing quite a lot, so we should put some thought into it before we start implementing such changes... We can also discuss in person on Barbados before we start working on this.

sortega87 commented 1 week ago

Hi @HF7weatherman. I think it is worth the effort if it is affecting our ability to generate the reports efficiently. We could perhaps keep the changes limited by using a cache (https://docs.python.org/dev/library/functools.html#functools.lru_cache) in functions such as hifs.get_forecast and _get_rolling_daily_climatology.

This could be done by decorating as follows:

from functools import lru_cache

import pandas as pd
import xarray as xr


# Note: lru_cache requires all arguments, including self, to be hashable.
@lru_cache
def get_forecast(
        self,
        variable: str,
        briefing_time: pd.Timestamp,
        lead_hours: str,
        current_time: pd.Timestamp,
        forecast_type: str = "oper",
        differentiate: bool = False,
        differentiate_unit: str = "s",
) -> tuple[pd.Timestamp, xr.DataArray]:
    ...

This would keep a cache with the timestamp and dataarray generated on previous calls and return them if the arguments are the same.
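As a quick illustration of the behaviour, assuming hifs is the object the briefing code already uses (the variable name, lead time, and timestamps below are made-up example values, not the real briefing arguments):

import pandas as pd

briefing_time = pd.Timestamp("2024-08-20 00:00")  # made-up example values
current_time = pd.Timestamp("2024-08-20 09:00")

# The first call misses the cache and does the work; the identical second call is a cache hit.
first = hifs.get_forecast("tp", briefing_time, "012h", current_time)
second = hifs.get_forecast("tp", briefing_time, "012h", current_time)
print(hifs.get_forecast.cache_info())  # e.g. CacheInfo(hits=1, misses=1, maxsize=128, currsize=1)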

HF7weatherman commented 1 week ago

Hi @sortega87! I wasn't aware that Python offers such a caching mechanism. It seems like that's exactly what we need in our case, so I agree that we should definitely implement it. Just as confirmation of how urgent this change is: it took more than an hour for @karstenpeters to generate the briefing this morning...

One question regarding the implementation: You wrote

This would keep a cache with the timestamp and dataarray generated on previous calls and return them if the arguments are the same.

In the case of _get_forecast(), the arguments wouldn't be exactly the same, since lead_hours changes with each call. Do you know whether it would be possible to keep the whole forecast in the cache? This would further reduce the need to reload data. If I see it correctly, _load_forecast_dataset() loads all lead times, so perhaps it would make sense to apply @lru_cache to that function instead?
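Something along these lines, perhaps (the signature, path, and coordinate name below are assumptions for illustration, not the real _load_forecast_dataset implementation):

from functools import lru_cache

import pandas as pd
import xarray as xr


@lru_cache
def _load_forecast_dataset(briefing_time: pd.Timestamp, forecast_type: str = "oper") -> xr.Dataset:
    # Hypothetical path; in reality this opens the full forecast with all lead times.
    path = f"forecast_{forecast_type}_{briefing_time:%Y%m%dT%H}.zarr"
    return xr.open_zarr(path)


def get_forecast(variable: str, briefing_time: pd.Timestamp, lead_hours: str) -> xr.DataArray:
    # After the first call, the cached dataset is reused even though lead_hours changes.
    ds = _load_forecast_dataset(briefing_time)
    return ds[variable].sel(lead_time=lead_hours)  # coordinate name and label type are assumed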

And one last question: Who of us should implement it?

sortega87 commented 1 week ago

Hi @HF7weatherman. Actually, I now think that neither _load_forecast_dataset nor get_forecast loads the data. Xarray only loads the data just before the plots are generated. Thus, we must also force xarray to return an in-memory dataset for the cache to be effective: this is done by calling dataset.load() or dataset.compute().
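For example (a minimal sketch; the function signature and file path are assumptions, just to show where .load() would go):

from functools import lru_cache

import xarray as xr


@lru_cache
def _get_rolling_daily_climatology(variable: str) -> xr.DataArray:
    ds = xr.open_dataset("era5_climatology.nc")  # hypothetical path; opening is lazy, nothing is read yet
    return ds[variable].load()  # .load() (or .compute()) pulls the values into memory, so cache hits skip the slow read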

One problem with _load_forecast_dataset is that the dataset it opens includes all variables and timestamps. Thus, adding the cache to _load_forecast_dataset and forcing it to actually load the data might result in slower run times. We can, however, modify it to load only the variables that are used in the briefings.
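A possible variant along those lines (the variable names and path are placeholders for whatever the briefings actually use):

from functools import lru_cache

import xarray as xr

BRIEFING_VARIABLES = ["tp", "tcwv", "msl"]  # placeholder variable names


@lru_cache
def _load_forecast_dataset(path: str = "forecast.zarr") -> xr.Dataset:
    ds = xr.open_zarr(path)               # hypothetical path, lazily opened
    return ds[BRIEFING_VARIABLES].load()  # keep only the briefing variables, then load them into memory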

I can implement these changes once in Barbados. Is this a good timeline for the implementation?

HF7weatherman commented 1 week ago

Yes, I agree that in that case it would be nice to at least keep the specific variables in memory. Making that change on Barbados is totally sufficient. Today was the last official briefing, and they won't resume before the 7th on Barbados, so implementing this before the 10th of September sounds like a reasonable timeline to me.

HF7weatherman commented 1 week ago

Furthermore, these changes would also partly address #144. The briefing might become fast enough that #144 isn't necessary anymore.