Open rickyegeland opened 2 months ago
In `match.pred_and_obs_overlap()`, we should be able to improve the performance a bit by replacing

    pred_interval = pd.Interval(pd.Timestamp(pred_win_st), pd.Timestamp(pred_win_end))
    overlaps_bool = []
    for i in range(len(obs['observation_window_start'])):
        obs_interval = pd.Interval(pd.Timestamp(obs['observation_window_start'][i]), pd.Timestamp(obs['observation_window_end'][i]))
        if pred_interval.overlaps(obs_interval):
            overlaps_bool.append(True)
        else:
            overlaps_bool.append(False)

with

    overlaps_bool = ((obs['observation_window_start'] <= pred_end) & (obs['observation_window_end'] >= pred_start)).tolist()

This eliminates another instance of looping over the rows of a dataframe.
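One detail that may be worth checking before swapping the two snippets: `pd.Interval` defaults to `closed='right'`, so the loop treats windows that only touch at an endpoint as non-overlapping, while the `<=` / `>=` comparison counts them as overlapping. The sketch below compares both forms on a small made-up dataframe. The function names, the `strict` flag, and the sample timestamps are assumptions for illustration, not SPHINX code, and which boundary behavior SPHINX intends is not stated here.

```python
import pandas as pd


def overlaps_loop(obs, pred_start, pred_end):
    """Row-by-row check that mirrors the pd.Interval loop (closed='right')."""
    pred_interval = pd.Interval(pd.Timestamp(pred_start), pd.Timestamp(pred_end))
    result = []
    for start, end in zip(obs['observation_window_start'],
                          obs['observation_window_end']):
        obs_interval = pd.Interval(pd.Timestamp(start), pd.Timestamp(end))
        result.append(bool(pred_interval.overlaps(obs_interval)))
    return result


def overlaps_vectorized(obs, pred_start, pred_end, strict=False):
    """Column-wise check; strict=True reproduces the loop at touching endpoints."""
    pred_start = pd.Timestamp(pred_start)
    pred_end = pd.Timestamp(pred_end)
    starts = obs['observation_window_start']
    ends = obs['observation_window_end']
    if strict:
        mask = (starts < pred_end) & (ends > pred_start)
    else:
        mask = (starts <= pred_end) & (ends >= pred_start)
    return mask.tolist()


# Hypothetical observation windows: before, touching start, inside, touching end.
obs = pd.DataFrame({
    'observation_window_start': pd.to_datetime([
        '2021-09-01 00:00', '2021-09-01 06:00',
        '2021-09-01 15:00', '2021-09-01 18:00']),
    'observation_window_end': pd.to_datetime([
        '2021-09-01 06:00', '2021-09-01 12:00',
        '2021-09-01 20:00', '2021-09-01 23:00']),
})
pred_start, pred_end = '2021-09-01 12:00', '2021-09-01 18:00'

loop_result = overlaps_loop(obs, pred_start, pred_end)
strict_result = overlaps_vectorized(obs, pred_start, pred_end, strict=True)
inclusive_result = overlaps_vectorized(obs, pred_start, pred_end)
```

With this input the strict form agrees with the loop, and the inclusive form additionally flags the two rows that only touch the prediction window. If touching windows should count as a match, the inclusive comparison is the one to keep.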
@lukestegeman okay, I have updated SPHINX with this code.
SPHINX performance was greatly improved with PR #138. However, reprocessing the entire Scoreboard dataset still takes more than 2 days, and full reprocessing will be needed more frequently while development is active and bug fixes and new features are being added.
Here is a new `cProfile` and `pstats` performance report. This time I profiled the entire `bin/sphinx.py` command, running on the 202109 month with `--Resume` for the dataframe up through 202108. That is the same input I profiled on last time, but now I am profiling the latest version (d111f98) with the validation optimizations.
These reports show that `setup_match_all_forecasts()` is now the heaviest part of the code (2948 s), followed by `load_objects_from_json()` (1162 s). The latter can probably only be optimized by using a different json decoder, and it looks like there are faster options. The former might be attacked by looking at other calls down the stack, like `does_win_overlap()`, `pred_and_obs_overlap()`, `calculate_derived_quantities()`.
Note that I should also profile a later month where the execution times are longer and the validation step starts to outpace the input/matching steps again. The list ranking will change.
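As one possible way to try a different JSON decoder, the sketch below uses the third-party `orjson` package when it is installed and falls back to the standard library `json` module otherwise. The helper `load_json_file()`, the `DECODER` flag, and the sample payload are hypothetical and are not taken from SPHINX. Whether this fits `load_objects_from_json()` is an open question: `orjson.loads()` has no `object_hook` argument, so any custom object construction would have to happen after decoding.

```python
import json
import os
import tempfile

try:
    import orjson  # optional third-party decoder; assumed, not a SPHINX dependency

    def loads(data):
        return orjson.loads(data)

    DECODER = "orjson"
except ImportError:
    def loads(data):
        return json.loads(data)

    DECODER = "json"


def load_json_file(path):
    """Read a file as bytes and decode it with whichever decoder is available."""
    with open(path, "rb") as fh:
        return loads(fh.read())


# Round-trip a small made-up payload through a temporary file.
payload = {"forecasts": [{"model": "example", "start": "2021-09-01T00:00:00Z"}]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "forecast.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    loaded = load_json_file(path)
```

Any speedup would need to be measured on real forecast files, since the decoding share of the 1162 s is not broken out in the numbers above.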
Completion of this Issue is arbitrary, as optimization can be done forever, but one stopping point could be defined as when the slowest function calls described above have been optimized or at least investigated.
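For re-ranking the slowest calls on a later month, a minimal sketch of collecting and sorting a profile with the standard library `cProfile` and `pstats` modules is shown below. The `workload()` and `slow_sum()` functions are stand-ins, and `profile_call()` is a hypothetical helper rather than part of SPHINX. A whole script can also be profiled from the command line with `python -m cProfile -o <output file> <script>`.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Stand-in for an expensive function."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    """Stand-in for the code being profiled."""
    return [slow_sum(20_000) for _ in range(20)]


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so callers high in the stack rank first.
    stats.strip_dirs().sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(workload, top=5)
print(report)
```

Sorting by `"cumulative"` highlights functions whose total time includes their callees, which suits tracing a heavy caller down the stack. Sorting by `"tottime"` instead shows where time is spent in a function's own body.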