ktindiana / sphinxval

SPHINX validation code for solar energetic particle models
MIT License
3 stars 3 forks source link

Performance Optimizations: Round 2 #139

Open rickyegeland opened 2 days ago

rickyegeland commented 2 days ago

SPHINX performance was greatly improved with PR #138. However, reprocessing of the entire Scoreboard dataset still requires >2 days of processing time, and full reprocessing will be done more frequently as development is active and bug fixes and new features added.

Here is a new cProfile and pstats performance report. This time I profiled the entire bin/sphinx.py command, running on the 202109 month with --Resume for the dataframe up through 202108. That is the same input I profiled on last time, but now I am profiling the latest version (d111f98) with the validation optimizations.

These reports show that setup_match_all_forecasts() is now the heaviest part of the code (2948 s), followed by load_objects_from_json() (1162 s). The latter can probably only be optimized by using a different json decoder, and it looks like there are faster options. The former might be attacked by looking at other calls down the stack, like does_win_overlap(), pred_and_obs_overlap(), calculate_derived_quantities().

Note that I should also profile a later month where the execution times are longer and the validation step starts to outpace the input/matching steps again. The list ranking will change.

Completion of this Issue is arbitrary, as optimization can be done forever, but one stopping point could be defined as when the slowest function calls described above have been optimized or at least investigated.

=== Top 20 sphinxval calls: ===

>>> p.sort_stats(SortKey.CUMULATIVE).print_stats('sphinxval', 20)
Wed Sep 25 21:24:14 2024    sphinx_202109.prof

         6302607486 function calls (6289558451 primitive calls) in 5637.992 seconds

   Ordered by: cumulative time
   List reduced from 11748 to 313 due to restriction <'sphinxval'>
   List reduced from 313 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.004    0.004 5452.997 5452.997 /home/m_sphinx/test_reqs/sphinxval/sphinxval/sphinx/sphinx.py:36(validate)
        1    1.479    1.479 3840.782 3840.782 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:2431(match_all_forecasts)
        1    7.120    7.120 2948.709 2948.709 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:2141(setup_match_all_forecasts)
        1    2.941    2.941 1162.914 1162.914 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/validation_json_handler.py:243(load_objects_from_json)
        2    1.444    0.722 1037.347  518.673 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/validation_json_handler.py:156(read_json_list)
   136730    1.260    0.000 1026.891    0.008 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/validation_json_handler.py:39(read_in_json)
    51236    0.148    0.000  935.533    0.018 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:330(does_win_overlap)
    51236   71.846    0.001  935.324    0.018 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:289(pred_and_obs_overlap)
    51236    0.712    0.000  838.407    0.016 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:1878(calculate_derived_quantities)
      714    0.023    0.000  835.367    1.170 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:1778(read_in_time_profiles)
      744   90.386    0.121  778.732    1.047 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/time_profile.py:86(read_single_time_profile)
 29026553   77.202    0.000  692.634    0.000 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/validation_json_handler.py:80(zulu_to_time)
    51236   37.392    0.001  602.987    0.012 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:1195(observed_ongoing_event)
   256180    7.337    0.000  444.909    0.002 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:503(is_time_before_thresh)
        1    0.274    0.274  440.335  440.335 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/validation.py:3565(intuitive_validation)
   204944    6.554    0.000  438.449    0.002 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:460(is_time_before)
        5    0.050    0.010  390.039   78.008 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/validation.py:3410(calculate_intuitive_metrics)
    51236    1.656    0.000  329.352    0.006 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:634(onset_peak_criteria)
    51236    2.254    0.000  318.864    0.006 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:719(max_flux_criteria)
   103238   23.402    0.000  311.896    0.003 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:544(<listcomp>)

<pstats.Stats object at 0x7f2f49a3da20>

=== Top 20 calls (no filter) ===

>>> p.sort_stats(SortKey.CUMULATIVE).print_stats(20)
Wed Sep 25 21:24:14 2024    sphinx_202109.prof

         6302607486 function calls (6289558451 primitive calls) in 5637.992 seconds

   Ordered by: cumulative time
   List reduced from 11748 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1749/1    0.039    0.000 5640.314 5640.314 {built-in method builtins.exec}
        1    0.273    0.273 5640.314 5640.314 bin/sphinx.py:1(<module>)
        1    0.004    0.004 5452.997 5452.997 /home/m_sphinx/test_reqs/sphinxval/sphinxval/sphinx/sphinx.py:36(validate)
        1    1.479    1.479 3840.782 3840.782 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:2431(match_all_forecasts)
        1    7.120    7.120 2948.709 2948.709 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:2141(setup_match_all_forecasts)
113368214  255.747    0.000 1488.743    0.000 /home/m_sphinx/test_reqs/venv/lib/python3.10/site-packages/pandas/core/series.py:943(__getitem__)
        1    2.941    2.941 1162.914 1162.914 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/validation_json_handler.py:243(load_objects_from_json)
        2    1.444    0.722 1037.347  518.673 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/validation_json_handler.py:156(read_json_list)
   136730    1.260    0.000 1026.891    0.008 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/validation_json_handler.py:39(read_in_json)
113370946  124.693    0.000  991.748    0.000 /home/m_sphinx/test_reqs/venv/lib/python3.10/site-packages/pandas/core/series.py:1052(_get_value)
    51236    0.148    0.000  935.533    0.018 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:330(does_win_overlap)
    51236   71.846    0.001  935.324    0.018 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:289(pred_and_obs_overlap)
    51236    0.712    0.000  838.407    0.016 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:1878(calculate_derived_quantities)
      714    0.023    0.000  835.367    1.170 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:1778(read_in_time_profiles)
      744   90.386    0.121  778.732    1.047 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/time_profile.py:86(read_single_time_profile)
113367744   95.190    0.000  739.693    0.000 /home/m_sphinx/test_reqs/venv/lib/python3.10/site-packages/pandas/core/indexes/base.py:5660(_get_values_for_loc)
2793884/1868233    7.403    0.000  701.081    0.000 /home/m_sphinx/test_reqs/venv/lib/python3.10/site-packages/pandas/core/ops/common.py:55(new_method)
 29026553   77.202    0.000  692.634    0.000 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/validation_json_handler.py:80(zulu_to_time)
    51236   37.392    0.001  602.987    0.012 /home/m_sphinx/test_reqs/sphinxval/sphinxval/utils/match.py:1195(observed_ongoing_event)
 29026553   22.015    0.000  590.337    0.000 {built-in method strptime}

<pstats.Stats object at 0x7f2f49a3da20>
lukestegeman commented 1 day ago

In match.pred_and_obs_overlap(), we should be able to improve the performance a bit by replacing

pred_interval = pd.Interval(pd.Timestamp(pred_win_st), pd.Timestamp(pred_win_end))
overlaps_bool = []
for i in range(len(obs['observation_window_start'])):
    obs_interval = pd.Interval(pd.Timestamp(obs['observation_window_start'][i]), pd.Timestamp(obs['observation_window_end'][i]))
    if pred_interval.overlaps(obs_interval):
        overlaps_bool.append(True)
    else:
        overlaps_bool.append(False)

with

overlaps_bool = ((obs['observation_window_start'] <= pred_end) & (obs['observation_window_end'] >= pred_start)).tolist()

This eliminates another instance of looping over the rows of a dataframe.