jbms / beancount-import

Web UI for semi-automatically importing external data into beancount
GNU General Public License v2.0
392 stars 101 forks source link

Micro-optimizations to speed up preprocessing by ~2x #214

Closed ankurdave closed 10 months ago

ankurdave commented 11 months ago

On my 12,000-transaction ledger, beancount_import.reconcile.LoadedReconciler._preprocess_entries() currently takes 11.84 seconds. I measured and profiled it as follows:

modified   beancount_import/reconcile.py
@@ -442,2 +442,5 @@ class LoadedReconciler(object):
     def _preprocess_entries(self):
+        import cProfile
+        pr = cProfile.Profile()
+        pr.enable()
         posting_db = self.posting_db
@@ -453,2 +456,4 @@ class LoadedReconciler(object):
                 self.balance_entries[key] = entry.amount.number
+        pr.disable()
+        pr.print_stats(sort='time')

This PR contains the following micro-optimizations to matching.py that brought the time down to 5.98 seconds, a 2x speedup:

  1. Avoid duplicate calls to _entry_and_posting_ids_key() in add_posting(), remove_posting(), and _get_matches().
  2. Hoist the call to _search_posting() out of the fuzzy date loop in add_posting() and remove_posting().
Zburatorul commented 10 months ago

Impressive work @ankurdave. I've never had the stomach for understand the matching logic.

Zburatorul commented 10 months ago

Also, please paste the profiler call table for posterity :)

Zburatorul commented 10 months ago

Wrong button. Meant to merge, not close.

ankurdave commented 10 months ago

Before:

         20664200 function calls in 11.972 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1017600    2.767    0.000    4.268    0.000 matching.py:215(_entry_and_posting_ids_key)
    84800    1.272    0.000    9.529    0.000 matching.py:283(add_posting)
   932800    1.229    0.000    6.268    0.000 matching.py:272(_search_posting)
  3924816    1.004    0.000    1.368    0.000 matching.py:216(<genexpr>)
  1017600    0.895    0.000    0.895    0.000 matching.py:256(get_fuzzy_date_range)
  5069708    0.622    0.000    0.622    0.000 {built-in method builtins.id}
   476638    0.493    0.000    0.997    0.000 matching.py:871(posting_set_id)
  1130506    0.432    0.000    0.673    0.000 matching.py:196(get_posting_weight)
  1017600    0.397    0.000    1.292    0.000 matching.py:261(fuzz_date_currency_key)
  1607694    0.380    0.000    0.503    0.000 matching.py:872(<genexpr>)
    57773    0.357    0.000    1.633    0.000 matching.py:884(add_subset)
     8944    0.277    0.000    0.277    0.000 {built-in method builtins.max}
  1263068    0.267    0.000    0.267    0.000 {built-in method __new__ of type object at 0x10e96ffa8}
    12181    0.134    0.000    2.426    0.000 __init__.py:731(wrapper)
  1018444    0.119    0.000    0.119    0.000 {method 'append' of 'list' objects}
    91051    0.089    0.000    0.164    0.000 amount.py:42(__new__)
    96981    0.086    0.000    2.653    0.000 matching.py:930(get_matchable_postings)
     9968    0.078    0.000    0.078    0.000 {method 'items' of 'collections.OrderedDict' objects}
    12181    0.077    0.000   12.370    0.001 matching.py:375(add_transaction)
    57773    0.075    0.000    0.129    0.000 {built-in method builtins.sum}
    84800    0.074    0.000    0.202    0.000 matching.py:265(_date_currency_key)
    34304    0.066    0.000    0.178    0.000 booking_full.py:680(convert_costspec_to_cost)
     9968    0.062    0.000    1.784    0.000 matching.py:856(_get_uncleared_aggregate_posting_candidates)
   273197    0.054    0.000    0.054    0.000 matching.py:885(<genexpr>)
    84800    0.048    0.000    0.075    0.000 matching.py:191(is_unknown_account)
    12181    0.045    0.000    2.513    0.000 matching.py:812(get_aggregate_posting_candidates)
    84800    0.041    0.000    0.082    0.000 matching.py:211(_date_key)
    84800    0.034    0.000    0.054    0.000 <string>:1(<lambda>)
    84800    0.034    0.000    0.041    0.000 posting_date.py:5(get_posting_date)
    34304    0.033    0.000    0.085    0.000 __init__.py:452(_replace)
     9968    0.033    0.000    0.433    0.000 __init__.py:72(__setitem__)
   258325    0.033    0.000    0.033    0.000 {built-in method builtins.isinstance}
    12181    0.027    0.000    0.092    0.000 matching.py:927(<listcomp>)
    84793    0.027    0.000    0.027    0.000 {method 'startswith' of 'str' objects}
    21125    0.023    0.000    0.054    0.000 __init__.py:179(__getitem__)
    21125    0.023    0.000    0.028    0.000 __init__.py:66(__getitem__)
    34304    0.020    0.000    0.052    0.000 __init__.py:442(_make)
     8944    0.019    0.000    0.312    0.000 __init__.py:610(most_common)
    24714    0.019    0.000    0.029    0.000 reconcile.py:460(is_posting_cleared)
    97051    0.017    0.000    0.017    0.000 {method 'get' of 'dict' objects}
   106943    0.015    0.000    0.015    0.000 {built-in method builtins.len}
     9968    0.015    0.000    0.449    0.000 __init__.py:185(__setitem__)
     8944    0.015    0.000    0.396    0.000 __init__.py:193(popitem)
    12181    0.015    0.000    0.024    0.000 matching.py:856(<lambda>)
     8944    0.015    0.000    0.069    0.000 __init__.py:111(pop)
     8944    0.012    0.000    0.291    0.000 heapq.py:523(nlargest)
    12181    0.010    0.000    0.111    0.000 matching.py:965(get_matchable_postings_from_transaction)
     8944    0.010    0.000    0.012    0.000 __init__.py:88(__delitem__)
     8944    0.010    0.000    0.032    0.000 __init__.py:189(__delitem__)
    40287    0.010    0.000    0.010    0.000 matching.py:949(<genexpr>)
     8944    0.010    0.000    0.010    0.000 __init__.py:728(__delitem__)
    12181    0.008    0.000    0.101    0.000 matching.py:926(get_weighted_postings)
    26017    0.007    0.000    0.009    0.000 matching.py:856(<genexpr>)
    20101    0.006    0.000    0.006    0.000 __init__.py:93(__contains__)
     9968    0.006    0.000    0.006    0.000 __init__.py:96(__missing__)
    13836    0.006    0.000    0.006    0.000 {method 'setdefault' of 'collections.OrderedDict' objects}
    11583    0.005    0.000    0.005    0.000 ofx.py:1528(is_posting_cleared)
     9968    0.002    0.000    0.002    0.000 __init__.py:35(__setitem__)
     8944    0.002    0.000    0.002    0.000 {built-in method builtins.iter}
     8944    0.002    0.000    0.002    0.000 {method 'items' of 'dict' objects}
     9968    0.002    0.000    0.002    0.000 __init__.py:138(getsizeof)
     6328    0.002    0.000    0.002    0.000 {built-in method builtins.abs}
     9968    0.001    0.000    0.001    0.000 __init__.py:601(__missing__)
     8944    0.001    0.000    0.001    0.000 __init__.py:38(pop)
      521    0.001    0.000    0.001    0.000 {method 'extend' of 'list' objects}
      294    0.000    0.000    0.000    0.000 {built-in method builtins.hash}
      300    0.000    0.000    0.001    0.000 {method 'add' of 'set' objects}
      342    0.000    0.000    0.000    0.000 {built-in method builtins.min}
      379    0.000    0.000    0.000    0.000 venmo_json_source.py:268(is_posting_cleared)
      857    0.000    0.000    0.000    0.000 costco_receipt_source.py:237(is_posting_cleared)
       20    0.000    0.000    0.000    0.000 matching.py:874(partition)
      117    0.000    0.000    0.000    0.000 adp_payroll_source.py:195(is_posting_cleared)
      294    0.000    0.000    0.000    0.000 amount.py:101(__hash__)
      115    0.000    0.000    0.000    0.000 cashapp_csv_source.py:209(is_posting_cleared)
      269    0.000    0.000    0.000    0.000 matching.py:907(<lambda>)
       37    0.000    0.000    0.000    0.000 workday_payroll_source.py:211(is_posting_cleared)
       14    0.000    0.000    0.000    0.000 amount.py:149(mul)
       44    0.000    0.000    0.000    0.000 boa_mortgage_csv_source.py:175(is_posting_cleared)
        3    0.000    0.000    0.000    0.000 {method 'setdefault' of 'dict' objects}
        8    0.000    0.000    0.000    0.000 amount.py:83(__eq__)
        1    0.000    0.000    0.000    0.000 journal_editor.py:296(all_entries)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

After:

         9725404 function calls in 5.913 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    84800    0.897    0.000    3.127    0.000 matching.py:283(add_posting)
  1017600    0.766    0.000    0.766    0.000 matching.py:256(get_fuzzy_date_range)
   476638    0.670    0.000    1.164    0.000 matching.py:871(posting_set_id)
  1607694    0.372    0.000    0.494    0.000 matching.py:872(<genexpr>)
  1017600    0.369    0.000    1.134    0.000 matching.py:261(fuzz_date_currency_key)
    57773    0.347    0.000    1.784    0.000 matching.py:884(add_subset)
    84800    0.347    0.000    0.482    0.000 matching.py:215(_entry_and_posting_ids_key)
     8944    0.254    0.000    0.254    0.000 {built-in method builtins.max}
  1471960    0.169    0.000    0.169    0.000 {built-in method builtins.id}
    21125    0.166    0.000    0.193    0.000 __init__.py:179(__getitem__)
  1018444    0.112    0.000    0.112    0.000 {method 'append' of 'list' objects}
   282506    0.109    0.000    0.185    0.000 matching.py:196(get_posting_weight)
    84800    0.103    0.000    0.198    0.000 matching.py:272(_search_posting)
   327068    0.090    0.000    0.123    0.000 matching.py:216(<genexpr>)
   343628    0.079    0.000    0.079    0.000 {built-in method __new__ of type object at 0x1008c6fa8}
    96981    0.077    0.000    2.703    0.000 matching.py:930(get_matchable_postings)
    57773    0.072    0.000    0.125    0.000 {built-in method builtins.sum}
    84800    0.070    0.000    0.193    0.000 matching.py:265(_date_currency_key)
    12181    0.065    0.000    5.999    0.000 matching.py:376(add_transaction)
    67231    0.064    0.000    0.117    0.000 amount.py:42(__new__)
     9968    0.057    0.000    1.853    0.000 matching.py:856(_get_uncleared_aggregate_posting_candidates)
   273197    0.054    0.000    0.054    0.000 matching.py:885(<genexpr>)
    84800    0.042    0.000    0.064    0.000 matching.py:191(is_unknown_account)
    12181    0.040    0.000    2.576    0.000 matching.py:812(get_aggregate_posting_candidates)
    84800    0.038    0.000    0.079    0.000 matching.py:211(_date_key)
    12181    0.033    0.000    2.497    0.000 __init__.py:731(wrapper)
    84800    0.033    0.000    0.040    0.000 posting_date.py:5(get_posting_date)
    84800    0.033    0.000    0.050    0.000 <string>:1(<lambda>)
     9968    0.030    0.000    0.393    0.000 __init__.py:72(__setitem__)
    12181    0.025    0.000    0.086    0.000 matching.py:927(<listcomp>)
   186855    0.023    0.000    0.023    0.000 {built-in method builtins.isinstance}
    84793    0.022    0.000    0.022    0.000 {method 'startswith' of 'str' objects}
    10494    0.021    0.000    0.058    0.000 booking_full.py:680(convert_costspec_to_cost)
    21125    0.019    0.000    0.025    0.000 __init__.py:66(__getitem__)
    24714    0.018    0.000    0.028    0.000 reconcile.py:460(is_posting_cleared)
    12181    0.017    0.000    0.027    0.000 matching.py:856(<lambda>)
     8944    0.017    0.000    0.286    0.000 __init__.py:610(most_common)
    97051    0.017    0.000    0.017    0.000 {method 'get' of 'dict' objects}
     9968    0.014    0.000    0.408    0.000 __init__.py:185(__setitem__)
     8944    0.013    0.000    0.061    0.000 __init__.py:111(pop)
     8944    0.012    0.000    0.359    0.000 __init__.py:193(popitem)
     8944    0.012    0.000    0.268    0.000 heapq.py:523(nlargest)
    10494    0.012    0.000    0.029    0.000 __init__.py:452(_replace)
    83133    0.011    0.000    0.011    0.000 {built-in method builtins.len}
    12181    0.010    0.000    0.103    0.000 matching.py:965(get_matchable_postings_from_transaction)
    40287    0.009    0.000    0.009    0.000 matching.py:949(<genexpr>)
     8944    0.009    0.000    0.010    0.000 __init__.py:88(__delitem__)
     8944    0.009    0.000    0.009    0.000 __init__.py:728(__delitem__)
     8944    0.008    0.000    0.027    0.000 __init__.py:189(__delitem__)
    26017    0.008    0.000    0.010    0.000 matching.py:856(<genexpr>)
    12181    0.007    0.000    0.093    0.000 matching.py:926(get_weighted_postings)
    10494    0.007    0.000    0.017    0.000 __init__.py:442(_make)
     9968    0.006    0.000    0.006    0.000 __init__.py:96(__missing__)
    20101    0.006    0.000    0.006    0.000 __init__.py:93(__contains__)
    13836    0.005    0.000    0.005    0.000 {method 'setdefault' of 'collections.OrderedDict' objects}
    11583    0.005    0.000    0.005    0.000 ofx.py:1528(is_posting_cleared)
     9968    0.002    0.000    0.002    0.000 __init__.py:35(__setitem__)
     9968    0.002    0.000    0.002    0.000 {method 'items' of 'collections.OrderedDict' objects}
     8944    0.002    0.000    0.002    0.000 {built-in method builtins.iter}
     9968    0.002    0.000    0.002    0.000 __init__.py:138(getsizeof)
     8944    0.002    0.000    0.002    0.000 {method 'items' of 'dict' objects}
     8944    0.002    0.000    0.002    0.000 __init__.py:38(pop)
     9968    0.001    0.000    0.001    0.000 __init__.py:601(__missing__)
      521    0.001    0.000    0.001    0.000 {method 'extend' of 'list' objects}
     1808    0.000    0.000    0.000    0.000 {built-in method builtins.abs}
      294    0.000    0.000    0.000    0.000 {built-in method builtins.hash}
      300    0.000    0.000    0.001    0.000 {method 'add' of 'set' objects}
      342    0.000    0.000    0.000    0.000 {built-in method builtins.min}
      857    0.000    0.000    0.000    0.000 costco_receipt_source.py:237(is_posting_cleared)
      379    0.000    0.000    0.000    0.000 venmo_json_source.py:268(is_posting_cleared)
      294    0.000    0.000    0.000    0.000 amount.py:101(__hash__)
       20    0.000    0.000    0.000    0.000 matching.py:874(partition)
      117    0.000    0.000    0.000    0.000 adp_payroll_source.py:195(is_posting_cleared)
      269    0.000    0.000    0.000    0.000 matching.py:907(<lambda>)
      115    0.000    0.000    0.000    0.000 cashapp_csv_source.py:209(is_posting_cleared)
       37    0.000    0.000    0.000    0.000 workday_payroll_source.py:211(is_posting_cleared)
        4    0.000    0.000    0.000    0.000 amount.py:149(mul)
       44    0.000    0.000    0.000    0.000 boa_mortgage_csv_source.py:175(is_posting_cleared)
        3    0.000    0.000    0.000    0.000 {method 'setdefault' of 'dict' objects}
        8    0.000    0.000    0.000    0.000 amount.py:83(__eq__)
        1    0.000    0.000    0.000    0.000 journal_editor.py:296(all_entries)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}