macxred / pyledger

Python package to streamline the implementation and management of accounting systems.
MIT License
0 stars 0 forks source link

Python profiling #157

Open AlexTheWizardL opened 2 weeks ago

AlexTheWizardL commented 2 weeks ago

Profiling Instructions

Preliminary Step

Before starting profiling, checkout the dedicated profiling branch:

git checkout feat/profiling-functionality

Branch URL:
https://github.com/macxred/pyledger/tree/feat/profiling-functionality

Tools Used

Setup

pip install pyinstrument line_profiler

Run pyinstrument for High-Level Overview

python profiling.py

Run kernprof for Detailed Line-by-Line Timing

kernprof -l -v profiling.py

About profiling.py

The profiling.py file serves as the entry point for running the profiling. It is currently configured to call the account_balance method to demonstrate how profiling results are collected. This entry point can be modified to target any other method or workflow that requires performance analysis.

Sharing Profiling Results

You can share the profiling output by posting it as a comment under this issue.

lasuk commented 2 weeks ago

Running python profiling.py in commit 06a32c5 on my MacBook Air M2:

Profiling output: ``` _ ._ __/__ _ _ _ _ _/_ Recorded: 07:11:44 Samples: 13869 /_//_/// /_\ / //_// / //_'/ // Duration: 14.157 CPU time: 14.110 / _/ v5.0.1 Profile at /Users/lukas/macx/pyledger/profiling.py:6 14.156 profiling.py:1 └─ 14.150 TextLedger.account_balance pyledger/ledger_engine.py:495 └─ 14.145 TextLedger._account_balance_range pyledger/ledger_engine.py:478 └─ 14.145 TextLedger._account_balance_list pyledger/ledger_engine.py:465 └─ 14.145 TextLedger._single_account_balance pyledger/standalone_ledger.py:153 └─ 14.143 wrapper pyledger/decorators.py:22 └─ 14.143 TextLedger.serialized_ledger pyledger/standalone_ledger.py:166 ├─ 12.885 TextLedger.complete_ledger pyledger/standalone_ledger.py:175 │ ├─ 7.176 TextLedger.sanitize_journal pyledger/ledger_engine.py:850 │ │ ├─ 2.266 TextLedger._fill_report_amounts pyledger/ledger_engine.py:1068 │ │ │ └─ 2.253 TextLedger.report_amount pyledger/ledger_engine.py:1274 │ │ │ ├─ 1.283 TextLedger.round_to_precision pyledger/ledger_engine.py:1224 │ │ │ │ └─ 1.280 round_scalar pyledger/ledger_engine.py:1247 │ │ │ │ └─ 1.259 TextLedger.precision pyledger/ledger_engine.py:1467 │ │ │ │ ├─ 0.469 wrapper pyledger/decorators.py:22 │ │ │ │ │ └─ 0.466 deepcopy copy.py:118 │ │ │ │ │ [6 frames hidden] copy, pandas │ │ │ │ ├─ 0.311 Series.new_method pandas/core/ops/common.py:62 │ │ │ │ │ [2 frames hidden] pandas │ │ │ │ └─ 0.144 Series.__getitem__ pandas/core/series.py:1095 │ │ │ └─ 0.960 TextLedger.price pyledger/ledger_engine.py:1351 │ │ │ ├─ 0.473 wrapper pyledger/decorators.py:22 │ │ │ │ └─ 0.465 deepcopy copy.py:118 │ │ │ │ [7 frames hidden] copy, pandas │ │ │ ├─ 0.203 DatetimeProperties.f pandas/core/accessor.py:111 │ │ │ │ └─ 0.199 DatetimeProperties._delegate_method pandas/core/indexes/accessors.py:126 │ │ │ └─ 0.155 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ ├─ 2.002 TextLedger._invalid_currency pyledger/ledger_engine.py:973 │ │ │ └─ 2.002 DataFrame.apply pandas/core/frame.py:10170 │ │ │ [2 frames hidden] pandas │ │ │ 2.000 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ │ │ └─ 1.988 is_invalid pyledger/ledger_engine.py:975 │ │ │ └─ 1.836 TextLedger.account_currency pyledger/ledger_engine.py:438 │ │ │ ├─ 0.837 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ │ │ [14 frames hidden] pandas │ │ │ ├─ 0.442 wrapper pyledger/decorators.py:22 │ │ │ │ └─ 0.439 deepcopy copy.py:118 │ │ │ │ [7 frames hidden] pandas │ │ │ ├─ 0.303 Series.new_method pandas/core/ops/common.py:62 │ │ │ │ [4 frames hidden] pandas │ │ │ └─ 0.171 DataFrame.__getitem__ pandas/core/frame.py:4062 │ │ ├─ 1.425 TextLedger._invalid_assets pyledger/ledger_engine.py:951 │ │ │ └─ 1.424 DataFrame.apply pandas/core/frame.py:10170 │ │ │ [2 frames hidden] pandas │ │ │ 1.424 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ │ │ └─ 1.408 is_invalid pyledger/ledger_engine.py:953 │ │ │ └─ 1.382 TextLedger.precision pyledger/ledger_engine.py:1467 │ │ │ ├─ 0.493 wrapper pyledger/decorators.py:22 │ │ │ │ └─ 0.488 deepcopy copy.py:118 │ │ │ │ [6 frames hidden] copy, pandas │ │ │ ├─ 0.343 Series.new_method pandas/core/ops/common.py:62 │ │ │ │ [2 frames hidden] pandas │ │ │ └─ 0.167 Series.__getitem__ pandas/core/series.py:1095 │ │ ├─ 1.135 TextLedger._invalid_prices pyledger/ledger_engine.py:1003 │ │ │ └─ 1.135 DataFrame.apply pandas/core/frame.py:10170 │ │ │ [2 frames hidden] pandas │ │ │ 1.134 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ │ │ └─ 1.118 is_invalid pyledger/ledger_engine.py:1005 │ │ │ └─ 1.055 TextLedger.price pyledger/ledger_engine.py:1351 │ │ │ ├─ 0.558 wrapper pyledger/decorators.py:22 │ │ │ │ └─ 0.454 deepcopy copy.py:118 │ │ │ │ [7 frames hidden] copy, pandas │ │ │ ├─ 0.202 DatetimeProperties.f pandas/core/accessor.py:111 │ │ │ │ └─ 0.202 DatetimeProperties._delegate_method pandas/core/indexes/accessors.py:126 │ │ │ └─ 0.172 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ │ [2 frames hidden] pandas │ │ ├─ 0.175 TextLedger._invalid_tax_codes pyledger/ledger_engine.py:907 │ │ │ └─ 0.171 TextLedger.sanitized_accounts_tax_codes pyledger/ledger_engine.py:411 │ │ │ └─ 0.165 TextLedger.sanitize_accounts pyledger/ledger_engine.py:357 │ │ │ └─ 0.159 validate_currency pyledger/ledger_engine.py:375 │ │ │ └─ 0.159 DataFrame.apply pandas/core/frame.py:10170 │ │ │ [2 frames hidden] pandas │ │ │ 0.159 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ │ │ └─ 0.155 is_valid pyledger/ledger_engine.py:377 │ │ │ └─ 0.154 TextLedger.precision pyledger/ledger_engine.py:1467 │ │ └─ 0.161 TextLedger._invalid_accounts pyledger/ledger_engine.py:919 │ │ └─ 0.160 TextLedger.sanitized_accounts_tax_codes pyledger/ledger_engine.py:411 │ │ └─ 0.157 TextLedger.sanitize_accounts pyledger/ledger_engine.py:357 │ │ └─ 0.152 validate_currency pyledger/ledger_engine.py:375 │ │ └─ 0.152 DataFrame.apply pandas/core/frame.py:10170 │ │ [2 frames hidden] pandas │ │ 0.152 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ │ └─ 0.148 is_valid pyledger/ledger_engine.py:377 │ │ └─ 0.148 TextLedger.precision pyledger/ledger_engine.py:1467 │ ├─ 3.776 TextLedger.revaluation_entries pyledger/standalone_ledger.py:190 │ │ ├─ 1.012 TextLedger._balance_from_serialized_ledger pyledger/standalone_ledger.py:248 │ │ │ ├─ 0.375 TextLedger.account_currency pyledger/ledger_engine.py:438 │ │ │ │ └─ 0.172 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ │ │ [2 frames hidden] pandas │ │ │ ├─ 0.257 Series.new_method pandas/core/ops/common.py:62 │ │ │ └─ 0.241 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ │ [3 frames hidden] pandas │ │ ├─ 0.907 TextLedger.price pyledger/ledger_engine.py:1351 │ │ │ ├─ 0.444 wrapper pyledger/decorators.py:22 │ │ │ │ └─ 0.440 deepcopy copy.py:118 │ │ │ │ [7 frames hidden] copy, pandas │ │ │ ├─ 0.186 DatetimeProperties.f pandas/core/accessor.py:111 │ │ │ │ └─ 0.185 DatetimeProperties._delegate_method pandas/core/indexes/accessors.py:126 │ │ │ └─ 0.148 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ ├─ 0.610 TextLedger.round_to_precision pyledger/ledger_engine.py:1224 │ │ │ └─ 0.607 round_scalar pyledger/ledger_engine.py:1247 │ │ │ └─ 0.597 TextLedger.precision pyledger/ledger_engine.py:1467 │ │ │ └─ 0.281 wrapper pyledger/decorators.py:22 │ │ │ └─ 0.281 deepcopy copy.py:118 │ │ │ [5 frames hidden] copy, pandas │ │ ├─ 0.509 TextLedger.account_currency pyledger/ledger_engine.py:438 │ │ │ └─ 0.252 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ │ [5 frames hidden] pandas │ │ ├─ 0.360 CSVJournalEntity.standardize pyledger/storage_entity.py:210 │ │ │ └─ 0.305 CSVJournalEntity.standardize pyledger/storage_entity.py:51 │ │ │ └─ 0.257 enforce_schema consistent_df/enforce_schema.py:22 │ │ │ └─ 0.257 _enforce_schema consistent_df/enforce_schema.py:91 │ │ └─ 0.234 TextLedger.serialize_ledger pyledger/ledger_engine.py:1160 │ ├─ 1.535 TextLedger.sanitize_revaluations pyledger/ledger_engine.py:1509 │ │ └─ 1.531 validate_account_prices pyledger/ledger_engine.py:1562 │ │ ├─ 0.941 TextLedger.price pyledger/ledger_engine.py:1351 │ │ │ ├─ 0.488 wrapper pyledger/decorators.py:22 │ │ │ │ └─ 0.481 deepcopy copy.py:118 │ │ │ │ [7 frames hidden] copy, pandas │ │ │ ├─ 0.182 DatetimeProperties.f pandas/core/accessor.py:111 │ │ │ │ └─ 0.181 DatetimeProperties._delegate_method pandas/core/indexes/accessors.py:126 │ │ │ └─ 0.155 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ │ └─ 0.142 _LocIndexer._getitem_tuple pandas/core/indexing.py:1365 │ │ └─ 0.541 TextLedger.account_currency pyledger/ledger_engine.py:438 │ │ └─ 0.234 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ [5 frames hidden] pandas │ └─ 0.379 TextLedger.tax_entries pyledger/standalone_ledger.py:60 └─ 1.254 wrapper pyledger/decorators.py:22 └─ 1.253 CSVJournalEntity.list pyledger/storage_entity.py:468 └─ 1.253 CSVJournalEntity._read_data pyledger/storage_entity.py:655 └─ 1.243 CSVJournalEntity._read_data pyledger/storage_entity.py:540 └─ 1.027 CSVJournalEntity.standardize pyledger/storage_entity.py:210 ├─ 0.706 CSVJournalEntity.standardize pyledger/storage_entity.py:51 │ └─ 0.564 enforce_schema consistent_df/enforce_schema.py:22 │ └─ 0.564 _enforce_schema consistent_df/enforce_schema.py:91 │ ├─ 0.149 DataFrame.__setitem__ pandas/core/frame.py:4271 │ └─ 0.146 DataFrame.query pandas/core/frame.py:4680 └─ 0.156 DataFrame.query pandas/core/frame.py:4680 └─ 0.143 DataFrame.eval pandas/core/frame.py:4846 ```
(Redacted) (concise) Chat GPT Summary of profiling output: **Overview** - **Total time:** ~14.16s - **Main bottleneck:** `TextLedger.account_balance()` – **14.15s** - **Key Contributors** - **`serialized_ledger()`** – 14.14s ↳ **`complete_ledger()`** – 12.89s   • `sanitize_journal()` – 7.18s   • `revaluation_entries()` – 3.78s   • `sanitize_revaluations()` – 1.54s **Main Slow Operations** - **Heavy use of `deepcopy`** inside `precision()`, `price()`, `account_currency()` - **Frequent `DataFrame.apply()` and `.query()`** for validation logic **Optimization Suggestions** - Avoid or reduce `deepcopy` - Vectorize `pandas` operations - Cache repeat lookups (e.g., currency, precision)

Running kernprof -l -v profiling.py in commit 06a32c5:

Console output: ``` Total time: 0.637288 s File: /Users/lukas/macx/pyledger/pyledger/standalone_ledger.py Function: tax_entries at line 60 Line # Hits Time Per Hit % Time Line Contents ============================================================== 60 @profile 61 def tax_entries(self, df: pd.DataFrame) -> pd.DataFrame: 62 """Create journal entries to book tax according to tax_codes. 63 64 Iterates through the provided DataFrame and calculates tax for entries 65 that have a non-null tax_code. It generates a new journal entry for 66 each tax account. 67 68 Args: 69 df (pd.DataFrame): A pandas DataFrame containing journal entries. 70 71 Returns: 72 pd.DataFrame: A new DataFrame with tax journal entries. 73 Returns empty DataFrame with the correct structure if no tax codes are present. 74 """ 75 1 2016.0 2016.0 0.3 tax_definitions = self.tax_codes.list().set_index("id").to_dict("index") 76 1 0.0 0.0 0.0 tax_journal_entries = [] 77 1 112.0 112.0 0.0 accounts = self.accounts.list() 78 325 18174.0 55.9 2.9 for _, row in df.loc[df["tax_code"].notna()].iterrows(): 79 324 1899.0 5.9 0.3 tax = tax_definitions[row["tax_code"]] 80 324 20.0 0.1 0.0 account_tax_code = ( 81 966 66740.0 69.1 10.5 accounts.loc[ 82 322 31515.0 97.9 4.9 accounts["account"] == row["account"], "tax_code" 83 646 2956.0 4.6 0.5 ].values[0] if pd.notna(row["account"]) else None 84 ) 85 324 37.0 0.1 0.0 contra_tax_code = ( 86 966 64819.0 67.1 10.2 accounts.loc[ 87 322 31495.0 97.8 4.9 accounts["account"] == row["contra"], "tax_code" 88 646 3220.0 5.0 0.5 ].values[0] if pd.notna(row["contra"]) else None 89 ) 90 324 504.0 1.6 0.1 if pd.isna(account_tax_code) and pd.isna(contra_tax_code): 91 self._logger.warning( 92 f"Skip tax code '{row['tax_code']}' for {row['id']}: Neither account nor " 93 f"counter account have a tax_code." 94 ) 95 324 487.0 1.5 0.1 elif pd.isna(account_tax_code) and pd.notna(contra_tax_code): 96 182 13.0 0.1 0.0 multiplier = 1.0 97 215 984.0 4.6 0.2 account = (row["contra"] if tax["is_inclusive"] else row["account"]) 98 142 286.0 2.0 0.0 elif pd.notna(account_tax_code) and pd.isna(contra_tax_code): 99 142 19.0 0.1 0.0 multiplier = -1.0 100 145 782.0 5.4 0.1 account = (row["account"] if tax["is_inclusive"] else row["contra"]) 101 else: 102 self._logger.warning( 103 f"Skip tax code '{row['tax_code']}' for {row['id']}: Both account and " 104 f"counter accounts have tax_code." 105 ) 106 107 # Calculate tax amount 108 324 50.0 0.2 0.0 if tax["is_inclusive"]: 109 286 1321.0 4.6 0.2 amount = row["amount"] * tax["rate"] / (1 + tax["rate"]) 110 else: 111 38 176.0 4.6 0.0 amount = row["amount"] * tax["rate"] 112 324 49.0 0.2 0.0 amount = amount * multiplier 113 324 206202.0 636.4 32.4 amount = self.round_to_precision(amount, row["currency"]) 114 115 # Create a new journal entry for the tax amount 116 324 68.0 0.2 0.0 if amount != 0: 117 293 72.0 0.2 0.0 base_entry = { 118 293 1555.0 5.3 0.2 "date": row["date"], 119 293 1277.0 4.4 0.2 "description": "TAX: " + row["description"], 120 293 25.0 0.1 0.0 "account": account, 121 293 1210.0 4.1 0.2 "document": row["document"], 122 293 1191.0 4.1 0.2 "currency": row["currency"], 123 293 57.0 0.2 0.0 "report_amount": np.nan, 124 293 1169.0 4.0 0.2 "tax_code": row["tax_code"], 125 293 1172.0 4.0 0.2 "profit_center": row["profit_center"] 126 } 127 293 505.0 1.7 0.1 if pd.notna(tax["account"]): 128 586 387.0 0.7 0.1 tax_journal_entries.append(base_entry | { 129 293 1226.0 4.2 0.2 "id": f"{row['id']}:tax", 130 293 32.0 0.1 0.0 "contra": tax["account"], 131 293 18.0 0.1 0.0 "amount": amount, 132 879 182829.0 208.0 28.7 "report_amount": self.report_amount( 133 293 2325.0 7.9 0.4 amount=[amount], currency=[row["currency"]], date=[row["date"]] 134 293 35.0 0.1 0.0 )[0] 135 }) 136 293 557.0 1.9 0.1 if pd.notna(tax["contra"]): 137 amount = -1 * amount 138 tax_journal_entries.append(base_entry | { 139 "id": f"{row['id']}:tax", 140 "contra": tax["contra"], 141 "amount": amount, 142 "report_amount": self.report_amount( 143 amount=[amount], currency=[row["currency"]], date=[row["date"]] 144 )[0] 145 }) 146 1 7700.0 7700.0 1.2 result = enforce_schema(pd.DataFrame(tax_journal_entries), JOURNAL_SCHEMA) 147 148 1 2.0 2.0 0.0 return result Total time: 23.5822 s File: /Users/lukas/macx/pyledger/pyledger/standalone_ledger.py Function: _single_account_balance at line 153 Line # Hits Time Per Hit % Time Line Contents ============================================================== 153 @profile 154 def _single_account_balance( 155 self, account: int, profit_centers: list[str] | str = None, 156 start: datetime.date = None, end: datetime.date = None, 157 ) -> dict: 158 2 1846.0 923.0 0.0 return self._balance_from_serialized_ledger( 159 1 23580340.0 2e+07 100.0 self.serialized_ledger(), account=account, profit_centers=profit_centers, 160 1 1.0 1.0 0.0 start=start, end=end, 161 ) Total time: 21.4323 s File: /Users/lukas/macx/pyledger/pyledger/standalone_ledger.py Function: complete_ledger at line 175 Line # Hits Time Per Hit % Time Line Contents ============================================================== 175 @profile 176 def complete_ledger(self, journal=None) -> pd.DataFrame: 177 # Journal definition 178 1 12043.0 12043.0 0.1 df = self.journal.standardize(journal) 179 1 12144548.0 1e+07 56.7 df = self.sanitize_journal(df) 180 1 3267.0 3267.0 0.0 df = df.sort_values(["date", "id"]) 181 182 # Add ledger entries for tax 183 1 641387.0 641387.0 3.0 df = pd.concat([df, self.tax_entries(df)], ignore_index=True) 184 185 # Add ledger entries for (currency or other) revaluations 186 1 2464459.0 2e+06 11.5 revaluations = self.sanitize_revaluations(self.revaluations.list()) 187 1 6164663.0 6e+06 28.8 revalue = self.revaluation_entries(ledger=df, revaluations=revaluations) 188 1 1966.0 1966.0 0.0 return pd.concat([df, revalue], ignore_index=True) Total time: 6.15938 s File: /Users/lukas/macx/pyledger/pyledger/standalone_ledger.py Function: revaluation_entries at line 190 Line # Hits Time Per Hit % Time Line Contents ============================================================== 190 @profile 191 def revaluation_entries( 192 self, ledger: pd.DataFrame, revaluations: pd.DataFrame 193 ) -> pd.DataFrame: 194 """Compute ledger entries for (currency or other) revaluations""" 195 1 1.0 1.0 0.0 result = [] 196 1 10.0 10.0 0.0 reporting_currency = self.reporting_currency 197 49 968.0 19.8 0.0 for row in revaluations.to_dict("records"): 198 48 535592.0 11158.2 8.7 revalue = self.journal.standardize(pd.DataFrame(result)) 199 48 390267.0 8130.6 6.3 df = self.serialize_ledger(pd.concat([ledger, revalue])) 200 48 15.0 0.3 0.0 date = row["date"] 201 48 37582.0 783.0 0.6 accounts = self.parse_account_range(row["account"]) 202 48 249.0 5.2 0.0 accounts = set(accounts["add"]) - set(accounts["subtract"]) 203 204 2040 527.0 0.3 0.0 for account in accounts: 205 1992 913192.0 458.4 14.8 currency = self.account_currency(account) 206 1992 433.0 0.2 0.0 if currency != reporting_currency: 207 1248 1697116.0 1359.9 27.6 balance = self._balance_from_serialized_ledger(df, account, end=date) 208 1248 1542556.0 1236.0 25.0 fx_rate = self.price(currency, date=date, currency=reporting_currency) 209 1248 251.0 0.2 0.0 if fx_rate[0] != reporting_currency: 210 raise ValueError( 211 f"FX rate currency mismatch: expected {reporting_currency}, got " 212 f"{fx_rate[0]}" 213 ) 214 1248 317.0 0.3 0.0 target = balance[currency] * fx_rate[1] 215 1248 235.0 0.2 0.0 amount = target - balance["reporting_currency"] 216 1248 1019673.0 817.0 16.6 amount = self.round_to_precision(amount, ticker=reporting_currency, date=date) 217 1248 4366.0 3.5 0.1 id = f"revaluation:{date}:{account}" 218 1248 362.0 0.3 0.0 if amount != 0: 219 474 639.0 1.3 0.0 if pd.isna(row["credit"]) and pd.isna(row["debit"]): 220 raise ValueError("No account specified in revaluation entry {id}") 221 474 378.0 0.8 0.0 elif pd.isna(row["credit"]): 222 revaluation_account = row["debit"] 223 474 357.0 0.8 0.0 elif pd.isna(row["debit"]): 224 revaluation_account = row["credit"] 225 else: 226 474 133.0 0.3 0.0 revaluation_account = row["credit"] if amount >= 0 else row["debit"] 227 948 408.0 0.4 0.0 result.append({ 228 474 44.0 0.1 0.0 "id": id, 229 474 51.0 0.1 0.0 "date": date, 230 474 41.0 0.1 0.0 "account": account, 231 474 31.0 0.1 0.0 "currency": currency, 232 474 38.0 0.1 0.0 "amount": 0, 233 474 41.0 0.1 0.0 "report_amount": amount, 234 474 73.0 0.2 0.0 "description": row["description"] 235 }) 236 948 416.0 0.4 0.0 result.append({ 237 474 28.0 0.1 0.0 "id": id, 238 474 27.0 0.1 0.0 "date": date, 239 474 31.0 0.1 0.0 "account": revaluation_account, 240 474 38.0 0.1 0.0 "currency": reporting_currency, 241 474 113.0 0.2 0.0 "amount": -1 * amount, 242 474 47.0 0.1 0.0 "report_amount": -1 * amount, 243 474 44.0 0.1 0.0 "description": row["description"] 244 }) 245 246 1 12689.0 12689.0 0.2 return self.journal.standardize(pd.DataFrame(result)) Total time: 1.68925 s File: /Users/lukas/macx/pyledger/pyledger/standalone_ledger.py Function: _balance_from_serialized_ledger at line 248 Line # Hits Time Per Hit % Time Line Contents ============================================================== 248 @profile 249 def _balance_from_serialized_ledger( 250 self, ledger: pd.DataFrame, account: int, profit_centers: list[str] | str = None, 251 start: datetime.date = None, end: datetime.date = None, 252 ) -> dict: 253 """Compute balance from serialized ledger. 254 255 Args: 256 ledger (DataFrame): General ledger in long format following JOURNAL_SCHEMA. 257 account (int): The account number. 258 date (datetime.date, optional): The date up to which the balance is computed. 259 Defaults to None. 260 profit_centers: (list[str], str): Filter for ledger entries. If not None, the result is 261 calculated only from ledger entries assigned to one 262 of the profit centers in the filter. 263 264 Returns: 265 dict: Dictionary containing the balance of the account in various currencies. 266 """ 267 1249 171570.0 137.4 10.2 rows = ledger["account"] == int(account) 268 1249 214.0 0.2 0.0 if profit_centers is not None: 269 if isinstance(profit_centers, str): 270 profit_centers = [profit_centers] 271 valid_profit_centers = set(self.profit_centers.list()["profit_center"]) 272 invalid_profit_centers = set(profit_centers) - valid_profit_centers 273 if invalid_profit_centers: 274 raise ValueError( 275 f"Profit centers: {', '.join(invalid_profit_centers)} do not exist." 276 ) 277 rows = rows & (ledger["profit_center"].isin(profit_centers)) 278 1249 169.0 0.1 0.0 if start is not None: 279 1 259.0 259.0 0.0 rows = rows & (ledger["date"] >= pd.Timestamp(start)) 280 1249 181.0 0.1 0.0 if end is not None: 281 1249 301068.0 241.0 17.8 rows = rows & (ledger["date"] <= pd.Timestamp(end)) 282 1249 229.0 0.2 0.0 cols = ["amount", "report_amount", "currency"] 283 1249 43158.0 34.6 2.6 if rows.sum() == 0: 284 577 95.0 0.2 0.0 result = {"reporting_currency": 0.0} 285 577 262167.0 454.4 15.5 currency = self.account_currency(account) 286 577 99.0 0.2 0.0 if currency is not None: 287 577 107.0 0.2 0.0 result[currency] = 0.0 288 else: 289 672 393878.0 586.1 23.3 sub = ledger.loc[rows, cols] 290 672 43484.0 64.7 2.6 report_amount = sub["report_amount"].sum() 291 672 326959.0 486.5 19.4 account_currency = self.account_currency(account) 292 672 971.0 1.4 0.1 if pd.isna(account_currency): 293 amount = sub.groupby("currency").agg({"amount": "sum"}) 294 amount = {currency: amount 295 for currency, amount in zip(amount.index, amount["amount"])} 296 672 8031.0 12.0 0.5 elif account_currency == self.reporting_currency: 297 1 9.0 9.0 0.0 amount = {self.reporting_currency: report_amount} 298 671 100292.0 149.5 5.9 elif not all(sub["currency"] == account_currency): 299 raise ValueError(f"Unexpected currencies in transactions for account {account}.") 300 else: 301 671 34908.0 52.0 2.1 amount = {account_currency: sub["amount"].sum()} 302 672 218.0 0.3 0.0 result = {"reporting_currency": report_amount} | amount 303 1249 1179.0 0.9 0.1 return result ```
Comment by Chat GPT: Here’s a focused and detailed summary of your latest `kernprof` profiling output, organized by function with key hotspots and potential for optimization: ## 🔍 **Key Function Summaries** ### 1. **`_single_account_balance()`** - **Total time:** **23.6s** (100% of function time is in `self.serialized_ledger()`) - This is just a wrapper around: ```python return self._balance_from_serialized_ledger(self.serialized_ledger(), ...) ``` > **Main insight:** Performance depends entirely on `serialized_ledger()` and its components. ### 2. **`complete_ledger()`** - **Total time:** **21.4s** - **Main cost centers:** - `sanitize_journal()` – **12.1s** (56.7%) - `revaluation_entries()` – **6.2s** (28.8%) - `sanitize_revaluations()` – **2.5s** (11.5%) - `tax_entries()` – **0.6s** > **Note:** Revaluations and journal sanitization dominate cost. ### 3. **`revaluation_entries()`** - **Total time:** **6.2s** - **Heavyweights:** - `_balance_from_serialized_ledger()` – **27.6%** - `price()` – **25%** - `round_to_precision()` – **16.6%** - Several `standardize()` and `serialize_ledger()` calls (~15%) > **Hotspot:** Called **1248x**, which amplifies even small inefficiencies. ### 4. **`tax_entries()`** - **Total time:** **0.64s** - **Main cost:** `round_to_precision()` (**32.4%**) and `report_amount()` (**28.7%**) > **Efficient overall**, but still suffers from slow rounding and reporting calls. ### 5. **`_balance_from_serialized_ledger()`** - **Total time:** **1.7s** - **Expensive lines:** - `ledger.loc[rows, cols]` – **23.3%** - `self.account_currency(account)` – **15.5–19.4%** - Date filtering – **17.8%** > **Observation:** Pandas filtering and currency lookup dominate. ## 🔧 **Key Performance Bottlenecks** | Area | Notes | |-------------------------------|-------| | **`self.account_currency`** | Appears frequently and is expensive (~15–19% per call block). | | **`round_to_precision()`** | Used across functions; heavy due to possible `deepcopy`. | | **`price()`** | Called repeatedly in revaluations; possibly unoptimized. | | **Pandas filtering & `.loc`** | Present in most logic-heavy sections; slows `balance` & `tax` methods. | | **Repetitive standardization** | Redundant calls in loops (e.g. `self.journal.standardize` per iteration) add overhead. | ## 🛠️ **Optimization Opportunities** 1. **Memoize `account_currency()`, `price()`, `precision()`** - These are repeatedly accessed with the same input. 2. **Vectorize `tax_entries()` logic** - Currently iterating through rows with `df.iterrows()`. 3. **Avoid repeated `standardize()` / `serialize_ledger()` in loops** - Preprocess once or lazily evaluate if possible. 4. **Reduce `round_to_precision()` overhead** - Check if `deepcopy` is strictly necessary.
lasuk commented 2 weeks ago

Profiling sanitize_journal in commit 06a32c5 on my MacBook Air M2:

content of `profiling.py` ```python from pyinstrument import Profiler from pyledger import TextLedger engine = TextLedger("~/macx/accounts") df = engine.journal.list() profiler = Profiler() profiler.start() engine.sanitize_journal(df) profiler.stop() print(profiler.output_text(unicode=True, color=True)) ```
Profiling output ``` _ ._ __/__ _ _ _ _ _/_ Recorded: 07:44:33 Samples: 6817 /_//_/// /_\ / //_// / //_'/ // Duration: 6.897 CPU time: 6.861 / _/ v5.0.1 Profile at /Users/lukas/macx/pyledger/profiling.py:8 6.897 profiling.py:1 └─ 6.897 TextLedger.sanitize_journal pyledger/ledger_engine.py:850 ├─ 2.085 TextLedger._fill_report_amounts pyledger/ledger_engine.py:1068 │ └─ 2.072 TextLedger.report_amount pyledger/ledger_engine.py:1274 │ ├─ 1.116 TextLedger.round_to_precision pyledger/ledger_engine.py:1224 │ │ └─ 1.110 round_scalar pyledger/ledger_engine.py:1247 │ │ └─ 1.094 TextLedger.precision pyledger/ledger_engine.py:1467 │ │ ├─ 0.403 wrapper pyledger/decorators.py:22 │ │ │ └─ 0.400 deepcopy copy.py:118 │ │ │ [8 frames hidden] copy, pandas │ │ ├─ 0.250 Series.new_method pandas/core/ops/common.py:62 │ │ │ [5 frames hidden] pandas │ │ ├─ 0.127 Series.__getitem__ pandas/core/series.py:1095 │ │ ├─ 0.104 DataFrame.__getitem__ pandas/core/frame.py:4062 │ │ │ [2 frames hidden] pandas │ │ ├─ 0.098 Series.isna pandas/core/series.py:5773 │ │ │ [3 frames hidden] pandas │ │ └─ 0.072 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ └─ 0.944 TextLedger.price pyledger/ledger_engine.py:1351 │ ├─ 0.456 wrapper pyledger/decorators.py:22 │ │ └─ 0.448 deepcopy copy.py:118 │ │ [9 frames hidden] copy, pandas │ ├─ 0.220 DatetimeProperties.f pandas/core/accessor.py:111 │ │ [6 frames hidden] pandas │ └─ 0.123 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ [2 frames hidden] pandas ├─ 1.946 TextLedger._invalid_currency pyledger/ledger_engine.py:973 │ └─ 1.946 DataFrame.apply pandas/core/frame.py:10170 │ [2 frames hidden] pandas │ 1.945 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ └─ 1.917 is_invalid pyledger/ledger_engine.py:975 │ └─ 1.770 TextLedger.account_currency pyledger/ledger_engine.py:438 │ ├─ 0.760 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ [19 frames hidden] pandas │ ├─ 0.442 wrapper pyledger/decorators.py:22 │ │ └─ 0.430 deepcopy copy.py:118 │ │ [8 frames hidden] pandas │ ├─ 0.294 Series.new_method pandas/core/ops/common.py:62 │ │ [4 frames hidden] pandas │ └─ 0.171 DataFrame.__getitem__ pandas/core/frame.py:4062 │ [3 frames hidden] pandas ├─ 1.427 TextLedger._invalid_assets pyledger/ledger_engine.py:951 │ └─ 1.427 DataFrame.apply pandas/core/frame.py:10170 │ [2 frames hidden] pandas │ 1.426 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ └─ 1.406 is_invalid pyledger/ledger_engine.py:953 │ └─ 1.370 TextLedger.precision pyledger/ledger_engine.py:1467 │ ├─ 0.461 wrapper pyledger/decorators.py:22 │ │ └─ 0.454 deepcopy copy.py:118 │ │ [8 frames hidden] copy, pandas │ ├─ 0.350 Series.new_method pandas/core/ops/common.py:62 │ │ [9 frames hidden] pandas │ ├─ 0.177 Series.__getitem__ pandas/core/series.py:1095 │ │ [2 frames hidden] pandas │ ├─ 0.125 Series.isna pandas/core/series.py:5773 │ │ [4 frames hidden] pandas │ ├─ 0.120 DataFrame.__getitem__ pandas/core/frame.py:4062 │ │ [2 frames hidden] pandas │ └─ 0.083 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 ├─ 1.073 TextLedger._invalid_prices pyledger/ledger_engine.py:1003 │ └─ 1.073 DataFrame.apply pandas/core/frame.py:10170 │ [2 frames hidden] pandas │ 1.072 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ └─ 1.059 is_invalid pyledger/ledger_engine.py:1005 │ └─ 1.007 TextLedger.price pyledger/ledger_engine.py:1351 │ ├─ 0.561 wrapper pyledger/decorators.py:22 │ │ ├─ 0.465 deepcopy copy.py:118 │ │ │ [9 frames hidden] copy, pandas │ │ └─ 0.093 TextLedger._prices_as_dict_of_df pyledger/ledger_engine.py:1396 │ │ └─ 0.086 TextLedger.sanitize_prices pyledger/ledger_engine.py:1305 │ │ └─ 0.084 invalid_asset_reference pyledger/ledger_engine.py:1321 │ │ └─ 0.084 is_valid pyledger/ledger_engine.py:1323 │ │ └─ 0.084 TextLedger.precision pyledger/ledger_engine.py:1467 │ ├─ 0.189 DatetimeProperties.f pandas/core/accessor.py:111 │ │ [8 frames hidden] pandas │ ├─ 0.136 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ [5 frames hidden] pandas │ └─ 0.074 Series.new_method pandas/core/ops/common.py:62 │ [2 frames hidden] pandas ├─ 0.193 TextLedger._invalid_tax_codes pyledger/ledger_engine.py:907 │ └─ 0.189 TextLedger.sanitized_accounts_tax_codes pyledger/ledger_engine.py:411 │ └─ 0.179 TextLedger.sanitize_accounts pyledger/ledger_engine.py:357 │ └─ 0.174 validate_currency pyledger/ledger_engine.py:375 │ └─ 0.174 DataFrame.apply pandas/core/frame.py:10170 │ [2 frames hidden] pandas │ 0.174 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ └─ 0.173 is_valid pyledger/ledger_engine.py:377 │ └─ 0.172 TextLedger.precision pyledger/ledger_engine.py:1467 │ └─ 0.069 wrapper pyledger/decorators.py:22 └─ 0.161 TextLedger._invalid_accounts pyledger/ledger_engine.py:919 └─ 0.160 TextLedger.sanitized_accounts_tax_codes pyledger/ledger_engine.py:411 └─ 0.157 TextLedger.sanitize_accounts pyledger/ledger_engine.py:357 └─ 0.152 validate_currency pyledger/ledger_engine.py:375 └─ 0.152 DataFrame.apply pandas/core/frame.py:10170 [2 frames hidden] pandas 0.152 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 └─ 0.151 is_valid pyledger/ledger_engine.py:377 └─ 0.149 TextLedger.precision pyledger/ledger_engine.py:1467 ```
(Concise) Chat GPT summary: Here’s a **concise structured summary** of your profiling output: # **Overall Summary** - **Total duration:** 6.897 seconds - **CPU time:** 6.861 seconds - **Samples collected:** 6817 - **Primary execution:** Entire time is spent in `TextLedger.sanitize_journal`. # **Key Performance Bottlenecks** The `TextLedger.sanitize_journal` method dominates the runtime, and its major subcomponents are: | Subcomponent | Time (seconds) | Notes | |:-------------|:---------------|:------| | `TextLedger._fill_report_amounts` | 2.085 | Mostly time spent in rounding precision and fetching prices. | | `TextLedger._invalid_currency` | 1.946 | Heavy usage of `DataFrame.apply` and deep copying operations. | | `TextLedger._invalid_assets` | 1.427 | Again bottlenecked by DataFrame operations and precision checks. | | `TextLedger._invalid_prices` | 1.073 | Significant time spent fetching and validating prices. | | `TextLedger._invalid_tax_codes` | 0.193 | Smaller contributor but still notable. | | `TextLedger._invalid_accounts` | 0.161 | Smallest among major contributors. | # **Detailed Observations** - **Deepcopy overhead:** Deep copying (`copy.deepcopy`) inside decorated functions (`wrapper` at `decorators.py:22`) consumes substantial time across different methods, particularly in precision fetching and price retrieval. - **Pandas overhead:** Operations on Pandas objects (`DataFrame.apply`, `Series.__getitem__`, `Series.new_method`, `isna`, and indexing) are significant hotspots, especially in: - Checking invalid currencies - Validating assets - Fetching and checking prices - **Precision and Price Checks:** Calls to `TextLedger.precision` and `TextLedger.price` are heavily nested and repeatedly triggered, and each involves expensive operations. # **Quick High-Level Optimization Ideas** - **Avoid unnecessary deep copies** — review where deepcopy could be replaced or eliminated. - **Batch operations on DataFrames** instead of applying row-by-row (`apply`) where possible. - **Memoize expensive computations** like `precision()` and `price()` if their results are reusable. - **Optimize decorated functions** — the wrapper overhead indicates possible inefficiencies.
AlexTheWizardL commented 2 weeks ago

I've updated the branch with new sections that are now analyzed line-by-line at commit 01e11d0e0fe4b63e002445446e07c476ccaf6346.

The command to run profiling remains the same:

kernprof -l -v profiling.py

I've posted my profiling results below for comparison:

Profiling output ```terminal (.dev) oleksandrstepanenko@Alex-S pyledger % kernprof -l -v profiling.py _ ._ __/__ _ _ _ _ _/_ Recorded: 18:48:28 Samples: 1941 /_//_/// /_\ / //_// / //_'/ // Duration: 2.101 CPU time: 1.782 / _/ v5.0.1 Profile at profiling.py:8 2.101 profiling.py:1 └─ 2.101 wrapper line_profiler/line_profiler.py:146 └─ 2.101 TextLedger.sanitize_journal pyledger/ledger_engine.py:852 ├─ 1.209 wrapper line_profiler/line_profiler.py:146 │ ├─ 0.318 TextLedger._invalid_tax_codes pyledger/ledger_engine.py:910 │ │ └─ 0.314 wrapper line_profiler/line_profiler.py:146 │ │ └─ 0.314 TextLedger.sanitized_accounts_tax_codes pyledger/ledger_engine.py:412 │ │ ├─ 0.241 wrapper line_profiler/line_profiler.py:146 │ │ │ └─ 0.240 TextLedger.sanitize_accounts pyledger/ledger_engine.py:357 │ │ │ ├─ 0.205 validate_currency pyledger/ledger_engine.py:376 │ │ │ │ └─ 0.205 DataFrame.apply pandas/core/frame.py:10170 │ │ │ │ [2 frames hidden] pandas │ │ │ │ 0.204 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ │ │ │ └─ 0.198 is_valid pyledger/ledger_engine.py:378 │ │ │ │ └─ 0.198 wrapper line_profiler/line_profiler.py:146 │ │ │ │ └─ 0.198 TextLedger.precision pyledger/ledger_engine.py:1475 │ │ │ │ ├─ 0.130 wrapper pyledger/decorators.py:22 │ │ │ │ │ ├─ 0.069 wrapper line_profiler/line_profiler.py:146 │ │ │ │ │ │ └─ 0.069 TextLedger._assets_as_dict_of_df pyledger/ledger_engine.py:1453 │ │ │ │ │ │ └─ 0.026 wrapper pyledger/decorators.py:22 │ │ │ │ │ │ └─ 0.025 CSVAccountingEntity.list pyledger/storage_entity.py:468 │ │ │ │ │ │ └─ 0.025 CSVAccountingEntity._read_data pyledger/storage_entity.py:493 │ │ │ │ │ └─ 0.060 deepcopy copy.py:119 │ │ │ │ │ [5 frames hidden] copy, pandas │ │ │ │ └─ 0.027 Series.new_method pandas/core/ops/common.py:62 │ │ │ └─ 0.031 enforce_schema consistent_df/enforce_schema.py:22 │ │ │ └─ 0.031 _enforce_schema consistent_df/enforce_schema.py:91 │ │ ├─ 0.050 wrapper pyledger/decorators.py:22 │ │ │ └─ 0.049 CSVAccountingEntity.list pyledger/storage_entity.py:468 │ │ │ └─ 0.049 CSVAccountingEntity._read_data pyledger/storage_entity.py:493 │ │ │ └─ 0.038 CSVAccountingEntity.standardize pyledger/storage_entity.py:51 │ │ │ └─ 0.034 enforce_schema consistent_df/enforce_schema.py:22 │ │ │ └─ 0.034 _enforce_schema consistent_df/enforce_schema.py:91 │ │ └─ 0.023 TextLedger.sanitize_tax_codes pyledger/ledger_engine.py:296 │ ├─ 0.316 TextLedger._invalid_assets pyledger/ledger_engine.py:956 │ │ └─ 0.314 DataFrame.apply pandas/core/frame.py:10170 │ │ [2 frames hidden] pandas │ │ 0.313 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ │ └─ 0.309 is_invalid pyledger/ledger_engine.py:959 │ │ └─ 0.307 wrapper line_profiler/line_profiler.py:146 │ │ └─ 0.304 TextLedger.precision pyledger/ledger_engine.py:1475 │ │ ├─ 0.146 wrapper pyledger/decorators.py:22 │ │ │ └─ 0.146 deepcopy copy.py:119 │ │ │ [10 frames hidden] copy, pandas │ │ ├─ 0.060 Series.new_method pandas/core/ops/common.py:62 │ │ │ [4 frames hidden] pandas │ │ ├─ 0.028 Series.isna pandas/core/series.py:5773 │ │ │ [3 frames hidden] pandas │ │ └─ 0.026 Series.__getitem__ pandas/core/series.py:1095 │ ├─ 0.308 TextLedger._invalid_accounts pyledger/ledger_engine.py:923 │ │ └─ 0.296 wrapper line_profiler/line_profiler.py:146 │ │ └─ 0.296 TextLedger.sanitized_accounts_tax_codes pyledger/ledger_engine.py:412 │ │ ├─ 0.267 wrapper line_profiler/line_profiler.py:146 │ │ │ └─ 0.267 TextLedger.sanitize_accounts pyledger/ledger_engine.py:357 │ │ │ ├─ 0.210 validate_currency pyledger/ledger_engine.py:376 │ │ │ │ └─ 0.210 DataFrame.apply pandas/core/frame.py:10170 │ │ │ │ [2 frames hidden] pandas │ │ │ │ 0.207 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ │ │ │ └─ 0.202 is_valid pyledger/ledger_engine.py:378 │ │ │ │ └─ 0.198 wrapper line_profiler/line_profiler.py:146 │ │ │ │ └─ 0.196 TextLedger.precision pyledger/ledger_engine.py:1475 │ │ │ │ ├─ 0.097 wrapper pyledger/decorators.py:22 │ │ │ │ │ └─ 0.096 deepcopy copy.py:119 │ │ │ │ │ [6 frames hidden] copy, pandas │ │ │ │ └─ 0.036 Series.new_method pandas/core/ops/common.py:62 │ │ │ │ [2 frames hidden] pandas │ │ │ └─ 0.049 enforce_schema consistent_df/enforce_schema.py:22 │ │ │ [3 frames hidden] consistent_df, pandas │ │ └─ 0.028 TextLedger.sanitize_tax_codes pyledger/ledger_engine.py:296 │ └─ 0.267 TextLedger._invalid_currency pyledger/ledger_engine.py:979 │ └─ 0.265 DataFrame.apply pandas/core/frame.py:10170 │ [2 frames hidden] pandas │ 0.264 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ └─ 0.257 is_invalid pyledger/ledger_engine.py:982 │ └─ 0.234 TextLedger.account_currency pyledger/ledger_engine.py:440 │ ├─ 0.112 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ [11 frames hidden] pandas │ ├─ 0.045 wrapper pyledger/decorators.py:22 │ │ └─ 0.044 deepcopy copy.py:119 │ │ [5 frames hidden] pandas │ ├─ 0.039 Series.new_method pandas/core/ops/common.py:62 │ │ [2 frames hidden] pandas │ └─ 0.027 DataFrame.__getitem__ pandas/core/frame.py:4062 │ [2 frames hidden] pandas ├─ 0.434 TextLedger._fill_report_amounts pyledger/ledger_engine.py:1075 │ ├─ 0.338 TextLedger.report_amount pyledger/ledger_engine.py:1281 │ │ ├─ 0.233 TextLedger.round_to_precision pyledger/ledger_engine.py:1231 │ │ │ └─ 0.232 round_scalar pyledger/ledger_engine.py:1254 │ │ │ └─ 0.231 wrapper line_profiler/line_profiler.py:146 │ │ │ └─ 0.229 TextLedger.precision pyledger/ledger_engine.py:1475 │ │ │ ├─ 0.122 wrapper pyledger/decorators.py:22 │ │ │ │ └─ 0.122 deepcopy copy.py:119 │ │ │ │ [6 frames hidden] copy, pandas │ │ │ └─ 0.040 Series.new_method pandas/core/ops/common.py:62 │ │ │ [2 frames hidden] pandas │ │ └─ 0.104 TextLedger.price pyledger/ledger_engine.py:1358 │ │ └─ 0.059 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ [5 frames hidden] pandas │ └─ 0.052 DataFrameGroupBy.aggregate pandas/core/groupby/generic.py:1420 │ [6 frames hidden] pandas ├─ 0.359 TextLedger._invalid_prices pyledger/ledger_engine.py:1010 │ └─ 0.357 DataFrame.apply pandas/core/frame.py:10170 │ [2 frames hidden] pandas │ 0.356 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ └─ 0.350 is_invalid pyledger/ledger_engine.py:1012 │ └─ 0.340 TextLedger.price pyledger/ledger_engine.py:1358 │ └─ 0.303 wrapper pyledger/decorators.py:22 │ └─ 0.289 TextLedger._prices_as_dict_of_df pyledger/ledger_engine.py:1403 │ ├─ 0.173 TextLedger.sanitize_prices pyledger/ledger_engine.py:1312 │ │ ├─ 0.139 invalid_asset_reference pyledger/ledger_engine.py:1328 │ │ │ └─ 0.135 is_valid pyledger/ledger_engine.py:1330 │ │ │ └─ 0.135 wrapper line_profiler/line_profiler.py:146 │ │ │ └─ 0.134 TextLedger.precision pyledger/ledger_engine.py:1475 │ │ │ ├─ 0.066 wrapper pyledger/decorators.py:22 │ │ │ │ └─ 0.065 deepcopy copy.py:119 │ │ │ │ [5 frames hidden] copy, pandas │ │ │ └─ 0.026 Series.new_method pandas/core/ops/common.py:62 │ │ └─ 0.032 enforce_schema consistent_df/enforce_schema.py:22 │ │ └─ 0.032 _enforce_schema consistent_df/enforce_schema.py:91 │ └─ 0.097 wrapper pyledger/decorators.py:22 │ └─ 0.096 CSVAccountingEntity.list pyledger/storage_entity.py:468 │ └─ 0.096 CSVAccountingEntity._read_data pyledger/storage_entity.py:493 │ └─ 0.081 CSVAccountingEntity.standardize pyledger/storage_entity.py:51 │ └─ 0.077 enforce_schema consistent_df/enforce_schema.py:22 │ [8 frames hidden] consistent_df, pandas ├─ 0.031 TextLedger._invalid_profit_centers pyledger/ledger_engine.py:1036 │ └─ 0.027 wrapper pyledger/decorators.py:22 │ └─ 0.027 CSVAccountingEntity.list pyledger/storage_entity.py:468 │ └─ 0.027 CSVAccountingEntity._read_data pyledger/storage_entity.py:493 └─ 0.021 enforce_schema consistent_df/enforce_schema.py:22 └─ 0.021 _enforce_schema consistent_df/enforce_schema.py:91 Wrote profile results to profiling.py.lprof Timer unit: 1e-06 s Total time: 0.505745 s File: /Users/oleksandrstepanenko/macx/pyledger/pyledger/ledger_engine.py Function: sanitize_accounts at line 357 Line # Hits Time Per Hit % Time Line Contents ============================================================== 357 @profile 358 def sanitize_accounts(self, df: pd.DataFrame, tax_codes: pd.DataFrame = None) -> pd.DataFrame: 359 """Discard incoherent account data. 360 361 Discard accounts with invalid currencies, and set invalid tax code references 362 to NA. Removed or modified entries are logged as warnings. 363 364 If no tax_codes DataFrame is provided, it will be fetched via the `tax_codes.list()` method, 365 as the validity of accounts may depend on referenced tax codes. 366 367 Args: 368 df (pd.DataFrame): The raw accounts DataFrame to validate. 369 tax_codes (pd.DataFrame, optional): Tax codes DataFrame for reference validation. 370 371 Returns: 372 pd.DataFrame: The sanitized DataFrame with valid account entries. 373 """ 374 4 79292.0 19823.0 15.7 df = enforce_schema(df, ACCOUNT_SCHEMA, keep_extra_columns=True) 375 376 4 14.0 3.5 0.0 def validate_currency() -> pd.Series: 377 """Validate that each account's currency is supported using the precision() method.""" 378 def is_valid(row): 379 try: 380 self.precision(ticker=row["currency"]) 381 return True 382 except ValueError: 383 return False 384 return df.apply(is_valid, axis=1) 385 386 # Validate currencies 387 4 414942.0 103735.5 82.0 valid_currency_mask = validate_currency() 388 4 377.0 94.2 0.1 if not valid_currency_mask.all(): 389 invalid = df.loc[~valid_currency_mask, "account"].tolist() 390 self._logger.warning( 391 f"Discarding {len(invalid)} accounts with invalid currencies: " 392 f"{first_elements_as_str(invalid)}." 393 ) 394 df = df.loc[valid_currency_mask] 395 396 # Validate tax codes 397 4 5.0 1.2 0.0 if tax_codes is None: 398 tax_codes = self.sanitize_tax_codes(self.tax_codes.list(), accounts=df) 399 4 963.0 240.8 0.2 valid_tax_codes = set(tax_codes["id"]) 400 401 4 8983.0 2245.8 1.8 invalid_tax_code_mask = df["tax_code"].notna() & ~df["tax_code"].isin(valid_tax_codes) 402 4 210.0 52.5 0.0 if invalid_tax_code_mask.any(): 403 invalid = df.loc[invalid_tax_code_mask, "account"].tolist() 404 self._logger.warning( 405 f"Found {len(invalid)} accounts with invalid tax code references: " 406 f"{first_elements_as_str(invalid)}. Setting 'tax_code' to NA for these entries." 407 ) 408 df.loc[invalid_tax_code_mask, "tax_code"] = pd.NA 409 410 4 959.0 239.8 0.2 return df.reset_index(drop=True) Total time: 0.609759 s File: /Users/oleksandrstepanenko/macx/pyledger/pyledger/ledger_engine.py Function: sanitized_accounts_tax_codes at line 412 Line # Hits Time Per Hit % Time Line Contents ============================================================== 412 @profile 413 def sanitized_accounts_tax_codes(self): 414 """Sanitize accounts and tax codes 415 416 Accounts and tax_codes are interdependent. This method validates both 417 entities, navigating their mutual dependencies in a multi-step process: 418 1. Sanitize accounts with raw tax_codes. 419 2. Sanitize tax_codes using the partially sanitized accounts. 420 3. Re-sanitize accounts with the validated tax_codes. 421 422 Logs warnings for each discarded or adjusted entry. 423 424 Returns: 425 Tuple[pd.DataFrame, pd.DataFrame]: The sanitized accounts and tax_codes DataFrames. 426 """ 427 # Step 1: Partial accounts sanitization 428 2 23241.0 11620.5 3.8 raw_accounts = self.accounts.list() 429 2 28029.0 14014.5 4.6 raw_tax_codes = self.tax_codes.list() 430 2 285611.0 142805.5 46.8 accounts_df_step1 = self.sanitize_accounts(raw_accounts, tax_codes=raw_tax_codes) 431 432 # Step 2: Sanitize tax_codes using partially sanitized accounts 433 2 51851.0 25925.5 8.5 tax_codes_df = self.sanitize_tax_codes(raw_tax_codes, accounts=accounts_df_step1) 434 435 # Step 3: Re-validate accounts with the now fully validated tax_codes 436 2 221024.0 110512.0 36.2 accounts_df_final = self.sanitize_accounts(raw_accounts, tax_codes=tax_codes_df) 437 438 2 3.0 1.5 0.0 return accounts_df_final, tax_codes_df Total time: 2.10072 s File: /Users/oleksandrstepanenko/macx/pyledger/pyledger/ledger_engine.py Function: sanitize_journal at line 852 Line # Hits Time Per Hit % Time Line Contents ============================================================== 852 @profile 853 def sanitize_journal(self, df: pd.DataFrame) -> pd.DataFrame: 854 """ 855 Discard incoherent journal data. 856 857 This method discards ledger transactions - entries in the journal data 858 frame with the same 'id' - that: 859 1. Do not balance to zero. 860 2. Span multiple dates. 861 3. Have neither 'account' nor 'contra' specified. 862 4. Omit a profit center when profit centers exist in the system. 863 5. Reference an invalid currency, account, contra account, or profit 864 center. 865 6. Do not match the currency of a referenced account or contra account, 866 and the account or contra account is not denominated in 867 reporting currency. 868 7. Lack a valid price reference when 'report_amount' is missing and 869 the entry is in a non-reporting currency. 870 871 Also, undefined tax code references are removed from journal entries. 872 873 A warning specifying the reason is logged for each discarded entry. 874 875 Args: 876 df (pd.DataFrame): Journal data to sanitize. 877 878 Returns: 879 pd.DataFrame: Sanitized journal data containing only valid entries. 880 """ 881 882 1 21469.0 21469.0 1.0 df = enforce_schema(df, JOURNAL_SCHEMA, keep_extra_columns=True) 883 884 1 2.0 2.0 0.0 invalid_ids = set() 885 1 7046.0 7046.0 0.3 invalid_ids = self._invalid_multidate_txns(df, invalid_ids) 886 1 317890.0 317890.0 15.1 self._invalid_tax_codes(df, invalid_ids) 887 1 308371.0 308371.0 14.7 invalid_ids = self._invalid_accounts(df, invalid_ids) 888 1 315454.0 315454.0 15.0 invalid_ids = self._invalid_assets(df, invalid_ids) 889 1 266966.0 266966.0 12.7 invalid_ids = self._invalid_currency(df, invalid_ids) 890 1 359301.0 359301.0 17.1 invalid_ids = self._invalid_prices(df, invalid_ids) 891 1 31714.0 31714.0 1.5 invalid_ids = self._invalid_profit_centers(df, invalid_ids) 892 1 433479.0 433479.0 20.6 df["report_amount"] = self._fill_report_amounts(df, invalid_ids) 893 1 18268.0 18268.0 0.9 invalid_ids = self._unbalanced_report_amounts(df, invalid_ids) 894 895 1 20755.0 20755.0 1.0 return df.query("id not in @invalid_ids").reset_index(drop=True) Total time: 0.317841 s File: /Users/oleksandrstepanenko/macx/pyledger/pyledger/ledger_engine.py Function: _invalid_tax_codes at line 910 Line # Hits Time Per Hit % Time Line Contents ============================================================== 910 @profile 911 def _invalid_tax_codes(self, df: pd.DataFrame, invalid_ids: set) -> None: 912 """Drop undefined 'tax_code' references.""" 913 1 314021.0 314021.0 98.8 _, tax_codes = self.sanitized_accounts_tax_codes() 914 1 1483.0 1483.0 0.5 invalid_tax_code_mask = ~df["tax_code"].isna() & ~df["tax_code"].isin(set(tax_codes["id"])) 915 1 923.0 923.0 0.3 new_invalid_ids = set(df.loc[invalid_tax_code_mask, "id"]) - invalid_ids 916 1 0.0 0.0 0.0 if new_invalid_ids: 917 self._logger.warning( 918 f"Setting 'tax_code' to 'NA' for {len(new_invalid_ids)} journal entries " 919 f"with invalid tax codes: {first_elements_as_str(new_invalid_ids)}" 920 ) 921 1 1414.0 1414.0 0.4 df.loc[invalid_tax_code_mask, "tax_code"] = pd.NA Total time: 0.308293 s File: /Users/oleksandrstepanenko/macx/pyledger/pyledger/ledger_engine.py Function: _invalid_accounts at line 923 Line # Hits Time Per Hit % Time Line Contents ============================================================== 923 @profile 924 def _invalid_accounts(self, df: pd.DataFrame, invalid_ids: set) -> set: 925 """ 926 Mark transactions with invalid accounts: 927 - both 'account' and 'contra' are missing, or 928 - a referenced account is not defined.""" 929 1 295801.0 295801.0 95.9 accounts, _ = self.sanitized_accounts_tax_codes() 930 1 204.0 204.0 0.1 accounts_set = set(accounts["account"]) 931 932 # Transactions missing both 'account' and 'contra' 933 1 1748.0 1748.0 0.6 missing_mask = df["account"].isna() & df["contra"].isna() 934 1 1442.0 1442.0 0.5 missing_ids = set(df.loc[missing_mask, "id"]) - invalid_ids 935 1 1.0 1.0 0.0 if missing_ids: 936 self._logger.warning( 937 f"Discarding {len(missing_ids)} journal entries with neither 'account' " 938 f"nor 'contra' specified: {first_elements_as_str(missing_ids)}" 939 ) 940 invalid_ids = invalid_ids.union(missing_ids) 941 942 # Transactions with invalid account or contra references 943 1 3479.0 3479.0 1.1 invalid_account_mask = ~df["account"].isna() & ~df["account"].isin(accounts_set) 944 1 3362.0 3362.0 1.1 invalid_contra_mask = ~df["contra"].isna() & ~df["contra"].isin(accounts_set) 945 1 739.0 739.0 0.2 ref_mask = invalid_account_mask | invalid_contra_mask 946 1 1514.0 1514.0 0.5 ref_ids = set(df.loc[ref_mask, "id"]) - invalid_ids - missing_ids 947 1 1.0 1.0 0.0 if ref_ids: 948 self._logger.warning( 949 f"Discarding {len(ref_ids)} journal entries with invalid account " 950 f"or contra references: {first_elements_as_str(ref_ids)}" 951 ) 952 invalid_ids = invalid_ids.union(ref_ids) 953 954 1 2.0 2.0 0.0 return invalid_ids Total time: 0.315418 s File: /Users/oleksandrstepanenko/macx/pyledger/pyledger/ledger_engine.py Function: _invalid_assets at line 956 Line # Hits Time Per Hit % Time Line Contents ============================================================== 956 @profile 957 def _invalid_assets(self, df: pd.DataFrame, invalid_ids: set) -> set: 958 """Mark transactions with invalid asset references.""" 959 1 2.0 2.0 0.0 def is_invalid(row): 960 if row["id"] in invalid_ids: 961 return True 962 try: 963 self.precision(ticker=row["currency"], date=row["date"]) 964 return False 965 except ValueError: 966 return True 967 968 1 313400.0 313400.0 99.4 row_mask = df.apply(is_invalid, axis=1, result_type="reduce") 969 1 2013.0 2013.0 0.6 new_invalid_ids = set(df.loc[row_mask, "id"]) - invalid_ids 970 1 1.0 1.0 0.0 if new_invalid_ids: 971 self._logger.warning( 972 f"Discarding {len(new_invalid_ids)} journal entries with invalid currency: " 973 f"{first_elements_as_str(new_invalid_ids)}" 974 ) 975 invalid_ids = invalid_ids.union(new_invalid_ids) 976 977 1 2.0 2.0 0.0 return invalid_ids Total time: 0.266922 s File: /Users/oleksandrstepanenko/macx/pyledger/pyledger/ledger_engine.py Function: _invalid_currency at line 979 Line # Hits Time Per Hit % Time Line Contents ============================================================== 979 @profile 980 def _invalid_currency(self, df: pd.DataFrame, invalid_ids: set) -> set: 981 """Mark transactions with currency mismatched to account or contra account.""" 982 1 3.0 3.0 0.0 def is_invalid(row): 983 if row["id"] in invalid_ids: 984 return True 985 if row["amount"] == 0: 986 return False 987 if pd.notna(row["account"]): 988 account_currency = self.account_currency(row["account"]) 989 if ((account_currency != self.reporting_currency) 990 and (row["currency"] != account_currency)): 991 return True 992 if pd.notna(row["contra"]): 993 contra_currency = self.account_currency(row["contra"]) 994 if ((contra_currency != self.reporting_currency) 995 and (row["currency"] != contra_currency)): 996 return True 997 return False 998 999 1 265205.0 265205.0 99.4 invalid_currency = df.apply(is_invalid, axis=1, result_type='reduce') 1000 1 1711.0 1711.0 0.6 new_invalid_ids = set(df.loc[invalid_currency, "id"]) - invalid_ids 1001 1 1.0 1.0 0.0 if new_invalid_ids: 1002 self._logger.warning( 1003 f"Discarding {len(new_invalid_ids)} journal entries with mismatched transaction " 1004 f"currency: {first_elements_as_str(new_invalid_ids)}" 1005 ) 1006 invalid_ids = invalid_ids.union(new_invalid_ids) 1007 1008 1 2.0 2.0 0.0 return invalid_ids Total time: 0.069579 s File: /Users/oleksandrstepanenko/macx/pyledger/pyledger/ledger_engine.py Function: _assets_as_dict_of_df at line 1453 Line # Hits Time Per Hit % Time Line Contents ============================================================== 1453 @property 1454 @timed_cache(120) 1455 @profile 1456 def _assets_as_dict_of_df(self) -> Dict[str, pd.DataFrame]: 1457 """Organize assets by ticker for quick access. 1458 1459 Splits assets by ticker for efficient lookup of increments by ticker 1460 and date. 1461 1462 Returns: 1463 Dict[str, pd.DataFrame]: Maps each asset ticker to a DataFrame of 1464 its `increment` history, sorted by `date` with `NaT` values first. 1465 """ 1466 2 1.0 0.5 0.0 return { 1467 22 28.0 1.3 0.0 ticker: ( 1468 11 14638.0 1330.7 21.0 group[["date", "increment"]] 1469 11 6446.0 586.0 9.3 .sort_values("date", na_position="first") 1470 11 1745.0 158.6 2.5 .reset_index(drop=True) 1471 ) 1472 13 46721.0 3593.9 67.1 for ticker, group in self.sanitize_assets(self.assets.list()).groupby("ticker") 1473 } Total time: 1.07578 s File: /Users/oleksandrstepanenko/macx/pyledger/pyledger/ledger_engine.py Function: precision at line 1475 Line # Hits Time Per Hit % Time Line Contents ============================================================== 1475 @profile 1476 def precision(self, ticker: str, date: datetime.date = None) -> float: 1477 """Returns the smallest price increment of an asset or currency. 1478 1479 This is the precision, to which prices should be rounded. 1480 1481 Args: 1482 ticker (str): Identifier of the currency or asset. 1483 date (datetime.date, optional): Date for which to retrieve the precision. 1484 Defaults to today's date. 1485 1486 Returns: 1487 float: The smallest price increment. 1488 """ 1489 1490 165 302.0 1.8 0.0 if ticker == "reporting_currency": 1491 ticker = self.reporting_currency 1492 1493 165 176.0 1.1 0.0 if date is None: 1494 70 1177.0 16.8 0.1 date = datetime.date.today() 1495 95 251.0 2.6 0.0 elif not isinstance(date, datetime.date): 1496 date = pd.to_datetime(date).date() 1497 1498 165 576958.0 3496.7 53.6 asset = self._assets_as_dict_of_df.get(ticker) 1499 165 256.0 1.6 0.0 if asset is None: 1500 # Asset is not defined by the user, fall back to hard-coded defaults 1501 increment = DEFAULT_ASSETS.loc[DEFAULT_ASSETS["ticker"] == ticker, "increment"] 1502 if len(increment) < 1: 1503 raise ValueError(f"No asset definition available for ticker '{ticker}'.") 1504 if len(increment) > 1: 1505 raise ValueError(f"Multiple default definitions for asset '{ticker}'.") 1506 return increment.item() 1507 else: 1508 165 347391.0 2105.4 32.3 mask = asset["date"].isna() | (asset["date"] <= pd.Timestamp(date)) 1509 165 12699.0 77.0 1.2 if not mask.any(): 1510 raise ValueError( 1511 f"No asset definition available for '{ticker}' on or before {date}." 1512 ) 1513 165 136573.0 827.7 12.7 return asset.loc[mask[mask].index[-1], "increment"].item() (.dev) oleksandrstepanenko@Alex-S pyledger % ```
Comment by ChatGPT **Total Time:** 2.101s **Main Focus:** `sanitize_journal` | Component | Time (s) | Notes | |:----------------------------|:--------:|:------| | `_invalid_tax_codes` | 0.318 | Resanitizes accounts/tax codes (heavy precision lookup). | | `_invalid_assets` | 0.316 | `.apply()` over rows, precision checks. | | `_invalid_accounts` | 0.308 | Accounts validation with precision usage. | | `_invalid_prices` | 0.359 | Price validation, uses `sanitize_prices` + precision. | | `_invalid_currency` | 0.267 | Currency mismatch checks (no direct precision). | | `_fill_report_amounts` | 0.434 | Computes `report_amount` with rounding and price. | **Notes:** - All methods except `_invalid_currency` use `precision()`, often indirectly via `round_to_precision`, `sanitize_prices`, `account_currency`, or asset validation. - Precision overhead dominates: `precision()` involves `timed_cache`, deep copies, and per-call lookups on sanitized assets. - `_assets_as_dict_of_df` is still costly inside `precision`, despite caching, due to per-call deep copying and groupby operations. - Deepcopy bottlenecks persist across cached data, `enforce_schema`, and DataFrame `.apply()` traversals. - Heavy chaining of wrappers (e.g., decorator → profiling → cache → deepcopy) amplifies even "small" operations inside critical loops.

Based on my analysis, it appears that the primary bottleneck is now the precision() method, which is called in all other private submethods. However, let's first confirm this by reviewing your profiling results as well.

lasuk commented 2 weeks ago

Running the same command on my MacBook Air M2, using the full macxred accounting data:

Profiling Output ``` _ ._ __/__ _ _ _ _ _/_ Recorded: 21:17:07 Samples: 12014 /_//_/// /_\ / //_// / //_'/ // Duration: 12.131 CPU time: 12.101 / _/ v5.0.1 Profile at profiling.py:8 12.130 profiling.py:1 └─ 12.130 wrapper line_profiler/line_profiler.py:146 └─ 12.130 TextLedger.sanitize_journal pyledger/ledger_engine.py:852 ├─ 6.617 wrapper line_profiler/line_profiler.py:146 │ ├─ 3.339 TextLedger._invalid_currency pyledger/ledger_engine.py:979 │ │ └─ 3.339 DataFrame.apply pandas/core/frame.py:10170 │ │ [2 frames hidden] pandas │ │ 3.338 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ │ └─ 3.306 is_invalid pyledger/ledger_engine.py:982 │ │ ├─ 3.058 TextLedger.account_currency pyledger/ledger_engine.py:440 │ │ │ ├─ 1.499 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ │ │ [22 frames hidden] pandas │ │ │ ├─ 0.574 wrapper pyledger/decorators.py:22 │ │ │ │ └─ 0.558 deepcopy copy.py:118 │ │ │ │ [7 frames hidden] pandas │ │ │ ├─ 0.515 Series.new_method pandas/core/ops/common.py:62 │ │ │ │ [4 frames hidden] pandas │ │ │ └─ 0.370 DataFrame.__getitem__ pandas/core/frame.py:4062 │ │ │ [3 frames hidden] pandas │ │ └─ 0.132 Series.__getitem__ pandas/core/series.py:1095 │ ├─ 2.543 TextLedger._invalid_assets pyledger/ledger_engine.py:956 │ │ └─ 2.542 DataFrame.apply pandas/core/frame.py:10170 │ │ [2 frames hidden] pandas │ │ 2.542 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ │ └─ 2.515 is_invalid pyledger/ledger_engine.py:959 │ │ └─ 2.458 wrapper line_profiler/line_profiler.py:146 │ │ └─ 2.450 TextLedger.precision pyledger/ledger_engine.py:1475 │ │ ├─ 0.902 wrapper pyledger/decorators.py:22 │ │ │ └─ 0.898 deepcopy copy.py:118 │ │ │ [9 frames hidden] copy, pandas │ │ ├─ 0.637 Series.new_method pandas/core/ops/common.py:62 │ │ │ [9 frames hidden] pandas │ │ ├─ 0.249 Series.__getitem__ pandas/core/series.py:1095 │ │ │ └─ 0.142 check_bool_indexer pandas/core/indexing.py:2632 │ │ ├─ 0.218 DataFrame.__getitem__ pandas/core/frame.py:4062 │ │ │ [2 frames hidden] pandas │ │ ├─ 0.216 Series.isna pandas/core/series.py:5773 │ │ │ [4 frames hidden] pandas │ │ └─ 0.148 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ ├─ 0.412 TextLedger._invalid_tax_codes pyledger/ledger_engine.py:910 │ │ └─ 0.409 wrapper line_profiler/line_profiler.py:146 │ │ └─ 0.409 TextLedger.sanitized_accounts_tax_codes pyledger/ledger_engine.py:412 │ │ └─ 0.389 wrapper line_profiler/line_profiler.py:146 │ │ └─ 0.389 TextLedger.sanitize_accounts pyledger/ledger_engine.py:357 │ │ └─ 0.377 validate_currency pyledger/ledger_engine.py:376 │ │ └─ 0.377 DataFrame.apply pandas/core/frame.py:10170 │ │ [2 frames hidden] pandas │ │ 0.376 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ │ └─ 0.372 is_valid pyledger/ledger_engine.py:378 │ │ └─ 0.372 wrapper line_profiler/line_profiler.py:146 │ │ └─ 0.372 TextLedger.precision pyledger/ledger_engine.py:1475 │ │ └─ 0.134 wrapper pyledger/decorators.py:22 │ └─ 0.323 TextLedger._invalid_accounts pyledger/ledger_engine.py:923 │ └─ 0.321 wrapper line_profiler/line_profiler.py:146 │ └─ 0.321 TextLedger.sanitized_accounts_tax_codes pyledger/ledger_engine.py:412 │ └─ 0.315 wrapper line_profiler/line_profiler.py:146 │ └─ 0.315 TextLedger.sanitize_accounts pyledger/ledger_engine.py:357 │ └─ 0.306 validate_currency pyledger/ledger_engine.py:376 │ └─ 0.306 DataFrame.apply pandas/core/frame.py:10170 │ [2 frames hidden] pandas │ 0.305 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 │ └─ 0.302 is_valid pyledger/ledger_engine.py:378 │ └─ 0.299 wrapper line_profiler/line_profiler.py:146 │ └─ 0.299 TextLedger.precision pyledger/ledger_engine.py:1475 ├─ 3.654 TextLedger._fill_report_amounts pyledger/ledger_engine.py:1075 │ └─ 3.642 TextLedger.report_amount pyledger/ledger_engine.py:1281 │ ├─ 2.096 TextLedger.round_to_precision pyledger/ledger_engine.py:1231 │ │ └─ 2.091 round_scalar pyledger/ledger_engine.py:1254 │ │ └─ 2.076 wrapper line_profiler/line_profiler.py:146 │ │ └─ 2.066 TextLedger.precision pyledger/ledger_engine.py:1475 │ │ ├─ 0.757 wrapper pyledger/decorators.py:22 │ │ │ └─ 0.754 deepcopy copy.py:118 │ │ │ [10 frames hidden] copy, pandas │ │ ├─ 0.531 Series.new_method pandas/core/ops/common.py:62 │ │ │ [8 frames hidden] pandas │ │ ├─ 0.238 Series.__getitem__ pandas/core/series.py:1095 │ │ │ └─ 0.146 check_bool_indexer pandas/core/indexing.py:2632 │ │ ├─ 0.179 DataFrame.__getitem__ pandas/core/frame.py:4062 │ │ │ [2 frames hidden] pandas │ │ └─ 0.177 Series.isna pandas/core/series.py:5773 │ │ [3 frames hidden] pandas │ └─ 1.529 TextLedger.price pyledger/ledger_engine.py:1358 │ ├─ 0.754 wrapper pyledger/decorators.py:22 │ │ └─ 0.747 deepcopy copy.py:118 │ │ [9 frames hidden] copy, pandas │ ├─ 0.307 DatetimeProperties.f pandas/core/accessor.py:111 │ │ [6 frames hidden] pandas │ ├─ 0.262 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ │ [5 frames hidden] pandas │ └─ 0.135 Series.new_method pandas/core/ops/common.py:62 │ [2 frames hidden] pandas └─ 1.840 TextLedger._invalid_prices pyledger/ledger_engine.py:1010 └─ 1.840 DataFrame.apply pandas/core/frame.py:10170 [2 frames hidden] pandas 1.838 FrameColumnApply.apply_series_generator pandas/core/apply.py:1070 └─ 1.826 is_invalid pyledger/ledger_engine.py:1012 └─ 1.729 TextLedger.price pyledger/ledger_engine.py:1358 ├─ 0.956 wrapper pyledger/decorators.py:22 │ ├─ 0.787 deepcopy copy.py:118 │ │ [10 frames hidden] copy, pandas │ └─ 0.165 TextLedger._prices_as_dict_of_df pyledger/ledger_engine.py:1403 │ └─ 0.154 TextLedger.sanitize_prices pyledger/ledger_engine.py:1312 │ └─ 0.149 invalid_asset_reference pyledger/ledger_engine.py:1328 │ └─ 0.147 is_valid pyledger/ledger_engine.py:1330 │ └─ 0.146 wrapper line_profiler/line_profiler.py:146 │ └─ 0.145 TextLedger.precision pyledger/ledger_engine.py:1475 ├─ 0.314 DatetimeProperties.f pandas/core/accessor.py:111 │ [6 frames hidden] pandas ├─ 0.269 _LocIndexer.__getitem__ pandas/core/indexing.py:1176 │ [5 frames hidden] pandas └─ 0.128 Series.new_method pandas/core/ops/common.py:62 [2 frames hidden] pandas Wrote profile results to profiling.py.lprof Timer unit: 1e-06 s Total time: 0.703274 s File: /Users/lukas/macx/pyledger/pyledger/ledger_engine.py Function: sanitize_accounts at line 357 Line # Hits Time Per Hit % Time Line Contents ============================================================== 357 @profile 358 def sanitize_accounts(self, df: pd.DataFrame, tax_codes: pd.DataFrame = None) -> pd.DataFrame: 359 """Discard incoherent account data. 360 361 Discard accounts with invalid currencies, and set invalid tax code references 362 to NA. Removed or modified entries are logged as warnings. 363 364 If no tax_codes DataFrame is provided, it will be fetched via the `tax_codes.list()` method, 365 as the validity of accounts may depend on referenced tax codes. 366 367 Args: 368 df (pd.DataFrame): The raw accounts DataFrame to validate. 369 tax_codes (pd.DataFrame, optional): Tax codes DataFrame for reference validation. 370 371 Returns: 372 pd.DataFrame: The sanitized DataFrame with valid account entries. 373 """ 374 4 17063.0 4265.8 2.4 df = enforce_schema(df, ACCOUNT_SCHEMA, keep_extra_columns=True) 375 376 4 4.0 1.0 0.0 def validate_currency() -> pd.Series: 377 """Validate that each account's currency is supported using the precision() method.""" 378 def is_valid(row): 379 try: 380 self.precision(ticker=row["currency"]) 381 return True 382 except ValueError: 383 return False 384 return df.apply(is_valid, axis=1) 385 386 # Validate currencies 387 4 683107.0 170776.8 97.1 valid_currency_mask = validate_currency() 388 4 50.0 12.5 0.0 if not valid_currency_mask.all(): 389 invalid = df.loc[~valid_currency_mask, "account"].tolist() 390 self._logger.warning( 391 f"Discarding {len(invalid)} accounts with invalid currencies: " 392 f"{first_elements_as_str(invalid)}." 393 ) 394 df = df.loc[valid_currency_mask] 395 396 # Validate tax codes 397 4 0.0 0.0 0.0 if tax_codes is None: 398 tax_codes = self.sanitize_tax_codes(self.tax_codes.list(), accounts=df) 399 4 258.0 64.5 0.0 valid_tax_codes = set(tax_codes["id"]) 400 401 4 2305.0 576.2 0.3 invalid_tax_code_mask = df["tax_code"].notna() & ~df["tax_code"].isin(valid_tax_codes) 402 4 78.0 19.5 0.0 if invalid_tax_code_mask.any(): 403 invalid = df.loc[invalid_tax_code_mask, "account"].tolist() 404 self._logger.warning( 405 f"Found {len(invalid)} accounts with invalid tax code references: " 406 f"{first_elements_as_str(invalid)}. Setting 'tax_code' to NA for these entries." 407 ) 408 df.loc[invalid_tax_code_mask, "tax_code"] = pd.NA 409 410 4 409.0 102.2 0.1 return df.reset_index(drop=True) Total time: 0.730885 s File: /Users/lukas/macx/pyledger/pyledger/ledger_engine.py Function: sanitized_accounts_tax_codes at line 412 Line # Hits Time Per Hit % Time Line Contents ============================================================== 412 @profile 413 def sanitized_accounts_tax_codes(self): 414 """Sanitize accounts and tax codes 415 416 Accounts and tax_codes are interdependent. This method validates both 417 entities, navigating their mutual dependencies in a multi-step process: 418 1. Sanitize accounts with raw tax_codes. 419 2. Sanitize tax_codes using the partially sanitized accounts. 420 3. Re-sanitize accounts with the validated tax_codes. 421 422 Logs warnings for each discarded or adjusted entry. 423 424 Returns: 425 Tuple[pd.DataFrame, pd.DataFrame]: The sanitized accounts and tax_codes DataFrames. 426 """ 427 # Step 1: Partial accounts sanitization 428 2 6594.0 3297.0 0.9 raw_accounts = self.accounts.list() 429 2 7220.0 3610.0 1.0 raw_tax_codes = self.tax_codes.list() 430 2 372062.0 186031.0 50.9 accounts_df_step1 = self.sanitize_accounts(raw_accounts, tax_codes=raw_tax_codes) 431 432 # Step 2: Sanitize tax_codes using partially sanitized accounts 433 2 13744.0 6872.0 1.9 tax_codes_df = self.sanitize_tax_codes(raw_tax_codes, accounts=accounts_df_step1) 434 435 # Step 3: Re-validate accounts with the now fully validated tax_codes 436 2 331265.0 165632.5 45.3 accounts_df_final = self.sanitize_accounts(raw_accounts, tax_codes=tax_codes_df) 437 438 2 0.0 0.0 0.0 return accounts_df_final, tax_codes_df Total time: 12.1311 s File: /Users/lukas/macx/pyledger/pyledger/ledger_engine.py Function: sanitize_journal at line 852 Line # Hits Time Per Hit % Time Line Contents ============================================================== 852 @profile 853 def sanitize_journal(self, df: pd.DataFrame) -> pd.DataFrame: 854 """ 855 Discard incoherent journal data. 856 857 This method discards ledger transactions - entries in the journal data 858 frame with the same 'id' - that: 859 1. Do not balance to zero. 860 2. Span multiple dates. 861 3. Have neither 'account' nor 'contra' specified. 862 4. Omit a profit center when profit centers exist in the system. 863 5. Reference an invalid currency, account, contra account, or profit 864 center. 865 6. Do not match the currency of a referenced account or contra account, 866 and the account or contra account is not denominated in 867 reporting currency. 868 7. Lack a valid price reference when 'report_amount' is missing and 869 the entry is in a non-reporting currency. 870 871 Also, undefined tax code references are removed from journal entries. 872 873 A warning specifying the reason is logged for each discarded entry. 874 875 Args: 876 df (pd.DataFrame): Journal data to sanitize. 877 878 Returns: 879 pd.DataFrame: Sanitized journal data containing only valid entries. 880 """ 881 882 1 8876.0 8876.0 0.1 df = enforce_schema(df, JOURNAL_SCHEMA, keep_extra_columns=True) 883 884 1 0.0 0.0 0.0 invalid_ids = set() 885 1 3087.0 3087.0 0.0 invalid_ids = self._invalid_multidate_txns(df, invalid_ids) 886 1 411656.0 411656.0 3.4 self._invalid_tax_codes(df, invalid_ids) 887 1 323325.0 323325.0 2.7 invalid_ids = self._invalid_accounts(df, invalid_ids) 888 1 2542811.0 3e+06 21.0 invalid_ids = self._invalid_assets(df, invalid_ids) 889 1 3339395.0 3e+06 27.5 invalid_ids = self._invalid_currency(df, invalid_ids) 890 1 1839535.0 2e+06 15.2 invalid_ids = self._invalid_prices(df, invalid_ids) 891 1 2883.0 2883.0 0.0 invalid_ids = self._invalid_profit_centers(df, invalid_ids) 892 1 3653879.0 4e+06 30.1 df["report_amount"] = self._fill_report_amounts(df, invalid_ids) 893 1 3148.0 3148.0 0.0 invalid_ids = self._unbalanced_report_amounts(df, invalid_ids) 894 895 1 2500.0 2500.0 0.0 return df.query("id not in @invalid_ids").reset_index(drop=True) Total time: 0.411636 s File: /Users/lukas/macx/pyledger/pyledger/ledger_engine.py Function: _invalid_tax_codes at line 910 Line # Hits Time Per Hit % Time Line Contents ============================================================== 910 @profile 911 def _invalid_tax_codes(self, df: pd.DataFrame, invalid_ids: set) -> None: 912 """Drop undefined 'tax_code' references.""" 913 1 409301.0 409301.0 99.4 _, tax_codes = self.sanitized_accounts_tax_codes() 914 1 1345.0 1345.0 0.3 invalid_tax_code_mask = ~df["tax_code"].isna() & ~df["tax_code"].isin(set(tax_codes["id"])) 915 1 309.0 309.0 0.1 new_invalid_ids = set(df.loc[invalid_tax_code_mask, "id"]) - invalid_ids 916 1 0.0 0.0 0.0 if new_invalid_ids: 917 self._logger.warning( 918 f"Setting 'tax_code' to 'NA' for {len(new_invalid_ids)} journal entries " 919 f"with invalid tax codes: {first_elements_as_str(new_invalid_ids)}" 920 ) 921 1 681.0 681.0 0.2 df.loc[invalid_tax_code_mask, "tax_code"] = pd.NA Total time: 0.323302 s File: /Users/lukas/macx/pyledger/pyledger/ledger_engine.py Function: _invalid_accounts at line 923 Line # Hits Time Per Hit % Time Line Contents ============================================================== 923 @profile 924 def _invalid_accounts(self, df: pd.DataFrame, invalid_ids: set) -> set: 925 """ 926 Mark transactions with invalid accounts: 927 - both 'account' and 'contra' are missing, or 928 - a referenced account is not defined.""" 929 1 321610.0 321610.0 99.5 accounts, _ = self.sanitized_accounts_tax_codes() 930 1 40.0 40.0 0.0 accounts_set = set(accounts["account"]) 931 932 # Transactions missing both 'account' and 'contra' 933 1 234.0 234.0 0.1 missing_mask = df["account"].isna() & df["contra"].isna() 934 1 276.0 276.0 0.1 missing_ids = set(df.loc[missing_mask, "id"]) - invalid_ids 935 1 0.0 0.0 0.0 if missing_ids: 936 self._logger.warning( 937 f"Discarding {len(missing_ids)} journal entries with neither 'account' " 938 f"nor 'contra' specified: {first_elements_as_str(missing_ids)}" 939 ) 940 invalid_ids = invalid_ids.union(missing_ids) 941 942 # Transactions with invalid account or contra references 943 1 416.0 416.0 0.1 invalid_account_mask = ~df["account"].isna() & ~df["account"].isin(accounts_set) 944 1 374.0 374.0 0.1 invalid_contra_mask = ~df["contra"].isna() & ~df["contra"].isin(accounts_set) 945 1 95.0 95.0 0.0 ref_mask = invalid_account_mask | invalid_contra_mask 946 1 256.0 256.0 0.1 ref_ids = set(df.loc[ref_mask, "id"]) - invalid_ids - missing_ids 947 1 0.0 0.0 0.0 if ref_ids: 948 self._logger.warning( 949 f"Discarding {len(ref_ids)} journal entries with invalid account " 950 f"or contra references: {first_elements_as_str(ref_ids)}" 951 ) 952 invalid_ids = invalid_ids.union(ref_ids) 953 954 1 1.0 1.0 0.0 return invalid_ids Total time: 2.5428 s File: /Users/lukas/macx/pyledger/pyledger/ledger_engine.py Function: _invalid_assets at line 956 Line # Hits Time Per Hit % Time Line Contents ============================================================== 956 @profile 957 def _invalid_assets(self, df: pd.DataFrame, invalid_ids: set) -> set: 958 """Mark transactions with invalid asset references.""" 959 1 0.0 0.0 0.0 def is_invalid(row): 960 if row["id"] in invalid_ids: 961 return True 962 try: 963 self.precision(ticker=row["currency"], date=row["date"]) 964 return False 965 except ValueError: 966 return True 967 968 1 2542491.0 3e+06 100.0 row_mask = df.apply(is_invalid, axis=1, result_type="reduce") 969 1 310.0 310.0 0.0 new_invalid_ids = set(df.loc[row_mask, "id"]) - invalid_ids 970 1 0.0 0.0 0.0 if new_invalid_ids: 971 self._logger.warning( 972 f"Discarding {len(new_invalid_ids)} journal entries with invalid currency: " 973 f"{first_elements_as_str(new_invalid_ids)}" 974 ) 975 invalid_ids = invalid_ids.union(new_invalid_ids) 976 977 1 0.0 0.0 0.0 return invalid_ids Total time: 3.33938 s File: /Users/lukas/macx/pyledger/pyledger/ledger_engine.py Function: _invalid_currency at line 979 Line # Hits Time Per Hit % Time Line Contents ============================================================== 979 @profile 980 def _invalid_currency(self, df: pd.DataFrame, invalid_ids: set) -> set: 981 """Mark transactions with currency mismatched to account or contra account.""" 982 1 0.0 0.0 0.0 def is_invalid(row): 983 if row["id"] in invalid_ids: 984 return True 985 if row["amount"] == 0: 986 return False 987 if pd.notna(row["account"]): 988 account_currency = self.account_currency(row["account"]) 989 if ((account_currency != self.reporting_currency) 990 and (row["currency"] != account_currency)): 991 return True 992 if pd.notna(row["contra"]): 993 contra_currency = self.account_currency(row["contra"]) 994 if ((contra_currency != self.reporting_currency) 995 and (row["currency"] != contra_currency)): 996 return True 997 return False 998 999 1 3339169.0 3e+06 100.0 invalid_currency = df.apply(is_invalid, axis=1, result_type='reduce') 1000 1 211.0 211.0 0.0 new_invalid_ids = set(df.loc[invalid_currency, "id"]) - invalid_ids 1001 1 0.0 0.0 0.0 if new_invalid_ids: 1002 self._logger.warning( 1003 f"Discarding {len(new_invalid_ids)} journal entries with mismatched transaction " 1004 f"currency: {first_elements_as_str(new_invalid_ids)}" 1005 ) 1006 invalid_ids = invalid_ids.union(new_invalid_ids) 1007 1008 1 1.0 1.0 0.0 return invalid_ids Total time: 0.01587 s File: /Users/lukas/macx/pyledger/pyledger/ledger_engine.py Function: _assets_as_dict_of_df at line 1453 Line # Hits Time Per Hit % Time Line Contents ============================================================== 1453 @property 1454 @timed_cache(120) 1455 @profile 1456 def _assets_as_dict_of_df(self) -> Dict[str, pd.DataFrame]: 1457 """Organize assets by ticker for quick access. 1458 1459 Splits assets by ticker for efficient lookup of increments by ticker 1460 and date. 1461 1462 Returns: 1463 Dict[str, pd.DataFrame]: Maps each asset ticker to a DataFrame of 1464 its `increment` history, sorted by `date` with `NaT` values first. 1465 """ 1466 7 990.0 141.4 6.2 return { 1467 12 3.0 0.2 0.0 ticker: ( 1468 6 2376.0 396.0 15.0 group[["date", "increment"]] 1469 6 1082.0 180.3 6.8 .sort_values("date", na_position="first") 1470 6 302.0 50.3 1.9 .reset_index(drop=True) 1471 ) 1472 7 11117.0 1588.1 70.1 for ticker, group in self.sanitize_assets(self.assets.list()).groupby("ticker") 1473 } Total time: 5.30135 s File: /Users/lukas/macx/pyledger/pyledger/ledger_engine.py Function: precision at line 1475 Line # Hits Time Per Hit % Time Line Contents ============================================================== 1475 @profile 1476 def precision(self, ticker: str, date: datetime.date = None) -> float: 1477 """Returns the smallest price increment of an asset or currency. 1478 1479 This is the precision, to which prices should be rounded. 1480 1481 Args: 1482 ticker (str): Identifier of the currency or asset. 1483 date (datetime.date, optional): Date for which to retrieve the precision. 1484 Defaults to today's date. 1485 1486 Returns: 1487 float: The smallest price increment. 1488 """ 1489 1490 8546 1379.0 0.2 0.0 if ticker == "reporting_currency": 1491 ticker = self.reporting_currency 1492 1493 8546 1010.0 0.1 0.0 if date is None: 1494 870 2146.0 2.5 0.0 date = datetime.date.today() 1495 7676 2656.0 0.3 0.1 elif not isinstance(date, datetime.date): 1496 date = pd.to_datetime(date).date() 1497 1498 8546 1986858.0 232.5 37.5 asset = self._assets_as_dict_of_df.get(ticker) 1499 8546 1161.0 0.1 0.0 if asset is None: 1500 # Asset is not defined by the user, fall back to hard-coded defaults 1501 increment = DEFAULT_ASSETS.loc[DEFAULT_ASSETS["ticker"] == ticker, "increment"] 1502 if len(increment) < 1: 1503 raise ValueError(f"No asset definition available for ticker '{ticker}'.") 1504 if len(increment) > 1: 1505 raise ValueError(f"Multiple default definitions for asset '{ticker}'.") 1506 return increment.item() 1507 else: 1508 8546 2340053.0 273.8 44.1 mask = asset["date"].isna() | (asset["date"] <= pd.Timestamp(date)) 1509 8546 71235.0 8.3 1.3 if not mask.any(): 1510 raise ValueError( 1511 f"No asset definition available for '{ticker}' on or before {date}." 1512 ) 1513 8546 894852.0 104.7 16.9 return asset.loc[mask[mask].index[-1], "increment"].item() ```
ChatGPT Summary Here's a **concise but detailed summary** of your profiling output: --- ## **Overall Summary** - **Main focus**: Profiling of `sanitize_journal()` in `ledger_engine.py`. - **Total runtime**: ~12.13 seconds. - **Dominant cost**: 99%+ of the time is spent inside `sanitize_journal()` and its helpers. - **Top Bottlenecks** | Function / Area | Time (sec) | % of Total | Notes | | :--- | :--- | :--- | :--- | | `sanitize_journal` | 12.13 | 100% | Main entry point | | ├─ `_invalid_currency` | ~3.34 | ~27.5% | Checking account/contra currency mismatches | | ├─ `_invalid_assets` | ~2.54 | ~21% | Validating assets and currencies | | ├─ `_invalid_prices` | ~1.84 | ~15% | Checking prices/values | | ├─ `_fill_report_amounts` | ~3.65 | ~30.1% | Regenerating missing report amounts | | ├─ `_invalid_tax_codes` | ~0.41 | ~3.4% | Tax code validation | | ├─ `_invalid_accounts` | ~0.32 | ~2.7% | Account existence checks | | └─ `_invalid_profit_centers` | ~0.003 | Negligible | Profit center checks | ## **Deeper Causes** - **Pandas `.apply()`**: - Used extensively (slow for row-wise operations). - Heavy in `_invalid_currency`, `_invalid_assets`, `_invalid_prices`, etc. - **Precision Lookup (`precision()` method)**: - **Very costly**. - ~5.3s total runtime. - Frequent `deepcopy()` during asset lookup and precision calculation. - Accessing cached or grouped assets inside `_assets_as_dict_of_df` takes time. - **Deepcopy Overhead**: - Significant time inside `deepcopy()` when validating precision and price information. - Happens in decorators and asset/price fetching logic. --- ## **Highlights by Method** - **`sanitize_accounts()`** (minor part) - Spends 97% of time validating currencies via `.apply()`. - Minor time on tax code validation. - **`sanitized_accounts_tax_codes()`** (small) - Half the time (~51%) on first `sanitize_accounts()`. - Other half revalidating accounts after tax code sanitation. - **`precision()`**: - ~5.3s total. - Major hotspot. - 44% of time reading latest asset precision from cached data. - 37% of time inside asset dictionary lookup. ## **Main Takeaways** - **Biggest Cost**: Validations that call `.apply()` and require expensive asset precision lookups. - **Biggest Opportunity**: - Replace `.apply()` with vectorized operations if possible. - Cache precision values better (memoization) to avoid repeated `deepcopy()` and asset searches. - Optimize asset structure (indexed by ticker+date?) to make precision fetching cheaper. - **Secondary**: - Reduce unnecessary deepcopy operations. - Preload sanitized tax codes and accounts once instead of inside each invalidity check.