howeyc / ledger

Command line double-entry accounting program
https://howeyc.github.io/ledger/
ISC License

Improve import account matching #41

Closed porjo closed 1 year ago

porjo commented 1 year ago

Currently, import uses whichever account scores highest from the Bayesian classifier, even if that score is low confidence. This results in accounts being matched that have no relevance to the transaction.

Where there is a low confidence match, it would be better to default to account 'unknown:unknown' so we know there wasn't a match. Those transactions can be fixed manually later. This PR tracks the highest score and the second-highest score, and compares the two. If the difference is significant (greater than 10), we treat the high score as high confidence. Otherwise the account defaults to 'unknown:unknown'.

The value of 10 is just a rough guide. I noticed that low confidence matches scored within 2-3 points of each other. A partial match (e.g. a single common word was present) scored about 30 points above the low confidence matches. An exact match scored about 30 points above the partial match again.
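
For illustration, the comparison described above could be sketched roughly as follows. This is not the code from the PR: `pickAccount`, `minScoreGap`, and the example account names and scores are all hypothetical, and the sketch assumes the classifier hands back one score per candidate account, where a higher score means a better match. Falling back to 'unknown:unknown' when there are fewer than two candidates is also an assumption of the sketch.

```go
package main

import "fmt"

// minScoreGap is the rough threshold of 10 mentioned in the description.
// The name is hypothetical.
const minScoreGap = 10.0

// unknownAccount is the fallback account for low confidence matches.
const unknownAccount = "unknown:unknown"

// pickAccount (hypothetical helper) returns the highest-scoring account only
// if its score beats the second-highest score by more than minScoreGap.
// Otherwise it returns unknownAccount.
func pickAccount(accounts []string, scores []float64) string {
	// Assumption: with fewer than two candidates there is nothing to
	// compare against, so the sketch falls back to the unknown account.
	if len(accounts) < 2 || len(accounts) != len(scores) {
		return unknownAccount
	}

	best, second := 0, -1
	for i := 1; i < len(scores); i++ {
		switch {
		case scores[i] > scores[best]:
			// New high score: the previous best becomes the runner-up.
			second = best
			best = i
		case second == -1 || scores[i] > scores[second]:
			second = i
		}
	}

	if scores[best]-scores[second] <= minScoreGap {
		return unknownAccount
	}
	return accounts[best]
}

func main() {
	// Made-up account names and scores, for illustration only.
	accounts := []string{"Expenses:Food", "Expenses:Fuel", "Expenses:Rent"}

	// Top two scores are 2 points apart: low confidence.
	fmt.Println(pickAccount(accounts, []float64{-50, -48, -90})) // unknown:unknown

	// Top score leads the runner-up by 33 points: high confidence.
	fmt.Println(pickAccount(accounts, []float64{-80, -45, -78})) // Expenses:Fuel
}
```

Comparing the gap between the top two scores, rather than the top score alone, avoids depending on the absolute scale of the classifier's scores, which can vary with the length of the transaction description.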

howeyc commented 1 year ago

LGTM