quentinsf / icsv2ledger

Interactive importing of CSV files to Ledger

--skip-dupes does not work #98

Closed mondjef closed 7 years ago

mondjef commented 7 years ago

I run the following command, enter a few transactions, Ctrl-C to exit, and then re-run the command, expecting the transactions I have already imported to be skipped as a result of the --skip-dupes option.

./icsv2ledger.py -c config -a PCF_CHQ_JEFF --skip-dupes --incremental --ledger-file /root/ledger/test.dat /root/ledger/csv_imports/PCF.csv /root/ledger/test1.dat

I am using the default embedded template, which includes the ; CSV comment line that, as I understand it, the --skip-dupes option uses for matching.

Here is an example of what gets added to the outfile. Am I missing something, or is this not the way --skip-dupes is supposed to work?

2016/04/08 * TRANSFER IN
    ; MD5Sum: 3817f6ad5b24d1c65cb08568b27d0d88
    ; CSV: 04/08/2016, TRANSFER IN,,198
    Liabilities:Bank:PCF:LoC
    Assets:Bank:PCF:Checking    $ 198

mondjef commented 7 years ago

OK, I am not a Python expert (yet... it's something I have been learning recently), but I think I have figured out the "why" of my issue, and it leaves me wondering what the behavior should be. Looking at the code, it appears the script reads the ledger file specified (in my case '/root/ledger/test.dat') and populates 'csv_comments', which the skip-dupes routine then uses for matching. This means the ledger file specified for retrieving account names and the like is also being used as the historical transaction file for finding duplicates, rather than the 'outfile'.

Because of this, when a user specifies a ledger file that is not the same as the 'outfile', any subsequent run over the same CSV file with the same options will produce duplicates in the output file. I am unsure what the expected behavior should be, as I am new to this:

- Should the specified ledger file be used to get existing transactions in addition to accounts and payees? If so, the ledger-file option description should be expanded to reflect this so that users are aware of it.
- Should another option be added to give users more control, i.e. the ability to use the ledger file and/or the outfile when looking for duplicate entries?
- Should existing transactions in the 'outfile' also be indexed at the start, to catch duplicates previously added to the 'outfile' from the same CSV file (see the sketch below)?
- Should we be concerned about duplicate entries coming from the same CSV file at all?
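A minimal sketch of that third option, with hypothetical names (collect_csv_comments is not icsv2ledger's actual API): seed duplicate detection by scanning both the reference ledger file and the outfile for "; CSV:" comment lines before importing.

# Hypothetical sketch: gather "; CSV:" comments from BOTH the reference
# ledger file and the outfile, so re-running over the same CSV skips
# entries already written to either file. Names are illustrative only.

def collect_csv_comments(*ledger_paths):
    csv_comments = set()
    for path in ledger_paths:
        try:
            with open(path) as f:
                for line in f:
                    stripped = line.strip()
                    if stripped.startswith("; CSV:"):
                        csv_comments.add(stripped[len("; CSV:"):].strip())
        except FileNotFoundError:
            pass  # the outfile may not exist yet on a first run
    return csv_comments

# e.g.:
# known = collect_csv_comments("/root/ledger/test.dat", "/root/ledger/test1.dat")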

In addition, here is the duplicate-entry issue I am actually trying to avoid, illustrated with the example above.

When I transfer money from my line of credit to my checking account, this is how it appears.

Seen from my checking account:

2016/04/08 * TRANSFER IN
    ; MD5Sum: 3817f6ad5b24d1c65cb08568b27d0d88
    ; CSV: 04/08/2016, TRANSFER IN,,198
    Liabilities:Bank:PCF:LoC
    Assets:Bank:PCF:Checking    $ 198

Seen from my line of credit account:

2016/04/08 * TRANSFER OUT
    ; MD5Sum: ?
    ; CSV: 04/08/2016, TRANSFER OUT,198,
    Liabilities:Bank:PCF:LoC    $ 198
    Assets:Bank:PCF:Checking

As you can see, I don't think matching on the CSV string would identify that, when I import transactions for my line of credit, this transaction has already been entered during the import for my checking account. Looking at the ledger transaction information, however, it is fairly clear that this is more than likely a duplicate. I would propose 'fuzzy' hashing the CSV line, plus a second hash over the ledger transaction information, using something like ssdeep, and then adding options so users can choose how duplicate detection works. Everything else about this script is great, but I need the duplicate detection to be robust enough to handle the way my accounts interact with one another. I will play around with the script to see if I can come up with something, but my Python skills are lacking, so I am not sure how successful I will be.
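To make the second hash concrete, here is a minimal sketch (all names hypothetical, not icsv2ledger's actual code) of fingerprinting a normalized form of the ledger entry, so that the same transfer seen from either account's CSV produces the same hash; the ssdeep comparison of the raw CSV lines is only noted in a comment, since it needs the third-party bindings.

# Hypothetical sketch: fingerprint the ledger side of a transaction by
# normalizing it (date + sorted accounts + absolute amounts), so the same
# transfer imported from either account's CSV hashes identically.
import hashlib

def ledger_fingerprint(date, postings):
    """postings: list of (account, amount-or-None) pairs from one entry."""
    accounts = sorted(account for account, _ in postings)
    amounts = sorted("%.2f" % abs(amt) for _, amt in postings if amt is not None)
    key = "|".join([date] + accounts + amounts)
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Both sides of the transfer above yield the same fingerprint:
chequing_view = ledger_fingerprint("2016/04/08",
    [("Liabilities:Bank:PCF:LoC", None),
     ("Assets:Bank:PCF:Checking", 198.00)])
loc_view = ledger_fingerprint("2016/04/08",
    [("Liabilities:Bank:PCF:LoC", 198.00),
     ("Assets:Bank:PCF:Checking", None)])
assert chequing_view == loc_view

# For the raw CSV lines, a fuzzy comparison could use the ssdeep bindings
# (third-party package), e.g.:
#   import ssdeep
#   score = ssdeep.compare(ssdeep.hash(line_a), ssdeep.hash(line_b))  # 0..100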

petdr commented 7 years ago

I model the above by doing the following:

2016/04/08 * TRANSFER IN
    Transfer
    Assets:Bank:PCF:Checking    $ 198

2016/04/08 * TRANSFER OUT
    Liabilities:Bank:PCF:LoC    $ 198
    Transfer

In other words, when transferring between accounts, you put the money into a special account called Transfer, and so you no longer end up with duplicate transactions. It also lets you transfer money between accounts you own at different banks, where the money disappears on Jan 1 but appears in the new account on Jan 3.
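For example (bank and account names made up), that cross-bank case could look like this, with the Transfer account carrying the balance for the two days in between:

2016/01/01 * TRANSFER OUT
    Transfer    $ 500
    Assets:Bank:A:Checking

2016/01/03 * TRANSFER IN
    Assets:Bank:B:Savings    $ 500
    Transfer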

Feel free to have a go at a fuzzier duplicate detection; I would be interested to see it.

mondjef commented 7 years ago

That's an interesting option and a viable workaround that had not even crossed my mind. I will go with this solution for now, as the Transfer account should always have a balance of $0 after all my accounts have been processed for a period. I would still like to improve the script where I can, though, for example with a clustering classifier to make a best guess at the payee and accounts, and with better duplicate detection. The Reckon Ruby gem for Ledger uses some of those techniques.

mondjef commented 7 years ago

Closing this for now, as the script works as intended. I will create a pull request if I come up with a more robust solution.