fixed for current ING DiBa CSV files

codedump commented 3 years ago

Current DiBa files apparently have one extra column, and probably one extra line somewhere. Here's the header of one of mine, personal details removed, changed for Umlauts in UTF-8 for better display on the web:

Umsatzanzeige;Datei erstellt am: 26.02.2021 18:43
;Letztes Update: aktuell

IBAN;DExx xxxx xxxx ....
Kontoname;Girokonto
Bank;ING
Kunde;My Name
Zeitraum;02.02.2020 - 26.02.2021
Saldo;x.xxxx,xx;EUR

Sortierung;Datum absteigend

In der CSV-Datei finden Sie alle bereits gebuchten Umsätze. Die vorgemerkten Umsätze werden nicht aufgenommen, auch wenn sie in Ihrem Internetbanking angezeigt werden.

Buchung;Valuta;Auftraggeber/Empfänger;Buchungstext;Kategorie;Verwendungszweck;Saldo;Währung;Betrag;Währung
...

But the layout is the same.

siddhantgoel commented 3 years ago

This seems like a duplicate of https://github.com/siddhantgoel/beancount-ing-diba/issues/70 . Apparently in the ING interface you can switch categories on/off and accordingly the data shows up in the CSV files (or not).

As a temporary fix, you can disable categories in your account before exporting the data (and enable them again once you're done).

Thanks for taking a shot at this though! I think you would need to implement some sort of a fallback here i.e. check in case the header contains a Kategorie and then read the remaining entries accordingly.

codedump commented 3 years ago

I've implemented a dirty workaround for a 9 or 10 column CSV file.

A cleaner way would be to refactor the script and access the fields by their name, that would work regardless of the column setup. But that's probably overengineering, it's just supposed to import some text and be one with it :-)

Anyway, apparently tests fail, but I'm too inexperienced to figure out what exactly the problem is by clicking my way through the GitHub interface... any idea?

codedump commented 3 years ago

I've changed the whitespace thing, should be ok now. Regarding this:

I would've suggested having another tuple (similar to FIELDS defined in the beginning of this module) which contains an alternate set of field names (this one including Kategorie).

Assuming we have two tuples called FIELDS_ORIGINAL and FIELDS_CATEGORIZED (naming up for discussion), inside the for loop where we check line == list(FIELDS) (line 207), we could instead check what's the field set we're working with.

So if line == FIELDS_ORIGINAL, then unpack line using 9 variables. And if line == FIELDS_CATEGORIZED, then unpack line using 10 variables, ignoring the category. For now this should be sufficient since I personally don't know of more ways that ING can modify the export.

What do you think?

Frankly, if I were you, I'd leave it as it is. It works, it's tested, just call it a day.

BUT: since you directly ask for my opinion :) I can take some time to elaborate why I think this way.

Start with the question: what would be the goal of changing that?

More elegant code? Nope. Whatever you save in lines, you invest in more (unnecessary) program flow code (if...). That's actually the worst kind of code, because if you're being as serious about testing as you appear to be, you need to test for every branching -- getting an if wrong (== vs not ==, < vs <= vs >, etc) is usually where all the bugs are.
Improved readability? Nope. You now have to jump to 2 different locations to know exactly what the code actually does (definition of FIELDS_xxx and the function itself -- which is already 150 lines and by definition a readability nightmare to begin with). Given that most people who'll use your code are going to have at most a marginal contact, with no continuity, but will have to fix it fast when it breaks (because DiBa, fuck yeah! :smile: ), the ability to quickly scan what it's about is most important.
Maintainability? Nope. You still just cover exactly one single variation, namely exactly this version of "categorized" fields. If anything changes -- another field pops up, fields change order etc -- you're back at square 1. And your proposed solution doesn't scale any better in integrating future changes than the current.

So in essence, you're looking at several hours of refactoring (including testing etc) for no added benefit. As I said in the title, my fix is a "hack", and I don't believe that you (or anybody) should go deeper into that. As I understand it, this is just a quick'n'dirty parser. Sinking more time here is overengineering.

If you're hell bent on improving stuff just for the sake of self-improvement (given the flake8 testing harness I take it that you're trying to write good code :-), there's other lower-hanging fruit you can attend to. Here's what I'd do, but YMMV:

Kill the one-liner functions (_format_iban, _format_number_de). I undertand you're trying to avoid duplication, but -- again -- you're writing a small script. This is the equivalent of a Bash dozen-liner, not a large package like Beancount itself. You have at most 1-2 calls for each of these functions, it will never grow beyond that, and you're trading a (simple) explicit line for another simple, but less transparent, line. (As a sidenote: being "abstract" is usually good, but "abstract" is also the opposite of "specific", and it's "specific" things that your code ultimately needs to achieve in the real world. "Abstract" is just a way of doing that easier / cheaper / faster, and should go out of the way when it isn't helping.)
Reduce the extract() function -- it's 150 lines long, that's the biggest problem your code currently has with readability. Sometimes it's ok to have functions that long if they are sufficiently simple (no-brainers), but you also have about 3 or 4 levels of indentation, so it's by no means a "no-brainer".
Last but not least: change the data model. This is probably the most complex one, but worth pursuing if you seriously think about sinking more than 1 hour into this script. If I were you, I'd move away from explicit unpacking of tuples (var, i, able) = func()) to either index-based unpacking: tup = func() / (var, i, able) = (tup[0], tup[1], tup[-1]), which allows you to kill the current problem of a single spurious extra column without a single if by using negative indexing; or to something more "natural" for the problem space, like something dictionary-based, which gives you more robustness against column-shuffling. However, this may introduce problems of its own. It would essentially be something along the lines of:
```
...
# reading column header row
keys = [ k in line ]
...
# reading data rows
for line in reader:
    row = { k:v for k,v in zip(keys,line) } 

    # do the unpacking based on dictionary keys
    amount = Amount(row["Betrag"].replace(".",".").replace(",",""),
                    row["Währung"])
    ...
```
If you go for the last approach, you're pretty much immune to any row shuffling as long as the information you need is there, but you open yourself to problems if DiBa decides to rename rows. Also, you have to deal with non-unique row names (like "Währung"). In any case, you need to pick your poison :-)

But your current data model (direct unpacking) has outlived its usefulness the moment you decide to support two different sets of columns (i.e. with/without "Kategorie"). That's simply something it cannot represent, and any fix you're going to base on this model is necessarily going to be a hack.

In any case, if you play your cards right, you probably can get your whole script down to 100 lines (from currently >250), and extract() down to 50 or so, which easily fit into a single scroll screen on an average IDE. That's probably something worth investing into, if you want to spend an extra 4-5 hours or so. OTOH, the current code works, it's tested -- as I said, I'd call it a day. It's not that important, it's just a tiny parse module that anyone crazy enough to use Beancount can easily fix for themselves :man_shrugging:

Hope that helps! :-)

Cheers, F.

siddhantgoel commented 3 years ago

Instead of the long wall of text (and a few unnecessary personal attacks in some places), a simple "no this suggestion is not good but here's another non-hack of an approach" would've been more helpful. :slightly_smiling_face:

I'm not going to argue against the software design arguments you put forward. Mostly because such discussions tend to get philosophical and ultimately go nowhere. I do think that the concern about over-engineering is kind of justified. But I think the concern of this repository ending up being a dumping ground for "quick'n'dirty" hacks is equally justified.

I'm actually not sure why you feel the need to criticize the rest of the code, outside the scope of the "category" column. This code started out as a personal script and gets maybe 1 hour a month of development time, if at all. In any case, here are some inline comments:

Kill the one-liner functions (_format_iban, _format_number_de). I undertand you're trying to avoid duplication, but -- again -- you're writing a small script. This is the equivalent of a Bash dozen-liner, not a large package like Beancount itself.

Do small scripts need to look like bash? I'd much rather like to read _format_iban than re.sub(r'\s+', '', value, flags=re.UNICODE). Going back to your point on maintainability where you write "the ability to quickly scan what it's about is most important.": _format_iban immediately tells me what it's about. re.sub forces me to switch context. I personally find it helpful to read code that doesn't force me to switch context every few lines, regardless of the overall number of lines of code.

Reduce the extract() function

Yep, this one bugs me every time I see it. This has been on my personal list for a while now. Just haven't had any time to do it.

change the data model.

Again, something I've had in mind for a long time now. I also don't expect it to be more than a few hours, but it would definitely make things much nicer to work with. I personally don't see myself being able to make time for this in (at least) the next few weeks. Maybe I'll create a Github issue for it, in case someone else has time. Might be a good fit for people learning Python.

In any case, I'm assuming that you have this fixed locally so you don't have any urgent (relatively speaking) need to have this merged. Let me know if you do. I don't think I'll merge the hack in though.

If you have a different approach that's not a hack, feel free to bring it up and we can talk about it.

codedump commented 3 years ago

a few unnecessary personal attacks in some places [...]

I didn't have the intention of attacking, I apologize if that's how it came across.

But I think the concern of this repository ending up being a dumping ground for "quick'n'dirty" hacks is equally justified.

It's your repo. I've told you what I thought of the proposal, with comprehensive explanations of the reasoning behind my opinion. The rest is up to you. :shrug:

I'm actually not sure why you feel the need to criticize the rest of the code, outside the scope of the "category" column.

See above. And also, see below. I'm pretty sure that I've explained that in my "wall of text":

Start with the question: what would be the goal of changing that? [...but then...] you're looking at several hours of refactoring (including testing etc) for no added benefit. [...instead...] Here's what I'd do, but YMMV

In other words: I didn't think staying within the "scope of the "category" column" was useful. I explaind that, and why I thought that. And that was it.

I guess I could have just said "bad idea, it's useless" without further explanation, and leave it at that. I would've spent less time, but that would have been a useless statement. Instead I invested time to explain my reasoning.

If that's not your kind of sport, or what you wanted to hear, that's ok. I can accept that.

you don't have any urgent (relatively speaking) need to have this merged [...]

That's true, I had this running before I actually wrote the PR, and I'm actually keeping my version of the code with my other finance tooling (over at GitLab). The only reason I forked your repo here in the first place was to (help you) make it available to other people, as a way to say "thanks for the effort".

But it's your project, do as you see fit.

As far as I'm concerned, I'll probably be extracting my data from the HBCI context logs anyway in the long run. It gives me a larger amount of automatization. Also it's a relatively stable intermediate format (unlike the CSV files) that's independent on individual banks, and DiBa supports HBCI well enough.

Cheers, F.

siddhantgoel commented 3 years ago

Hey, I realized the other day that the fix was as simple as switching to csv.DictReader, so I did that and push v0.4.0 to PyPI. Please upgrade in case you're still using the CSV exports and haven't switched to HBCI yet.

codedump commented 3 years ago

You need to be careful, as I already pointed out:

Also, you have to deal with non-unique row names (like "Währung")

"Währung" for example is present once for the transaction itself and once for the account total balance. I'm not sure which one your code picks (probably the latter one by incident?).

siddhantgoel commented 3 years ago

Oh yeah good point. Somehow totally missed that. Just pushed v0.4.1 which should take care of this.

siddhantgoel / beancount-ing

fixed for current ING DiBa CSV files #71