simonmichael / hledger

Robust, fast, intuitive plain text accounting tool with CLI, TUI and web interfaces.
https://hledger.org
GNU General Public License v3.0

Input validation: do descriptions support characters like semicolon (;)? #1871

Open jfly opened 2 years ago

jfly commented 2 years ago

First off: thanks for a wonderful tool with tons of documentation! I'm just getting my feet wet with PTA, and this tool and its community resources have been great.

I'm using hledger to import CSV statements from my bank. Some of the descriptions in that CSV file contain semicolons. For example:

$ cat in.csv
2020-01-01,"a description; with a semicolon","this is a comment",50
$ cat in.csv.rules
fields date,description,comment,amount
decimal-mark ,

Note that in this example, the description is a string containing a semicolon: a description; with a semicolon, and there's a comment that has no special characters: this is a comment.

If I try to parse this with hledger, I get this:

$ hledger -f in.csv print
2020-01-01 a description; with a semicolon  ; this is a comment
    expenses:unknown              50
    income:unknown               -50

It's not obvious to the human eye unless you're very familiar with these journal files, but this journal represents something different from the CSV. The description lost its semicolon and everything after it, and the comment got prefixed with part of the description. This is easier to see if you parse the output with hledger again and print it as JSON:

$ hledger -f in.csv print | hledger -f- print -O json | jq '.[0] | with_entries(select([.key] | inside(["tcomment", "tdescription"])))'
{
  "tcomment": "with a semicolon  ; this is a comment\n",
  "tdescription": "a description"
}

I'm not sure what that newline at the end of tcomment is about, but hopefully it's clear how the description and comment got mangled.
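To make the mangling concrete, here is a minimal Python sketch that models the split seen in the JSON above. It is only an approximation of the observed behaviour, not hledger's actual parser (which I haven't read), and the function name `split_description_and_comment` is made up for illustration.

```python
def split_description_and_comment(rest_of_line: str) -> tuple[str, str]:
    """Model the observed split: text before the first ';' is the
    description, everything after it is treated as the comment."""
    description, _sep, comment = rest_of_line.partition(";")
    return description.strip(), comment.strip()


# The text that follows the date in the journal printed from in.csv.
printed = "a description; with a semicolon  ; this is a comment"
description, comment = split_description_and_comment(printed)
print(repr(description))
print(repr(comment))
```

Under this model the second half of the original description ends up at the start of the comment, which matches the tdescription and tcomment values shown above (apart from the trailing newline).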

What I'm unclear on is if it should even be possible to preserve the original description and comment. I've read hledger's description of the journal format and ledger's description as well. I don't see any mechanism for quoting the description or escaping semicolons, so I think the grammar just doesn't allow for it, but I haven't tried to read any source code to confirm.

If I'm correct that some characters (such as ;) are not allowed, then I think hledger is missing some useful validation somewhere: I'd much rather be prevented from importing the original CSV and be forced to deal with the semicolons myself (my plan right now is to preprocess my bank's CSVs to remove or replace any semicolons in the descriptions).
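A rough sketch of that preprocessing step, in Python, could look like the following. The function name `replace_semicolons`, the default column index (1, matching the description field in the rules above), and the choice of a comma as the replacement are all assumptions for this example, not anything hledger provides.

```python
import csv
import io


def replace_semicolons(csv_text: str, column: int = 1, replacement: str = ",") -> str:
    """Return csv_text with every ';' in the given column replaced.

    Assumption: the description is in column 1 (0-based), as in the
    example rules file. Other columns are left untouched.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if column < len(row):
            row[column] = row[column].replace(";", replacement)
        writer.writerow(row)
    return out.getvalue()


sample = '2020-01-01,"a description; with a semicolon","this is a comment",50\n'
cleaned = replace_semicolons(sample)
print(cleaned, end="")
```

The cleaned text can then be written to a file and imported with the same rules. The csv module takes care of re-quoting the description if the replacement character is itself a comma.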

For the record, I checked, and hledger add seems to suffer from a similar bug/lack of validation: it lets you type in a description with a semicolon, but the resulting journal treats that as a comment:

$ hledger add -f demo.journal
Adding transactions to journal file /home/jeremy/src/github.com/jfly/manmanmon/demo/demo.journal
Any command line arguments will be used as defaults.
Use tab key to complete, readline keys to edit, enter to accept defaults.
An optional (CODE) may follow transaction dates.
An optional ; COMMENT may follow descriptions or amounts.
If you make a mistake, enter < at any prompt to go one step backward.
To end a transaction, enter . when prompted.
To quit, enter . at a date prompt or press control-d or control-c.
Date [2022-06-12]:
Description: this is a description; with a semicolon
Account 1: groceries
Amount  1: 50
Account 2: bank
Amount  2 [-50]:
Account 3 (or . or enter to finish this transaction): .
2022-06-12 this is a description  ; with a semicolon
    groceries              50
    bank                  -50

Save this transaction to the journal ? [y]: y
Saved.
Starting the next transaction (. or ctrl-D/ctrl-C to quit)
Date [2022-06-12]:
$ cat demo.journal
; journal created 2022-06-12 by hledger

2022-06-12 this is a description  ; with a semicolon
    groceries              50
    bank                  -50
$ hledger -f demo.journal print -O json | jq '.[0] | with_entries(select([.key] | inside(["tcomment", "tdescription"])))'
{
  "tcomment": "with a semicolon\n",
  "tdescription": "this is a description"
}

simonmichael commented 2 years ago

Thanks for the report, and sorry that it got overlooked; I only just noticed it and haven't spent time on it yet. Any help is welcome.

simonmichael commented 2 years ago

Indeed this seems unclear. I thought we did support semicolons in descriptions, as they are quite common in the wild in my experience.

We do support semicolons in account names, at least according to https://hledger.org/1.26/hledger.html#account-comments. (I'm not really sure why, except that Ledger and hledger have probably always done so and I didn't want to make a breaking change.)

jfly commented 2 years ago

I'd be happy to help, but I'm honestly not clear on what we want to do.

Would you be open to some validations somewhere in hledger that prevent semicolons in descriptions?

simonmichael commented 2 years ago

I would probably first clarify the status quo:

jfly commented 2 years ago

Turns out I'm pretty busy right now, so I don't think I'll have time to look into this deeper anytime soon, sorry. Next time I have some available open source time, I will look into this if someone else hasn't already.

I wanted to leave one quick comment though.

  • clarify whether this is a CSV-only or more general issue

I did mention above that hledger add has a similar problem: if you try to add a description with a semicolon in it, it gets split into a description and a comment. So this doesn't feel CSV-specific to me. Does that answer your question?