simonmichael / hledger

Robust, fast, intuitive plain text accounting tool with CLI, TUI and web interfaces.
https://hledger.org
GNU General Public License v3.0

hledger uses too much memory with large files [1mo] #275

Open MarkusTeufelberger opened 9 years ago

MarkusTeufelberger commented 9 years ago

I have a ledger of about 40 MB worth of files; hledger 0.26 now suddenly chokes on them, while the previous version worked fine (well, it also took waaay too long to calculate balances, but at least it worked):

hledger bal -f ./output/full.ledger
hledger: out of memory

full.ledger just imports a couple of other files (I have a collection of scripts that generate them from CSV files) that total 37 MB. My machine has 16 GB of RAM, so available memory is unlikely to be the problem.
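
For illustration, full.ledger looks roughly like this (the file names are invented; include is hledger's real directive):

; full.ledger - generated, just pulls in the other generated files
include ./paypal_com.ledger
include ./bank_account.ledger
include ./credit_card.ledger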

simonmichael commented 9 years ago

Very interesting, thanks for the report. (I'd be interested to see as much of the stats output as you'd care to show for that..)

MarkusTeufelberger commented 9 years ago

Well, the stats command runs out of memory too... maybe I'll find a way to build a 64-bit binary on Windows.

On Linux I get the following error (same for stats and bal):

[me@mycomputer foldername]$ ~/.cabal/bin/hledger bal -f ./output/full.ledger
hledger: /path/to/foldername/./output/paypal_com.ledger: hGetContents: invalid argument (invalid byte sequence)

I suspect there is some issue with Unicode on Linux (the file contains some Japanese characters as a PayPal username), not necessarily related to the OOM issue. Maybe my script is not properly flagging the file encoding?

After manually saving the file as UTF-8 in a text editor it worked (both stats and bal), though it takes a LONG time (a couple of seconds at least) to produce the output:

[me@mycomputer foldername]$ ~/.cabal/bin/hledger stats -f ./output/full.ledger 
Main journal file        : /path/to/foldername/./output/full.ledger
Included journal files   : /path/to/foldername/./output/XXX.ledger           
                           /path/to/foldername/./output/YYY.ledger
                           etc. etc. etc.
Transactions span        : 2000-01-01 to 2015-07-22 (5681 days)
Last transaction         : 2015-07-21 (1 days ago)
Transactions             : 295923 (52.1 per day)
Transactions last 30 days: 176881 (5896.0 per day)
Transactions last 7 days : 38687 (5526.7 per day)
Payees/descriptions      : 411
Accounts                 : 218 (depth 5)
Commodities              : 8 (AAA, BBB, CCC, DDD, EEE, FFF, GGG, HHH)

A few things are censored, but the numbers are accurate. Real transactions don't really start in 2000, more like in 2010 or so - the few in 2000 move "Equity" amounts into place to set correct opening balances. Most of this is auto-generated from CSV files, and some of them are relatively verbose/microtransaction-heavy.

simonmichael commented 9 years ago

A quarter million transactions - very cool! Now I'm motivated to test with some larger data sets.

The unicode thing can probably be fixed by setting LANG to an appropriate locale (cf. http://hledger.org/manual.html#run-time-problems). GHC-compiled software tends to fail this way when it encounters non-ASCII data and the system locale is not set. We should give a better error message (#73).
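
For example (the locale name is illustrative - any installed UTF-8 locale should do):

$ LANG=en_US.UTF-8 hledger bal -f ./output/full.ledger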

simonmichael commented 9 years ago

Hmm. I didn't see this in pre-release testing, and I don't see it now with the simple test journals. Time is the same:

$ time tools/simplebench -n1 hledger-0.24 hledger-0.25 hledger-0.26
Using bench.tests
Running 6 tests 1 times with 3 executables at 2015-07-22 13:53:31.42249 UTC:

Summary (best iteration):

+------------------------------------++--------------+--------------+--------------+
|                                    || hledger-0.24 | hledger-0.25 | hledger-0.26 |
+====================================++==============+==============+==============+
| bal -f data/100x100x10.journal     ||         0.04 |         0.04 |         0.04 |
| bal -f data/1000x1000x10.journal   ||         0.21 |         0.25 |         0.24 |
| bal -f data/1000x10000x10.journal  ||         0.31 |         0.32 |         0.35 |
| bal -f data/10000x1000x10.journal  ||         1.76 |         1.72 |         1.85 |
| bal -f data/10000x10000x10.journal ||         3.18 |         3.38 |         3.58 |
| bal -f data/100000x1000x10.journal ||        18.69 |        20.74 |        19.79 |
+------------------------------------++--------------+--------------+--------------+

real    1m16.674s
user    1m10.822s
sys 0m3.867s

and more importantly, the heap profile looks the same:

[heap profile images attached: hledger 0.25.1 and hledger 0.26]

What special features does your data have? Maybe an unusually high number of aliases, accounts, includes, or something else? Would it be possible to simplify it until you see the problem go away?

simonmichael commented 9 years ago

Actually, I meant to show profiles with 100k transactions, not 10k. But the main point is I see no change since 0.25, so I'm not sure why you would see it suddenly break with 0.26, unless it's borderline.

Here's one with 100k transactions: [heap profile image attached: hledger 0.26, 100k transactions]. I'm still learning about heap profiles - this one goes up to 600M, but I observed the process using 2.1G in activity monitor. So indeed, it might be that your 32-bit build is hitting its 2G limit.

MarkusTeufelberger commented 9 years ago

Maybe it was/is borderline? After all, most transactions were added recently, and that might have tipped it over the edge no matter what. hledger.exe uses about 2 GiB of RAM and then dies OOM. I highly suspect it is compiled as a 64-bit binary on Linux...

There are about a dozen files altogether (so about a dozen includes), and nearly all transactions are very simple (I don't duplicate amounts, so I generally have only one line with an amount, and the second line has to auto-balance). As you can see there are a few hundred accounts, and a few of the transactions (e.g. PayPal ones) have payment IDs added as tags ("date (payment_id) description").
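
An invented example of what one of these transactions looks like (the payment ID and accounts are made up; the second posting has no amount, so hledger infers it):

2015-07-21 (1AB23456CD789012E) PayPal purchase
    expenses:stuff      12.34 EUR
    assets:paypal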

While it might be possible to simplify it (e.g. process data for one day, write it out, let (h)ledger/beancount process it, and store only a single total transaction instead of one transaction per line), this would make the process much slower and the resulting file potentially more error-prone and more incorrect. I'd prefer to just take a CSV file, run my parser, and have a text file ready for *ledger without adding "intelligence".

By the way, using >2 GB of memory on a <40 MB input consisting only of dates, amounts, descriptions and accounts seems a bit excessive...?

simonmichael commented 9 years ago

Yes, that seems to be it. hledger installed with the 32-bit Windows version of stack dies with "out of memory" as soon as it gets near 2G. hledger installed with the 64-bit Windows stack works, though (at least it has been using 3G on this 4G machine for 20m on a million-transaction file, and I assume it would finish eventually). I recommend installing that way for now.

And yes, it's using much more memory than it should. We haven't paid attention to that until now, thanks for highlighting it.

simonmichael commented 9 years ago

I tried strictifying and unpacking fields in the main types (at last), just to see if there's any cheap improvement, but not really. (It's a small improvement, worth keeping I think, even though I don't know what other effects this change could have):

[heap profile images attached: strictified and unpacked builds of hledger 0.26]
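
For the record, a minimal sketch of what this kind of change looks like (illustrative types, not hledger's actual definitions):

import Data.Text (Text)

newtype Quantity = Quantity Double  -- hypothetical stand-in

-- Lazy fields: each is a pointer, possibly to an unevaluated thunk.
data PostingLazy = PostingLazy
  { plAccount :: Text
  , plAmount  :: Quantity
  }

-- Strict fields (!) are forced when the record is constructed;
-- {-# UNPACK #-} additionally inlines the field's payload into the
-- record, saving a pointer and a heap object per field.
data PostingStrict = PostingStrict
  { psAccount :: !Text
  , psAmount  :: {-# UNPACK #-} !Quantity
  }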

simonmichael commented 8 years ago

I have my suspicions about this loop, if anyone would care to take a look.

We parse the journal in a strange way, I think just for historical reasons: as a sequence of journal updates which we then apply one after the other. I think instead we should parse to some kind of AST, which among other things should make debugging the parser much easier.
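
Roughly, the current shape versus the proposed one (a sketch with simplified types, not the actual code):

import Data.List (foldl')

data Journal = Journal  -- stand-in for the real Journal type
data Item    = Item     -- a parsed directive or transaction

-- Current style: each parsed thing becomes a Journal -> Journal
-- update, and the updates are applied one after the other.
type JournalUpdate = Journal -> Journal

applyUpdates :: Journal -> [JournalUpdate] -> Journal
applyUpdates = foldl' (flip ($))

-- Proposed style: parse to plain data first (an AST), then build the
-- journal from it in one place, which is easier to inspect and debug.
buildJournal :: [Item] -> Journal
buildJournal _items = Journal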

simonmichael commented 8 years ago

Notes from #haskell:

<sm> I think I have a space leak of some kind here, can you tell ? http://lpaste.net/142800
<sm> heap profile at https://github.com/simonmichael/hledger/issues/275#issuecomment-147257314
<Gurkenglas> sm, would you try "ala Endo foldMap" instead of "foldl' (\acc new x -> new (acc x)) id" and see if that's better?
<sm> Gurkenglas: how's that ?                                   [17:12]
<Gurkenglas> "ala Endo foldMap" means "runEndo . foldMap Endo"  [17:13]
<sm> hmm I'll look at that, thanks
<lispy> sm: foldl' can only be strict if the function you give it is strict
<lispy> sm: And I suspect that's not the case here
<sm> lispy: probably not                                        [17:15]
<lispy> looking at the github issue now                         [17:16]
<Gurkenglas> (What I want to say is 'also consider "combineJournalUpdates = execStateT . traverse modifyT" with strict StateT', but modifyT isn't in the libraries.)
<Gurkenglas> I hereby submit a pull request request to get modifyT and getsT added to them, which I haven't done myself yet because there are two libraries that we might want to add it to in some order and I$
<Gurkenglas> (Afterwards, isos for modifyT and getsT and modify and gets so we can have "combineJournalUpdates = au _ModifiedT traverse")
<Gurkenglas> Although I guess the one for modify is already provided by Endo
<lispy> sm: have you tried making combineJournalUpdates lazier?
<lispy> sm: I don't know if sequence returns results lazily or not in this case, but it might be building the whole list ( [Journal -> Journal] ) and feeding that to foldl' at once.
<lispy> sm: also, can you upload your profiling data here: http://heap.ezyang.com/
<sm> lispy: I haven't, yet. All of these comments will be quite helpful when I next work on it. I'll do that and report back
<sm> lispy, Gurkenglas: thanks!
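
A sketch of the two composition styles discussed above (the combineJournalUpdates type is simplified here - the real one involves an error monad):

import Data.List (foldl')
import Data.Monoid (Endo(..))

data Journal = Journal
type JournalUpdate = Journal -> Journal

-- the current definition, as pasted above
combineJournalUpdates :: [JournalUpdate] -> JournalUpdate
combineJournalUpdates = foldl' (\acc new x -> new (acc x)) id

-- Gurkenglas's suggestion spelled out: Endo's (<>) is function
-- composition, so foldMap composes the whole list in one pass.
-- Endo composes right-to-left, so the list is reversed here to keep
-- the same application order as the foldl' version.
combineJournalUpdates' :: [JournalUpdate] -> JournalUpdate
combineJournalUpdates' = appEndo . foldMap Endo . reverse
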
MarkusTeufelberger commented 8 years ago

Any news on this? 64-bit hledger 0.26 meanwhile also fails to parse a meager ~65 MiB ledger file with 16 GB of RAM...

simonmichael commented 8 years ago

No news. How many transactions in that file ?

MarkusTeufelberger commented 8 years ago

A bit more than half a million (about 520k).

simonmichael commented 8 years ago

Thanks for the data point, and I'm sorry I don't have more time to work on it right now - all help welcome.

MarkusTeufelberger commented 8 years ago

I'd give it a try myself, but Haskell is unfortunately not a language I have any experience in. :-(

It is a mystery to me why it scales so poorly and wastes that much memory. Isn't that an issue with smaller input files too (getting the balance of ~1100 transactions already takes a noticeable amount of time, though less than a second), or are typical computers just too fast for it to register?

simonmichael commented 8 years ago

It's simply using the non-optimal data types and algorithms that were easiest for me to get working. (I won't say specifically what's to blame, since I don't know for sure.) Nowadays we have Text instead of String, newer parser libraries (there's an unfinished port to megaparsec in the PR queue), better profiling tools, etc., but fixing it is a non-trivial time investment.

It hasn't been top of my queue because I split files by year, so my typical queries are against < 1k transactions and run in less than a second on a macbook. I spend far more time entering data, creating reports, staring at the results, searching in the editor etc. than waiting for reports to complete. I wasn't even particularly aware of it until somebody (you) reported it as a problem (thanks!). It is definitely a goal to see how much more efficient we can get on large data sets.

simonmichael commented 8 years ago

hledger now mostly uses Text instead of String, which has reduced memory usage by ~30% with large files. For the record I think there is no space leak here, rather it's that our nice intuitive in-memory representation of the journal and transactions is quite large and redundant.
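
(A rough illustration of why: String is [Char], a lazy linked list with a cons cell and a boxed Char per character - several machine words each on 64-bit GHC - while Text packs the characters into one contiguous array.)

import qualified Data.Text as T

-- String: one cons cell + one boxed Char per character.
description :: String
description = "grocery store | weekly shopping"

-- Text: characters packed into a single array, a few bytes each
-- plus a small constant overhead per value.
description' :: T.Text
description' = T.pack description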

simonmichael commented 8 years ago

Latest heap profile. More allocation, but lower maximum residency. (And slower according to this chart, but benchmarks say it's actually a little faster. Odd.) [heap profile image attached: hledger f3bf98b, 2016-05-26 21:49]

simonmichael commented 1 year ago

7 years later: some current measurements, on a 16GB MacBook Air M1, using GHC 9.4.5:

$ hledger-1.30.1 --version
hledger 1.30.1, mac-aarch64
$ make samplejournals
make: Nothing to be done for 'samplejournals'.

10K transactions

1s run time, 35MB RTS max residency, 125MB Real Mem observed in mac activity monitor / top.

$ hledger-1.30.1 -f examples/10000x1000x10.journal stats +RTS -s
Main file                : /Users/simon/src/hledger/examples/10000x1000x10.journal
Included files           : 
Transactions span        : 2000-01-01 to 2027-05-19 (10000 days)
Last transaction         : 2027-05-18 (1428 days from now)
Transactions             : 10000 (1.0 per day)
Transactions last 30 days: 31 (1.0 per day)
Transactions last 7 days : 8 (1.1 per day)
Payees/descriptions      : 10000
Accounts                 : 1000 (depth 10)
Commodities              : 26 (A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z)
Market prices            : 10000 (A)

Run time (throughput)    : 0.86s (11650 txns/s)
   3,643,282,760 bytes allocated in the heap
     216,941,448 bytes copied during GC
      34,589,680 bytes maximum residency (9 sample(s))
         883,416 bytes maximum slop
              97 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       873 colls,     0 par    0.078s   0.084s     0.0001s    0.0023s
  Gen  1         9 colls,     0 par    0.056s   0.167s     0.0186s    0.1067s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.009s elapsed)
  MUT     time    0.559s  (  0.610s elapsed)
  GC      time    0.134s  (  0.251s elapsed)
  EXIT    time    0.000s  (  0.006s elapsed)
  Total   time    0.693s  (  0.876s elapsed)

  Alloc rate    6,515,076,250 bytes per MUT second

  Productivity  80.7% of total user, 69.6% of total elapsed

100K transactions

7s, 260MB, 760MB.

$ hledger-1.30.1 -f examples/100000x1000x10.journal stats +RTS -s
Main file                : /Users/simon/src/hledger/examples/100000x1000x10.journal
Included files           : 
Transactions span        : 2000-01-01 to 2273-10-16 (100000 days)
Last transaction         : 2273-10-15 (91428 days from now)
Transactions             : 100000 (1.0 per day)
Transactions last 30 days: 31 (1.0 per day)
Transactions last 7 days : 8 (1.1 per day)
Payees/descriptions      : 100000
Accounts                 : 1000 (depth 10)
Commodities              : 26 (A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z)
Market prices            : 100000 (A)

Run time (throughput)    : 6.58s (15208 txns/s)
  34,563,365,168 bytes allocated in the heap
   2,198,554,992 bytes copied during GC
     257,165,456 bytes maximum residency (10 sample(s))
       6,539,120 bytes maximum slop
             735 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      8361 colls,     0 par    0.834s   0.858s     0.0001s    0.0042s
  Gen  1        10 colls,     0 par    0.449s   0.527s     0.0527s    0.1591s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    5.129s  (  5.200s elapsed)
  GC      time    1.283s  (  1.385s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time    6.412s  (  6.593s elapsed)

  Alloc rate    6,738,312,460 bytes per MUT second

  Productivity  80.0% of total user, 78.9% of total elapsed

1M transactions

80s, 2.5G, 5G.

$ hledger-1.30.1 -f examples/1000000x1000x10.journal stats +RTS -s
Main file                : /Users/simon/src/hledger/examples/1000000x1000x10.journal
Included files           : 
Transactions span        : 2000-01-01 to 4737-11-28 (1000000 days)
Last transaction         : 4737-11-27 (991428 days from now)
Transactions             : 1000000 (1.0 per day)
Transactions last 30 days: 31 (1.0 per day)
Transactions last 7 days : 8 (1.1 per day)
Payees/descriptions      : 1000000
Accounts                 : 1000 (depth 10)
Commodities              : 26 (A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z)
Market prices            : 1000000 (A)

Run time (throughput)    : 80.43s (12433 txns/s)
 344,308,904,560 bytes allocated in the heap
  22,690,180,960 bytes copied during GC
   2,571,078,552 bytes maximum residency (16 sample(s))
     162,602,168 bytes maximum slop
            7230 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     83315 colls,     0 par    9.116s  11.044s     0.0001s    0.0246s
  Gen  1        16 colls,     0 par    4.794s  11.327s     0.7079s    3.7722s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time   51.557s  ( 58.484s elapsed)
  GC      time   13.910s  ( 22.370s elapsed)
  EXIT    time    0.000s  (  0.009s elapsed)
  Total   time   65.467s  ( 80.865s elapsed)

  Alloc rate    6,678,243,111 bytes per MUT second

  Productivity  78.8% of total user, 72.3% of total elapsed

simonmichael commented 1 year ago

So roughly speaking, hledger memory use scales linearly with the number of transactions being processed, with heap usage reaching about 5KB per transaction. Eg, for ten thousand transactions, ~50 megabytes of heap; for one million transactions, ~5 gigabytes of heap.
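
As a rough cross-check against the original report (an estimate, not a measurement): 295,923 transactions x ~5KB ≈ 1.5GB of heap, consistent with the 32-bit Windows build dying as it approached its 2GB address-space limit.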

If you build hledger with --ghc-options=-rtsopts and run it with eg +RTS -M3G to limit the heap usage, it runs for a long time and eventually fails with "Heap exhausted".
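
For example, capping the heap at 3GB on one of the sample journals above (illustrative):

$ hledger -f examples/1000000x1000x10.journal stats +RTS -M3G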