nim-lang / Nim

Nim is a statically typed compiled systems programming language. It combines successful concepts from mature languages like Python, Ada and Modula. Its design focuses on efficiency, expressiveness, and elegance (in that order of priority).
https://nim-lang.org

Excess memory use for `string` with gc:arc #14811

Open c-blake opened 4 years ago

c-blake commented 4 years ago

Example

proc main() =
  var s: string
  for i in 0..100_000_000: s.add('x')
main()

See also discussion in this forum thread. The -f%M option to /usr/bin/time gets you just the max RSS used; results roughly agree with other ways to assess memory footprint. It is conceivable this behavior is just "not returning memory to the OS", not a more serious true leak hiding the memory from future Nim allocations.
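For orientation, a back-of-the-envelope calculation (my arithmetic, not from the issue): assuming the roughly 3:2 capacity-growth policy visible in the allocation traces later in this thread, the final string buffer should be on the order of 130-140 MB, so a footprint far above that is retained overhead.

```python
# Back-of-the-envelope check (assumption: ~3:2 geometric capacity growth,
# which matches the requestOsChunks sizes reported later in this thread).
n = 100_000_001              # bytes the loop appends

cap = 1
while cap < n:
    cap = max(cap + 1, cap * 3 // 2)   # assumed growth step

print(cap)          # final buffer size in bytes
print(cap // 1024)  # ... in KiB, comparable to /usr/bin/time -f%M output
```

The result is close to the 142936 kB measured with -d:useMalloc below, while the ~395000 kB gc:arc figure is far above any plausible single-buffer footprint.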

Current Output

Save as str1.nim
$ nim c --gc:arc -d:danger str1.nim
$ /usr/bin/time -f%M ./str1
395244
$ nim c --gc:none -d:danger str1.nim
$ /usr/bin/time -f%M ./str1
319508

That's right: gc:arc uses 1.24x as much memory as gc:none. Seems bad, at least optics-wise, even if somehow the memory is not internally leaked.

Expected Output

Default GC uses 264220 kB, Boehm gets 313960 kB, markAndSweep gets 319516 kB. Not sure if we can expect regular gc:arc to realize the same memory-conservative performance as with the destructors run-time, but there's this:

$ nim c --gc:arc -d:useMalloc -d:danger str1.nim
$ /usr/bin/time -f%M ./str1
142936

Possible Solution

Not sure. This could be a tricky one.

Additional Information

$ nim -v
Nim Compiler Version 1.3.5 [Linux: amd64]
Compiled at 2020-06-23
Copyright (c) 2006-2020 by Andreas Rumpf

git hash: 28510a9da9bf2a6b02590ba27b64e951a208b23d
active boot switches: -d:release -d:nimUseLinenoise -d:nativeStackTrace -d:nimHasLibFFI

I actually got similar behavior with a just-post-1.2.0 release, too. I'm not sure if this was ever any better than it is now; seems likely it was always this way, though.

As mentioned in the Forum, we should probably have some kind of fuzzy (±10%) regression test comparing realized vs. expected memory use for a few very simple programs like this one, where one can almost literally hand-calculate how much memory should be required from growth policies, etc., at least for some collectors. That would allow catching any regressions here as soon as they happen.
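A minimal sketch of what such a fuzzy check could look like (hypothetical harness, not existing Nim test code; the expected figure assumes a 3:2 growth policy with in-place growth, and measured figures would come from something like /usr/bin/time -f%M):

```python
# Hypothetical fuzzy memory-regression check.  expected_peak_kb assumes a
# 3:2 geometric growth policy with no copy overhead; measured values
# would come from e.g. `/usr/bin/time -f%M`.

def expected_peak_kb(payload_bytes: int) -> int:
    cap = 1
    while cap < payload_bytes:
        cap = max(cap + 1, cap * 3 // 2)
    return cap // 1024

def within(measured_kb: int, expected_kb: int, tol: float = 0.10) -> bool:
    return abs(measured_kb - expected_kb) <= tol * expected_kb

exp = expected_peak_kb(100_000_001)
print(within(142_936, exp))   # --gc:arc -d:useMalloc run: passes
print(within(395_244, exp))   # plain --gc:arc run: fails the check
```

Notably, even this crude model already flags the plain gc:arc number while accepting the -d:useMalloc one.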

hashbackup commented 4 years ago

I took a look at the C code for this. From what I can tell, the bigger allocations are done in alloc.nim as "BigChunks" ("HugeChunks" are over 1G). I compiled with nim c -d:danger -d:debugHeapLinks -d:memtracker str1 to enable some debugs in requestOsChunks. I guess this debug code has not been used in a while, because -d:debugHeapLinks causes some compile errors in alloc.nim. I removed the display of heapLink because I think that structure has changed from a simple list to a chunked list with 30 pointer tuples in each chunk and the debug code isn't designed to display that. I also changed from displaying result.size, which is always 0 in this case, to size, which is the requested allocation size.

Here's a run on my system (OSX 10.9 Mavericks) with --gc:arc, first with -d:useMalloc (no debugging since I guess alloc.nim isn't used), and then without -d:useMalloc, showing debug info from alloc.nim:

ms:nim jim$ nim c -d:danger -d:debugHeapLinks -d:memtracker -d:useMalloc --gc:arc str1
Hint: 10606 LOC; 0.254 sec; 10.504MiB peakmem; Dangerous Release build; proj: /Users/jim/nim/str1; out: /Users/jim/nim/str1 [SuccessX]

ms:nim jim$ /usr/bin/time -l ./str1
100000001
        0.66 real         0.60 user         0.05 sys
 146063360  maximum resident set size
     35676  page reclaims
         3  page faults
         1  voluntary context switches
         3  involuntary context switches

ms:nim jim$ nim c -d:danger -d:debugHeapLinks -d:memtracker --gc:arc str1
Hint: 12048 LOC; 0.407 sec; 13.004MiB peakmem; Dangerous Release build; proj: /Users/jim/nim/str1; out: /Users/jim/nim/str1 [SuccessX]

ms:nim jim$ /usr/bin/time -l ./str1
requestOsChunks: owner: 0x10fc33100; result: 0x10fc7f000; size: 524288
requestOsChunks: owner: 0x10fc33100; result: 0x10fd00000; size: 524288
requestOsChunks: owner: 0x10fc33100; result: 0x10fd80000; size: 524288
requestOsChunks: owner: 0x10fc33100; result: 0x10fe00000; size: 753664
requestOsChunks: owner: 0x10fc33100; result: 0x10feb8000; size: 1146880
requestOsChunks: owner: 0x10fc33100; result: 0x10ffd0000; size: 1703936
requestOsChunks: owner: 0x10fc33100; result: 0x110170000; size: 2555904
requestOsChunks: owner: 0x10fc33100; result: 0x1103e0000; size: 3801088
requestOsChunks: owner: 0x10fc33100; result: 0x110780000; size: 5767168
requestOsChunks: owner: 0x10fc33100; result: 0x110d00000; size: 8650752
requestOsChunks: owner: 0x10fc33100; result: 0x111540000; size: 12845056
requestOsChunks: owner: 0x10fc33100; result: 0x112180000; size: 19398656
requestOsChunks: owner: 0x10fc33100; result: 0x113400000; size: 28835840
requestOsChunks: owner: 0x10fc33100; result: 0x114f80000; size: 44040192
requestOsChunks: owner: 0x10fc33100; result: 0x117980000; size: 65011712
requestOsChunks: owner: 0x10fc33100; result: 0x11b780000; size: 98566144
requestOsChunks: owner: 0x10fc33100; result: 0x121580000; size: 146800640
100000001
        0.91 real         0.73 user         0.17 sys
 440188928  maximum resident set size
    107481  page reclaims
         5  page faults
         1  voluntary context switches
         4  involuntary context switches

There is a similar debug display in deallocOsPages, but apparently that isn't getting called. That might explain the excess memory use. Another possibility is that the heapLinks (with an s!) chunked list is not getting set correctly for BigChunk allocations, so the GC doesn't see them.
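A quick sanity check on that hypothesis (my arithmetic, not from the debug run): if none of the BigChunks is ever returned, the peak RSS should be close to the sum of all the requestOsChunks sizes above — and it is.

```python
# Sum of the 17 requestOsChunks sizes from the --gc:arc run above.
sizes = [524288, 524288, 524288, 753664, 1146880, 1703936, 2555904,
         3801088, 5767168, 8650752, 12845056, 19398656, 28835840,
         44040192, 65011712, 98566144, 146800640]
total = sum(sizes)
print(total)  # 441450496 bytes
# measured maximum resident set size: 440188928 bytes -- within ~0.3%,
# consistent with no BigChunk ever being handed back to the OS
```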

If I run the same test without --gc:arc and with -d:useMalloc, I get this debug output:

ms:nim jim$ nim c -d:danger -d:debugHeapLinks -d:memtracker -d:useMalloc str1
Hint: 14321 LOC; 0.247 sec; 15.957MiB peakmem; Dangerous Release build; proj: /Users/jim/nim/str1; out: /Users/jim/nim/str1 [SuccessX]

ms:nim jim$ /usr/bin/time -l ./str1
requestOsChunks: owner: 0x1021b3330; result: 0x102209000; size: 524288
requestOsChunks: owner: 0x1021b3330; result: 0x10228a000; size: 524288
requestOsChunks: owner: 0x1021b3330; result: 0x10230a000; size: 524288
requestOsChunks: owner: 0x1021b3330; result: 0x10238a000; size: 589824
requestOsChunks: owner: 0x1021b3330; result: 0x10241a000; size: 884736
requestOsChunks: owner: 0x1021b3330; result: 0x1024f2000; size: 1310720
requestOsChunks: owner: 0x1021b3330; result: 0x102632000; size: 1966080
requestOsChunks: owner: 0x1021b3330; result: 0x102812000; size: 2949120
requestOsChunks: owner: 0x1021b3330; result: 0x102ae2000; size: 4456448
requestOsChunks: owner: 0x1021b3330; result: 0x102f22000; size: 6684672
requestOsChunks: owner: 0x1021b3330; result: 0x103582000; size: 9961472
requestOsChunks: owner: 0x1021b3330; result: 0x103f02000; size: 14942208
requestOsChunks: owner: 0x1021b3330; result: 0x104d42000; size: 22544384
requestOsChunks: owner: 0x1021b3330; result: 0x1062c2000; size: 33554432
requestOsChunks: owner: 0x1021b3330; result: 0x1082c2000; size: 50331648
requestOsChunks: owner: 0x1021b3330; result: 0x10b2c2000; size: 75497472
requestOsChunks: owner: 0x1021b3330; result: 0x10fac2000; size: 113246208
100000001
        0.56 real         0.44 user         0.10 sys
 326635520  maximum resident set size
     79753  page reclaims
        10  page faults
         1  voluntary context switches
         8  involuntary context switches

This seems strange, because I thought with -d:useMalloc I wouldn't see debugs from alloc.nim, like in the first run.

Edit: the other thing I just noticed is that with -d:useMalloc the allocations are less aggressive even though there are the same number of them, and performance for this test is significantly better with -d:useMalloc.

hashbackup commented 4 years ago

Looking at the numbers a bit more, -d:useMalloc uses roughly the same 3:2 allocation strategy, but its 4th allocation is much smaller than in the plain alloc.nim run, changing the allocation curve quite a bit.
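The ratios can be read straight off the two traces above (my arithmetic): both settle near 3:2, but the differing 4th request (753664 vs 589824 bytes) offsets every subsequent size.

```python
# First several BigChunk sizes from the two runs above.
arc        = [524288, 524288, 524288, 753664, 1146880, 1703936, 2555904]
use_malloc = [524288, 524288, 524288, 589824, 884736, 1310720, 1966080]

for name, seq in (("arc", arc), ("useMalloc", use_malloc)):
    # growth ratios after the initial fixed 512 KiB chunks
    ratios = [round(b / a, 3) for a, b in zip(seq[3:], seq[4:])]
    print(name, ratios)
```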

alaviss commented 4 years ago

This issue is a direct consequence of #13203. What happened, basically, is that the allocator doesn't support realloc, so it emulates it by allocating a new block, copying the data over, and freeing the old block afterwards.

However, as this seq started out small, the old data blocks don't seem to have accumulated into a HugeChunk, which causes them not to be freed but instead cast aside into the memory reuse pool. How efficient this pool is, I don't know.

A fix right now would probably be to actually implement realloc.
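A toy model (my own, not from the thread) of why the emulation matters at peak: during allocate-copy-free both buffers are live, so peak memory is old+new (about 5/3 of the final size under 3:2 growth), whereas an in-place realloc only ever needs the new size. And if the freed old blocks are additionally never returned to the OS, the footprint degrades further toward the sum of all blocks, which is what this issue observes.

```python
# Toy peak-memory model: emulated realloc (alloc new + copy + free old)
# vs. an ideal in-place realloc, both under 3:2 growth.

def peak_bytes(payload: int, emulated: bool) -> int:
    cap, peak = 1, 1
    while cap < payload:
        new = max(cap + 1, cap * 3 // 2)
        # emulated realloc keeps old and new alive during the copy
        peak = max(peak, cap + new if emulated else new)
        cap = new
    return peak

p_emu = peak_bytes(100_000_001, emulated=True)
p_ideal = peak_bytes(100_000_001, emulated=False)
print(p_emu, p_ideal, round(p_emu / p_ideal, 2))  # ratio ~1.67
```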

IIRC the algorithm behind alloc.nim is TLSF, if anyone wanted to give it a try

Araq commented 4 years ago

IIRC the algorithm behind alloc.nim is TLSF, if anyone wanted to give it a try

Correct. And also we should return unused large blocks back to the OS. In the past this was difficult to do, with ARC/ORC it's not a problem anymore.

c-blake commented 4 years ago

Hmm. I see. Whoever does this code may also want to ensure its structure is friendly to some when defined(linux): branch that uses the super-fast mremap syscall for chunks larger than 8 kB or some such limit. (But obviously making it work right for all OSes is the priority.)
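For illustration only (Linux-specific, and unrelated to Nim's allocator internals): CPython exposes the same trick, since mmap.resize() on an anonymous mapping is implemented with mremap(2), growing the mapping without copying its pages byte by byte.

```python
import mmap

# Anonymous 8 KiB mapping, then grow it to 16 MiB.  On Linux, resize()
# uses mremap(2), so the kernel remaps pages rather than copying them,
# and the existing contents are preserved.
m = mmap.mmap(-1, 8 * 1024)
m[:5] = b"hello"
m.resize(16 * 1024 * 1024)
print(len(m), m[:5])
m.close()
```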

alaviss commented 4 years ago

Here's something that might help: https://github.com/mattconte/tlsf

This is a BSD-licensed implementation of TLSF, and it does appear to support realloc.