lpeterse / haskell-socket

A Haskell binding to the POSIX sockets API
MIT License
47 stars 10 forks source link

Increased memory consumption with socket-0.8.0.1 compared to 0.8.x.x branch #48

Open SX91 opened 7 years ago

SX91 commented 7 years ago

My program consumed upto 16Mb compiled with socket lib from branch 0.8.x.x. Now it uses 25Mb (or even more). Compiled with GHC-8.0.2, LTS-8.3.

Heap profile for socket-0.8.x.x: socket-heap-0 8 x x

Heap profile for socket-0.8.0.1: socket-heap-0 8 0 1

Sorry, no test case this time.

lpeterse commented 7 years ago

Mhh, interesting. Could your recreate the heap profiles for these 3 commits?

fafb1926b0c5c602ed90dc494158d6c46ce336a7 (mallocBytes) 8f341b1a51a7d1300a5e12b54c880a86c9e181c2 (mallocPlainForeignPtr) bade04065eaadc1a701d2b67cd928ec289226fd9 (ByteString.createUptoN)

@VyacheslavHashov proposed to change the memory allocation mechanism and I refactored it afterwards. It's very likely the culprit lies there.

Apart from that: The charts are very spiky. Is it possible that its just the GC that runs less often? Did you notice a change in performance in your application?

VyacheslavHashov commented 7 years ago

mallocBytes is just a wrapper around C malloc function that allocates memory off-heap. Consequently, this memory is not shown in the heap profile.

lpeterse commented 7 years ago

@VyacheslavHashov You're absolutely right. Probably not an issue then.

SX91 commented 7 years ago

It doesn't show up in heap profile, but as I said, RES memory of my program increases from 16Mb to 25Mb just by compiling it with newer socket version.

Probably that's due to malloc allocated memory gets released as soon as possible, unlike heap allocated (pinned) memory. But well, 56% RES memory increase is a lot, isn't it?

SX91 commented 7 years ago

socket-0.8.0.1:

  19,008,660,936 bytes allocated in the heap
   1,543,782,704 bytes copied during GC
       9,153,928 bytes maximum residency (952 sample(s))
         678,984 bytes maximum slop
              25 MB total memory in use (3 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     21683 colls, 21683 par    6.631s   2.060s     0.0001s    0.0115s
  Gen  1       952 colls,   951 par    3.720s   1.052s     0.0011s    0.0038s

  Parallel GC work balance: 12.06% (serial 0%, perfect 100%)

  TASKS: 17 (1 bound, 16 peak workers (16 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.003s elapsed)
  MUT     time   41.177s  (177.385s elapsed)
  GC      time    8.422s  (  2.607s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    1.929s  (  0.506s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time   51.533s  (179.995s elapsed)

  Alloc rate    461,627,435 bytes per MUT second

  Productivity  79.9% of total user, 98.3% of total elapsed

gc_alloc_block_sync: 106881
whitehole_spin: 0
gen[0].sync: 6058
gen[1].sync: 35403

socket-0 8 0 1

socket-0.8.x.x:

  17,841,885,664 bytes allocated in the heap
   1,600,537,528 bytes copied during GC
       2,891,128 bytes maximum residency (1034 sample(s))
         130,080 bytes maximum slop
              11 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     21484 colls, 21484 par    6.608s   2.069s     0.0001s    0.0016s
  Gen  1      1034 colls,  1033 par    3.910s   1.138s     0.0011s    0.0036s

  Parallel GC work balance: 12.09% (serial 0%, perfect 100%)

  TASKS: 18 (1 bound, 16 peak workers (17 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time   41.299s  (177.306s elapsed)
  GC      time    8.537s  (  2.687s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    1.981s  (  0.520s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time   51.821s  (179.996s elapsed)

  Alloc rate    432,017,544 bytes per MUT second

  Productivity  79.7% of total user, 98.2% of total elapsed

gc_alloc_block_sync: 116632
whitehole_spin: 0
gen[0].sync: 4744
gen[1].sync: 30230

socket-0 8 x x

qnikst commented 7 years ago

(Sorry for jumping in again) Hello @SX91, thanks for heads-up!

This is looks suspicious, but it would be nice if you provide your program or more information about this issue, otherwise it's easy to get false assumptions. Could you kindly provide us information about the system-view on the resource usage? You can get it by running \time -v ./yourprogram. \ is needed so you run gnu time and not bash builtin (in gentoo package sys-process/time homepage is https://www.gnu.org/directory/time.html I hope it will help you to find the program). Also I see that GC time grow much, and this is a terrible sing, however, afaiu you run profiled version, can you build package without profiling support and run it with +RTS -s and provide GC stats. As an option can you show your program or minimal example so we could analyze it.

Thanks!

SX91 commented 7 years ago

socket-0.8.0.1:

❯ timeout -s INT 180 /usr/bin/time myprog +RTS -maxN2 -s -c -w -qa -qi3 -A2m -M150m -K1k
   9,669,582,064 bytes allocated in the heap
     397,211,104 bytes copied during GC
       8,057,824 bytes maximum residency (478 sample(s))
         770,592 bytes maximum slop
              28 MB total memory in use (6 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2719 colls,     0 par    0.877s   0.947s     0.0003s    0.0013s
  Gen  1       478 colls,     0 par    0.820s   0.843s     0.0018s    0.0054s

  Parallel GC work balance: -nan% (serial 0%, perfect 100%)

  TASKS: 9 (1 bound, 8 peak workers (8 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    7.138s  (178.194s elapsed)
  GC      time    1.696s  (  1.790s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    8.836s  (179.986s elapsed)

  Alloc rate    1,354,579,160 bytes per MUT second

  Productivity  80.8% of total user, 99.0% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
8.83user 2.23system 3:00.00elapsed 6%CPU (0avgtext+0avgdata 31912maxresident)k
328inputs+0outputs (3major+9151minor)pagefaults 0swaps

socket-0.8.x.x:

❯ timeout -s INT 180 /usr/bin/time myprog +RTS -maxN2 -s -c -w -qa -qi3 -A2m -M150m -K1k
   8,245,124,384 bytes allocated in the heap
     391,462,176 bytes copied during GC
       1,651,208 bytes maximum residency (431 sample(s))
          85,536 bytes maximum slop
              12 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2313 colls,     0 par    0.871s   0.927s     0.0004s    0.0017s
  Gen  1       431 colls,     0 par    0.756s   0.778s     0.0018s    0.0032s

  Parallel GC work balance: -nan% (serial 0%, perfect 100%)

  TASKS: 9 (1 bound, 7 peak workers (8 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    6.957s  (178.118s elapsed)
  GC      time    1.627s  (  1.705s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    8.589s  (179.824s elapsed)

  Alloc rate    1,185,078,666 bytes per MUT second

  Productivity  81.0% of total user, 99.1% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
8.58user 2.13system 3:00.00elapsed 5%CPU (0avgtext+0avgdata 21220maxresident)k
5864inputs+0outputs (25major+10422minor)pagefaults 0swaps

Hello @qnikst! I'm sorry, but now I don't have time to look deeper in this issue or write some test app to illustrate the problem.

Let's say I use UDP socket and use receiveFrom func to repeatedly get a packet with 65Kb (or 1.5Kb) buffer size (packet could be a lot shorter). There're N greenthreads (each with it's own socket) running this concurrently. Something like that.

qnikst commented 7 years ago

Sorry it was my error can you give output of /usr/bin/time -v it return maximum memory being used by the program, this is needed to cross verify RTS results; feel free to copy only relevant things, there will be a lot of info in output. I see that total memory in use grows higher at least from RTS point of view, however now GC time looks comparable, so previously it was largely affected by the profiling build.

SX91 commented 7 years ago

@qnikst okay, please look at https://github.com/SX91/haskell-socket-test/tree/0.8.0.1-memory.

0.8.x.x:

timeout -s INT 180 /usr/bin/time -v test-socket-0.8.x.x 30 +RTS -s -N2           
     444,191,528 bytes allocated in the heap
      16,865,944 bytes copied during GC
       1,114,056 bytes maximum residency (2 sample(s))
          58,680 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       553 colls,   553 par    0.127s   0.115s     0.0002s    0.0005s
  Gen  1         2 colls,     1 par    0.000s   0.001s     0.0003s    0.0003s

  Parallel GC work balance: 10.50% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 8 peak workers (9 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    5.283s  (179.882s elapsed)
  GC      time    0.127s  (  0.116s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time    5.411s  (180.000s elapsed)

  Alloc rate    84,081,320 bytes per MUT second

  Productivity  97.6% of total user, 99.9% of total elapsed

gc_alloc_block_sync: 111
whitehole_spin: 0
gen[0].sync: 3
gen[1].sync: 0
Command terminated by signal 2
    Command being timed: "test-socket-0.8.x.x 30 +RTS -s -N2"
    User time (seconds): 5.41
    System time (seconds): 7.11
    Percent of CPU this job got: 6%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 3:00.00
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 6376
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 23426
    Voluntary context switches: 253099
    Involuntary context switches: 4169
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

0.8.0.1:

timeout -s INT 180 /usr/bin/time -v test-socket-0.8.0.1 30 +RTS -s -N2
   5,101,627,824 bytes allocated in the heap
      81,955,288 bytes copied during GC
       1,300,008 bytes maximum residency (12 sample(s))
          66,496 bytes maximum slop
               9 MB total memory in use (4 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      8905 colls,  8905 par    0.552s   0.548s     0.0001s    0.0002s
  Gen  1        12 colls,    11 par    0.003s   0.003s     0.0002s    0.0003s

  Parallel GC work balance: 38.25% (serial 0%, perfect 100%)

  TASKS: 9 (1 bound, 8 peak workers (8 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    5.956s  (179.446s elapsed)
  GC      time    0.556s  (  0.551s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    6.512s  (179.999s elapsed)

  Alloc rate    856,576,132 bytes per MUT second

  Productivity  91.5% of total user, 99.7% of total elapsed

gc_alloc_block_sync: 7132
whitehole_spin: 0
gen[0].sync: 428
gen[1].sync: 8
Command terminated by signal 2
    Command being timed: "test-socket-0.8.0.1 30 +RTS -s -N2"
    User time (seconds): 6.51
    System time (seconds): 7.58
    Percent of CPU this job got: 7%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 3:00.00
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 9876
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 1834
    Voluntary context switches: 261833
    Involuntary context switches: 4330
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0