Open SX91 opened 7 years ago
Mhh, interesting. Could your recreate the heap profiles for these 3 commits?
fafb1926b0c5c602ed90dc494158d6c46ce336a7
(mallocBytes
)
8f341b1a51a7d1300a5e12b54c880a86c9e181c2
(mallocPlainForeignPtr
)
bade04065eaadc1a701d2b67cd928ec289226fd9
(ByteString.createUptoN
)
@VyacheslavHashov proposed to change the memory allocation mechanism and I refactored it afterwards. It's very likely the culprit lies there.
Apart from that: The charts are very spiky. Is it possible that its just the GC that runs less often? Did you notice a change in performance in your application?
mallocBytes
is just a wrapper around C malloc function that allocates memory off-heap. Consequently, this memory is not shown in the heap profile.
@VyacheslavHashov You're absolutely right. Probably not an issue then.
It doesn't show up in heap profile, but as I said, RES memory of my program increases from 16Mb to 25Mb just by compiling it with newer socket
version.
Probably that's due to malloc
allocated memory gets released as soon as possible, unlike heap allocated (pinned) memory. But well, 56% RES memory increase is a lot, isn't it?
socket-0.8.0.1
:
19,008,660,936 bytes allocated in the heap
1,543,782,704 bytes copied during GC
9,153,928 bytes maximum residency (952 sample(s))
678,984 bytes maximum slop
25 MB total memory in use (3 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 21683 colls, 21683 par 6.631s 2.060s 0.0001s 0.0115s
Gen 1 952 colls, 951 par 3.720s 1.052s 0.0011s 0.0038s
Parallel GC work balance: 12.06% (serial 0%, perfect 100%)
TASKS: 17 (1 bound, 16 peak workers (16 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.003s elapsed)
MUT time 41.177s (177.385s elapsed)
GC time 8.422s ( 2.607s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 1.929s ( 0.506s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 51.533s (179.995s elapsed)
Alloc rate 461,627,435 bytes per MUT second
Productivity 79.9% of total user, 98.3% of total elapsed
gc_alloc_block_sync: 106881
whitehole_spin: 0
gen[0].sync: 6058
gen[1].sync: 35403
socket-0.8.x.x
:
17,841,885,664 bytes allocated in the heap
1,600,537,528 bytes copied during GC
2,891,128 bytes maximum residency (1034 sample(s))
130,080 bytes maximum slop
11 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 21484 colls, 21484 par 6.608s 2.069s 0.0001s 0.0016s
Gen 1 1034 colls, 1033 par 3.910s 1.138s 0.0011s 0.0036s
Parallel GC work balance: 12.09% (serial 0%, perfect 100%)
TASKS: 18 (1 bound, 16 peak workers (17 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.002s ( 0.002s elapsed)
MUT time 41.299s (177.306s elapsed)
GC time 8.537s ( 2.687s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 1.981s ( 0.520s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 51.821s (179.996s elapsed)
Alloc rate 432,017,544 bytes per MUT second
Productivity 79.7% of total user, 98.2% of total elapsed
gc_alloc_block_sync: 116632
whitehole_spin: 0
gen[0].sync: 4744
gen[1].sync: 30230
(Sorry for jumping in again) Hello @SX91, thanks for heads-up!
This is looks suspicious, but it would be nice if you provide your program or more
information about this issue, otherwise it's easy to get false assumptions.
Could you kindly provide us information about the system-view on the resource usage?
You can get it by running \time -v ./yourprogram
. \
is needed so you run gnu time and
not bash builtin (in gentoo package sys-process/time
homepage is https://www.gnu.org/directory/time.html I hope it will help you to find the program).
Also I see that GC time grow much, and this is a terrible sing, however, afaiu you run
profiled version, can you build package without profiling support and run it with +RTS -s
and provide GC stats. As an option can you show your program or minimal example so
we could analyze it.
Thanks!
socket-0.8.0.1
:
❯ timeout -s INT 180 /usr/bin/time myprog +RTS -maxN2 -s -c -w -qa -qi3 -A2m -M150m -K1k
9,669,582,064 bytes allocated in the heap
397,211,104 bytes copied during GC
8,057,824 bytes maximum residency (478 sample(s))
770,592 bytes maximum slop
28 MB total memory in use (6 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2719 colls, 0 par 0.877s 0.947s 0.0003s 0.0013s
Gen 1 478 colls, 0 par 0.820s 0.843s 0.0018s 0.0054s
Parallel GC work balance: -nan% (serial 0%, perfect 100%)
TASKS: 9 (1 bound, 8 peak workers (8 total), using -N2)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 7.138s (178.194s elapsed)
GC time 1.696s ( 1.790s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 8.836s (179.986s elapsed)
Alloc rate 1,354,579,160 bytes per MUT second
Productivity 80.8% of total user, 99.0% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
8.83user 2.23system 3:00.00elapsed 6%CPU (0avgtext+0avgdata 31912maxresident)k
328inputs+0outputs (3major+9151minor)pagefaults 0swaps
socket-0.8.x.x
:
❯ timeout -s INT 180 /usr/bin/time myprog +RTS -maxN2 -s -c -w -qa -qi3 -A2m -M150m -K1k
8,245,124,384 bytes allocated in the heap
391,462,176 bytes copied during GC
1,651,208 bytes maximum residency (431 sample(s))
85,536 bytes maximum slop
12 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2313 colls, 0 par 0.871s 0.927s 0.0004s 0.0017s
Gen 1 431 colls, 0 par 0.756s 0.778s 0.0018s 0.0032s
Parallel GC work balance: -nan% (serial 0%, perfect 100%)
TASKS: 9 (1 bound, 7 peak workers (8 total), using -N2)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 6.957s (178.118s elapsed)
GC time 1.627s ( 1.705s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 8.589s (179.824s elapsed)
Alloc rate 1,185,078,666 bytes per MUT second
Productivity 81.0% of total user, 99.1% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
8.58user 2.13system 3:00.00elapsed 5%CPU (0avgtext+0avgdata 21220maxresident)k
5864inputs+0outputs (25major+10422minor)pagefaults 0swaps
Hello @qnikst! I'm sorry, but now I don't have time to look deeper in this issue or write some test app to illustrate the problem.
Let's say I use UDP socket and use receiveFrom
func to repeatedly get a packet with 65Kb (or 1.5Kb) buffer size (packet could be a lot shorter). There're N greenthreads (each with it's own socket) running this concurrently. Something like that.
Sorry it was my error can you give output of /usr/bin/time -v
it return maximum memory being used by the program, this is needed to cross verify RTS results; feel free to copy only relevant things, there will be a lot of info in output. I see that total memory in use grows higher at least from RTS point of view, however now GC time looks comparable, so previously it was largely affected by the profiling build.
@qnikst okay, please look at https://github.com/SX91/haskell-socket-test/tree/0.8.0.1-memory.
0.8.x.x
:
timeout -s INT 180 /usr/bin/time -v test-socket-0.8.x.x 30 +RTS -s -N2
444,191,528 bytes allocated in the heap
16,865,944 bytes copied during GC
1,114,056 bytes maximum residency (2 sample(s))
58,680 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 553 colls, 553 par 0.127s 0.115s 0.0002s 0.0005s
Gen 1 2 colls, 1 par 0.000s 0.001s 0.0003s 0.0003s
Parallel GC work balance: 10.50% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 8 peak workers (9 total), using -N2)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.002s elapsed)
MUT time 5.283s (179.882s elapsed)
GC time 0.127s ( 0.116s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 5.411s (180.000s elapsed)
Alloc rate 84,081,320 bytes per MUT second
Productivity 97.6% of total user, 99.9% of total elapsed
gc_alloc_block_sync: 111
whitehole_spin: 0
gen[0].sync: 3
gen[1].sync: 0
Command terminated by signal 2
Command being timed: "test-socket-0.8.x.x 30 +RTS -s -N2"
User time (seconds): 5.41
System time (seconds): 7.11
Percent of CPU this job got: 6%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 6376
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 23426
Voluntary context switches: 253099
Involuntary context switches: 4169
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
0.8.0.1
:
timeout -s INT 180 /usr/bin/time -v test-socket-0.8.0.1 30 +RTS -s -N2
5,101,627,824 bytes allocated in the heap
81,955,288 bytes copied during GC
1,300,008 bytes maximum residency (12 sample(s))
66,496 bytes maximum slop
9 MB total memory in use (4 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 8905 colls, 8905 par 0.552s 0.548s 0.0001s 0.0002s
Gen 1 12 colls, 11 par 0.003s 0.003s 0.0002s 0.0003s
Parallel GC work balance: 38.25% (serial 0%, perfect 100%)
TASKS: 9 (1 bound, 8 peak workers (8 total), using -N2)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.002s elapsed)
MUT time 5.956s (179.446s elapsed)
GC time 0.556s ( 0.551s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 6.512s (179.999s elapsed)
Alloc rate 856,576,132 bytes per MUT second
Productivity 91.5% of total user, 99.7% of total elapsed
gc_alloc_block_sync: 7132
whitehole_spin: 0
gen[0].sync: 428
gen[1].sync: 8
Command terminated by signal 2
Command being timed: "test-socket-0.8.0.1 30 +RTS -s -N2"
User time (seconds): 6.51
System time (seconds): 7.58
Percent of CPU this job got: 7%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 9876
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1834
Voluntary context switches: 261833
Involuntary context switches: 4330
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
My program consumed upto 16Mb compiled with
socket
lib from branch 0.8.x.x. Now it uses 25Mb (or even more). Compiled withGHC-8.0.2
,LTS-8.3
.Heap profile for
socket-0.8.x.x
:Heap profile for
socket-0.8.0.1
:Sorry, no test case this time.