navytux opened this issue 7 years ago
In doing this kind of benchmarking, it's very important to remember that on amd64 the Go type int is 64 bits. Do your timings change if you rewrite your Go programs to use int32 rather than int?
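As an aside, a minimal sketch of how that int vs int32 comparison could also be phrased with the testing package rather than wall-clock timing (file and benchmark names here are hypothetical, not from this thread):

package mapbench

import "testing"

func BenchmarkAssignInt(b *testing.B) {
    m := make(map[int]int)
    for i := 0; i < b.N; i++ {
        m[i&0xffff] = i
    }
}

func BenchmarkAssignInt32(b *testing.B) {
    m := make(map[int32]int32)
    for i := 0; i < b.N; i++ {
        k := int32(i)
        m[k&0xffff] = k
    }
}

Running go test -bench=Assign and comparing the ns/op of the two would answer the question directly.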
I'm surprised the pprof output shows mapaccess, not mapaccess_fast*.
Good point, thanks for noticing. So with the following program:
package main
//import "fmt"
func main() {
    //a := make(map[int]int)
    a := make(map[int32]int32)
    for i := int32(0); i < 100000000; i++ {
        a[i & 0xffff] = i
        //a[i & 0x7f] = i
        //a[i] = i
    }
    //fmt.Println(a)
}
--- a/map.go.kirr
+++ b/map.go
@@ -2,9 +2,10 @@ package main
 //import "fmt"
 func main() {
-    a := make(map[int]int)
+    //a := make(map[int]int)
+    a := make(map[int32]int32)
-    for i := 0; i < 100000000; i++ {
+    for i := int32(0); i < 100000000; i++ {
         a[i & 0xffff] = i
         //a[i & 0x7f] = i
         //a[i] = i
map.go timings are:
map.go case | Time (seconds, best of 5) |
---|---|
a[i & 0xffff] = i | 3.579 |
a[i & 0x7f] = i | 1.596 |
a[i] = i; niter=1E7 | 2.616 |
In other words, it stays the same as with map[int]int ± 2%.
@josharian it is mapassign. git grep shows there is no mapassign_* under go/src.
Huh. We have mapaccess1_fast* and mapaccess2_fast*, but not mapassign_fast*. Maybe we should add that. (Perhaps in the process we should automate generating the fast versions, like we automate the generation of Slice routines in package sort.) `mapdelete_fast` too? Any others?
I'd suggest that as a first step for this issue. I have a few other things on my plate I'd like to do first, but if no one speaks up for it, I'll plan to eventually get to this.
How do you suggest automating the generation of the fast versions? I can see a Go file, run by go generate, doing the right sequence of printf or template substitutions; is that what you're thinking?
@rasky yeah, something like that. I took a quick look at this earlier and there was more divergence between the regular and the _fast methods than I anticipated. It's probably worth doing one manually and benchmarking the improvements before investing too much time in automating it. Do you plan to work on this?
@josharian I wish, but I don't have enough time in the next two weeks.
@rasky cool. I think @martisch has also been thinking about this. If/when someone starts on this in earnest, just drop a note here to avoid duplicated work. Since it seems there's plenty of other folks interested, I will turn my attention elsewhere.
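To make the go generate idea discussed above concrete, here is a rough sketch of what such a generator could look like. Everything in it (the file names, the variant list, the template body) is illustrative only, not the approach that was eventually committed; the generated snippet is a stub, not compilable runtime code.

//go:build ignore

package main

import (
    "os"
    "text/template"
)

// variant describes one specialized map-assign function to stamp out.
type variant struct {
    Suffix  string // e.g. "fast32"
    KeyType string // e.g. "uint32"
}

var tmpl = template.Must(template.New("fast").Parse(`
// Code generated by gen_mapfast.go; DO NOT EDIT.

func mapassign_{{.Suffix}}(t *maptype, h *hmap, key {{.KeyType}}) unsafe.Pointer {
    // ... specialized body for {{.KeyType}} keys would be stamped out here ...
}
`))

func main() {
    out, err := os.Create("hashmap_fast_generated.go")
    if err != nil {
        panic(err)
    }
    defer out.Close()
    for _, v := range []variant{
        {Suffix: "fast32", KeyType: "uint32"},
        {Suffix: "fast64", KeyType: "uint64"},
        {Suffix: "faststr", KeyType: "string"},
    } {
        if err := tmpl.Execute(out, v); err != nil {
            panic(err)
        }
    }
}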
CL https://golang.org/cl/38091 mentions this issue.
@josharian I have a patchset up that passes tests and shows some improvements on benchmarks. Are you the right person to review this?
name old time/op new time/op delta
MapAssignInt32_255 24.6ns ± 0% 19.5ns ± 0% ~ (p=1.000 n=1+1)
MapAssignInt32_64k 43.9ns ± 0% 37.3ns ± 0% ~ (p=1.000 n=1+1)
MapAssignInt64_255 24.8ns ± 0% 19.7ns ± 0% ~ (p=1.000 n=1+1)
MapAssignInt64_64k 45.5ns ± 0% 37.8ns ± 0% ~ (p=1.000 n=1+1)
MapAssignStr_255 46.3ns ± 0% 25.1ns ± 0% ~ (p=1.000 n=1+1)
MapAssignStr_64k 79.6ns ± 0% 56.8ns ± 0% ~ (p=1.000 n=1+1)
Not quite sure why benchstat doesn't show a significant delta...
@huguesb you need to run each benchmark multiple times for benchstat to show lower p values. Try with 5 or 10 runs before and after.
@mvdan Yeah, I screwed up the result concatenation the first time. Here goes
name old time/op new time/op delta
MapAssignInt32_255-8 25.0ns ± 3% 19.0ns ± 4% -24.20% (p=0.000 n=10+10)
MapAssignInt32_64k-8 44.6ns ± 3% 37.6ns ± 3% -15.72% (p=0.000 n=10+9)
MapAssignInt64_255-8 24.8ns ± 4% 18.9ns ± 4% -23.82% (p=0.000 n=10+10)
MapAssignInt64_64k-8 45.2ns ± 1% 38.5ns ± 6% -14.86% (p=0.000 n=9+9)
MapAssignStr_255-8 46.9ns ± 1% 24.6ns ± 4% -47.49% (p=0.000 n=8+10)
MapAssignStr_64k-8 80.9ns ± 4% 53.3ns ± 3% -34.10% (p=0.000 n=10+9)
Please add the numbers to the CL too. Thanks for working on this!
@mvdan Done
@huguesb, @josharian thanks for working on this. After the recent patch (ec091b6af2), the timings, compared to https://github.com/golang/go/issues/19495#issuecomment-285728916, improve a bit:
map.go case | Time (map[int32]int32) | Time (map[int]int) |
---|---|---|
a[i & 0xffff] = i | 2.921 | 2.991 |
a[i & 0x7f] = i | 1.121 | 1.124 |
a[i] = i; niter=1E7 | 2.438 | 2.563 |
But Go is still 3x - 7.5x slower than nodejs.
For the map[int32]int32 case, the profile looks like this:
flat flat% sum% cum cum%
20.60ms 69.36% 69.36% 23ms 77.44% runtime.mapassign_fast32 /home/kirr/src/tools/go/go/src/runtime/hashmap_fast.go
3.60ms 12.12% 81.48% 29.70ms 100% main.main /home/kirr/tmp/trashme/map/map.go
3.10ms 10.44% 91.92% 3.10ms 10.44% runtime.mapassign_fast32 /home/kirr/src/tools/go/go/src/runtime/stubs.go
2.40ms 8.08% 100% 2.40ms 8.08% runtime.aeshash32 /home/kirr/src/tools/go/go/src/runtime/asm_amd64.s
0 0% 100% 29.70ms 100% runtime.main /home/kirr/src/tools/go/go/src/runtime/proc.go
@navytux are you sure the js benchmark is correct? The code does not reference any values added to the map during the loop, so a smart compiler will optimise away the whole loop.
@davecheney thanks for asking. Yes, I checked it by adding console.log(a) at the end. If that is enabled, the compiler cannot optimize the map assignments away, at least unless it is smart enough to compute the whole result just from analyzing the loop structure. So e.g. for the a[i & 0x7f] = i case, with the following patch:
--- a/map.js.kirr
+++ b/map.js
@@ -4,12 +4,12 @@ function main() {
     var a = {};
     for (var i = 0; i < 100000000; i++) {
-        a[i & 0xffff] = i;
-        //a[i & 0x7f] = i;
+        //a[i & 0xffff] = i;
+        a[i & 0x7f] = i;
         //a[i] = i;
     }
-    //console.log(a)
+    console.log(a)
 }
 main()
it is:
kirr@deco:~/tmp/trashme/map$ time ./map.js
{ '0': 99999872,
'1': 99999873,
'2': 99999874,
'3': 99999875,
...
'125': 99999997,
'126': 99999998,
'127': 99999999 }
real 0m0,396s
user 0m0,384s
sys 0m0,012s
this is only a bit more than the 0.377s we had for this case without console.log(), which tells me the map assignments and the loop are not optimized away.
That said, I have to admit I'm a complete newbie to JavaScript and could be missing something.
Thanks again, Kirill
Can you check how things change if you manually unroll the loop 8 times in both languages?
I looked around a bit to see how nodejs compiles this (I used nodejs --print-opt-code --code-comments ... from http://stackoverflow.com/questions/10953173/dump-v8-jit-output-from-node). Showing only the loop itself:
--- Raw source ---
() {
var a = {};
for (var i = 0; i < 100000000; i++) {
//a[i & 0xffff] = i;
a[i & 0x7f] = i;
//a[i] = i;
}
console.log(a)
}
--- Optimized code ---
optimization_id = 0
source_position = 77
kind = OPTIMIZED_FUNCTION
name = main
stack_slots = 7
compiler = crankshaft
Instructions (size = 857)
0x2c5a2f84bee0 0 55 push rbp
0x2c5a2f84bee1 1 4889e5 REX.W movq rbp,rsp
0x2c5a2f84bee4 4 56 push rsi
0x2c5a2f84bee5 5 57 push rdi
0x2c5a2f84bee6 6 4883ec38 REX.W subq rsp,0x38
;;; <@0,#0> -------------------- B0 --------------------
...
;;; <@66,#58> -------------------- B5 (loop header) --------------------
0x2c5a2f84c011 305 4c894db8 REX.W movq [rbp-0x48],r9 ;; debug: position 121
;;; <@69,#62> compare-numeric-and-branch
0x2c5a2f84c015 309 4181f900e1f505 cmpl r9,0x5f5e100
0x2c5a2f84c01c 316 0f8d44000000 jge 390 (0x2c5a2f84c066)
;;; <@70,#63> -------------------- B6 (unreachable/replaced) --------------------
;;; <@74,#69> -------------------- B7 --------------------
;;; <@76,#71> stack-check
0x2c5a2f84c022 322 493ba548080000 REX.W cmpq rsp,[r13+0x848]
0x2c5a2f84c029 329 0f8269010000 jc 696 (0x2c5a2f84c198)
;;; <@77,#71> gap
0x2c5a2f84c02f 335 498bc1 REX.W movq rax,r9
;;; <@78,#75> bit-i
0x2c5a2f84c032 338 83e07f andl rax,0x7f ;; debug: position 182
;;; <@80,#100> smi-tag
0x2c5a2f84c035 341 418bd1 movl rdx,r9
0x2c5a2f84c038 344 48c1e220 REX.W shlq rdx, 32
;;; <@82,#101> smi-tag
0x2c5a2f84c03c 348 8bc8 movl rcx,rax
0x2c5a2f84c03e 350 48c1e120 REX.W shlq rcx, 32
;;; <@83,#101> gap
0x2c5a2f84c042 354 488bf7 REX.W movq rsi,rdi
0x2c5a2f84c045 357 488bc2 REX.W movq rax,rdx
0x2c5a2f84c048 360 488bd3 REX.W movq rdx,rbx
;;; <@84,#78> store-keyed-generic
0x2c5a2f84c04b 363 e81068eeff call KeyedStoreIC_PreMonomorphic (0x2c5a2f732860) ;; code: KEYED_STORE_IC, PREMONOMORPHIC
;;; <@86,#79> lazy-bailout
;;; <@87,#79> gap
0x2c5a2f84c050 368 488b45b8 REX.W movq rax,[rbp-0x48]
;;; <@88,#82> add-i
0x2c5a2f84c054 372 448d4801 leal r9,[rax+0x1] ;; debug: position 135
;;; <@90,#85> gap
0x2c5a2f84c058 376 4c8b45c0 REX.W movq r8,[rbp-0x40]
0x2c5a2f84c05c 380 488b7dc8 REX.W movq rdi,[rbp-0x38]
0x2c5a2f84c060 384 488b5dd0 REX.W movq rbx,[rbp-0x30]
;;; <@91,#85> goto
0x2c5a2f84c064 388 ebab jmp 305 (0x2c5a2f84c011)
;;; <@92,#66> -------------------- B8 (unreachable/replaced) --------------------
;;; <@96,#86> -------------------- B9 --------------------
;;; <@97,#86> gap
0x2c5a2f84c066 390 488b45c8 REX.W movq rax,[rbp-0x38] ;; debug: position 234
...
--- End code ---
It can be seen that the loop is there (100000000 = 0x5f5e100), the & 0x7f is there, and the map assignment appears to be the call to KeyedStoreIC_PreMonomorphic. So to me it once again shows things are not optimized away.
@rasky, sure. For the following programs:
package main
//import "fmt"
func main() {
    //a := make(map[int]int)
    a := make(map[int32]int32)
    for i := int32(0); i < 100000000; {
        //a[i & 0xffff] = i
        //a[i & 0x7f] = i
        //a[i] = i
        a[i & 0xffff] = i // 0
        i++
        a[i & 0xffff] = i // 1
        i++
        a[i & 0xffff] = i // 2
        i++
        a[i & 0xffff] = i // 3
        i++
        a[i & 0xffff] = i // 4
        i++
        a[i & 0xffff] = i // 5
        i++
        a[i & 0xffff] = i // 6
        i++
        a[i & 0xffff] = i // 7
        i++
    }
    //fmt.Println(a)
}
#!/usr/bin/env nodejs
function main() {
    var a = {};
    for (var i = 0; i < 100000000;) {
        //a[i & 0xffff] = i;
        //a[i & 0x7f] = i;
        //a[i] = i;
        a[i & 0xffff] = i; // 0
        i++;
        a[i & 0xffff] = i; // 1
        i++;
        a[i & 0xffff] = i; // 2
        i++;
        a[i & 0xffff] = i; // 3
        i++;
        a[i & 0xffff] = i; // 4
        i++;
        a[i & 0xffff] = i; // 5
        i++;
        a[i & 0xffff] = i; // 6
        i++;
        a[i & 0xffff] = i; // 7
        i++;
    }
    //console.log(a)
}
main()
timings are:
Program | Time (seconds, best of 5) |
---|---|
map-unroll.go | 2.938 |
map-unroll.js | 0.307 |
and just in case, if I uncomment console.log(a) in map-unroll.js it becomes:
--- a/map-unroll.js.kirr
+++ b/map-unroll.js
@@ -26,7 +26,7 @@ function main() {
         i++;
     }
-    //console.log(a)
+    console.log(a)
 }
 main()
kirr@deco:~/tmp/trashme/map$ ./map-unroll.js |head
{ '0': 99942400,
'1': 99942401,
'2': 99942402,
'3': 99942403,
'4': 99942404,
'5': 99942405,
'6': 99942406,
'7': 99942407,
'8': 99942408,
'9': 99942409,
events.js:141
throw er; // Unhandled 'error' event
^
Error: write EPIPE
at exports._errnoException (util.js:907:11)
at WriteWrap.afterWrite (net.js:787:14)
kirr@deco:~/tmp/trashme/map$ time ./map-unroll.js >/dev/null
real 0m0,501s
user 0m0,496s
sys 0m0,008s
So it is 0.501s vs 0.307s, but there are 64K entries to print and IMHO it is reasonable to expect that to add time. Still, even 0.5s vs 2.9s (and the Go case does not print anything) is a big difference.
Thanks, Kirill
CL https://golang.org/cl/38172 mentions this issue.
I've been doing some profiling and it looks like a good chunk of time is spent iterating inside buckets. My feeling is that the memory layout is distinctly suboptimal, because every insert needs to check all cells of the matching bucket in case the matching entry is at the end of the bucket. This implies a constant-factor increase of up to 8x. Assuming keys are evenly distributed inside buckets, which seems reasonable in this test case, I would expect an average of 4 probes per insert, which seems roughly in line with the observed performance difference from other hash tables (which I suspect use some variant of open addressing and do much less probing in practice).
I'm going to experiment with a few micro-optimizations, but I think eventually we'll have to use a different memory layout to get significant improvements.
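For readers not familiar with the runtime internals being discussed, the bucket layout at the time looked roughly like this (simplified from runtime/hashmap.go; the keys/values storage is laid out by the compiler rather than declared as fields, so this is a conceptual sketch, not the literal declaration):

const bucketCnt = 8 // cells per bucket

// bmap is (conceptually) one hash bucket.
type bmap struct {
    // tophash holds the top byte of the hash for each key in the bucket
    // (0 means the cell is empty); an insert scans these 8 bytes first.
    tophash [bucketCnt]uint8
    // ...followed in memory by bucketCnt keys, then bucketCnt values,
    // then a pointer to an overflow bucket.
}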
In the common case you only need to probe the 8 tophash bytes; you don't need to look at the key and value slots themselves. Is it really the case that
for i := 0; i < 8; i++ {
    if b.tophash[i] == x {
        .. unlikely if value is not already in map ..
    }
}
takes a long time?
ROUTINE ======================== runtime.mapassign_fast64 in /Users/hugues/go/src/runtime/hashmap_fast.go
19.60ms 23.20ms (flat, cum) 74.84% of Total
. . 508: }
. . 509: h.flags &^= hashWriting
. . 510: return val
. . 511:}
. . 512:
700us 700us 513:func mapassign_fast64(t *maptype, h *hmap, key uint64) unsafe.Pointer {
...
2.40ms 6ms 524: hash := t.key.alg.hash(noescape(unsafe.Pointer(&key)), uintptr(h.hash0))
...
1.30ms 1.30ms 548: for i := uintptr(0); i < bucketCnt; i++ {
9.40ms 9.40ms 549: if b.tophash[i] != top {
1.20ms 1.20ms 550: if b.tophash[i] == empty && inserti == nil {
Maybe that branch is systematically mispredicted, leading to lots of stalls.
I suspect that node/python are switching the implementation based on the data being stored. For example, here alpha takes 0.4ms and beta 10.7ms:
function alpha() {
    var a = {};
    a[0xffff] = 10;
    for (var i = 0; i < 100000000; i++) {
        a[i & 0xffff] = i;
    }
}

function beta() {
    var a = {};
    a[0xfffffff] = 10;
    for (var i = 0; i < 100000000; i++) {
        a[i & 0xffff] = i;
    }
}
And as a proof of concept, here is a Go implementation that achieves similar characteristics: https://play.golang.org/p/7y3SVXAtA2. I'm not sure whether this is the same thing that V8 does, but I am guessing it is something similar.
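The playground link above is the author's own; as an independent, minimal sketch of the same general idea, here is a container that stays array-backed while keys are small, dense, non-negative integers and degrades to a regular map otherwise. All names and the threshold below are made up for illustration; this is not the playground code.

package main

import "fmt"

const denseLimit = 1 << 16 // hypothetical cutoff for the array-backed mode

type hybridMap struct {
    dense  []int32         // used while every key seen is in [0, denseLimit)
    sparse map[int32]int32 // fallback once any other key appears
}

func (m *hybridMap) set(k, v int32) {
    if m.sparse == nil && k >= 0 && k < denseLimit {
        if int(k) >= len(m.dense) {
            m.dense = append(m.dense, make([]int32, int(k)+1-len(m.dense))...)
        }
        m.dense[k] = v
        return
    }
    if m.sparse == nil {
        // Migrate to the general map. A real implementation would also
        // track which dense slots were actually written; this sketch
        // conflates "zero value" with "unset".
        m.sparse = make(map[int32]int32, len(m.dense))
        for i, dv := range m.dense {
            m.sparse[int32(i)] = dv
        }
        m.dense = nil
    }
    m.sparse[k] = v
}

func main() {
    var m hybridMap
    for i := int32(0); i < 100000000; i++ {
        m.set(i&0xffff, i) // stays on the dense (slice) path throughout
    }
    fmt.Println(m.dense[0], m.dense[0xffff])
}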
@egonelbre thanks for noticing this. Some relevant links about nodejs:
https://github.com/nodejs/node/blob/v4.x/deps/v8/src/objects.h#L3519 https://github.com/nodejs/node/blob/v4.x/deps/v8/src/objects.cc#L14934 https://github.com/nodejs/node/blob/v4.x/deps/v8/src/objects.cc#L12440 https://github.com/nodejs/node/blob/v4.x/deps/v8/src/objects.cc#L12499 https://github.com/nodejs/node/blob/v4.x/deps/v8/src/elements-kind.h#L13 (I've checked nodejs v4.x branch since it matches nodejs which is currently packaged by Debian)
PyPy is not switching its dict implementation this way: the timing for the following program stays the same regardless of whether the high-index preset is enabled or not:
#!/usr/bin/env pypy
def main():
    a = {}
    if 1:
        # set high index to make sure dict implementation is hash-table, not array
        a[0x100000000] = 10
        a[1<<63 - 1] = 10
    for i in range(100000000):
        a[i & 0xffff] = i
        #a[i & 0x7f] = i
        #a[i] = i
    print(a)

if __name__ == '__main__':
    main()
it is:
$ ./map.py |head -c80
{4294967296: 10, 4611686018427387904: 10, 0: 99942400, 1: 99942401, 2: 99942402,Traceback (most recent call last):
File "./map.py", line 20, in <module>
main()
File "./map.py", line 17, in main
print(a)
IOError: [Errno 32] Broken pipe: '<fdopen>'
$ time ./map.py >/dev/null
real 0m2,005s
user 0m1,884s
sys 0m0,116s
The go version:
package main
//import "fmt"
func main() {
    a := make(map[int32]int32)
    for i := int32(0); i < 100000000; i++ {
        a[i & 0xffff] = i
        //a[i & 0x7f] = i
        //a[i] = i
    }
    //fmt.Println(a)
}
Still takes ~1.4x - 1.5x longer compared to pypy:
$ go build map2.go
$ time ./map2 # best of 5
real 0m2,887s
user 0m2,884s
sys 0m0,004s
(and pypy/nodejs programs do printing, while go does not - with printing go time increases to ~ 3.0s)
Pypy dict related links:
https://bitbucket.org/pypy/pypy/src/9d2b5856f4c3b3ff1f6b2098afd4de299aaaa9cf/rpython/rtyper/lltypesystem/rdict.py?at=default&fileviewer=file-view-default#rdict.py-18 https://bitbucket.org/pypy/pypy/src/9d2b5856f4c3b3ff1f6b2098afd4de299aaaa9cf/pypy/objspace/std/dictmultiobject.py?at=default&fileviewer=file-view-default#dictmultiobject.py-1270 https://bitbucket.org/pypy/pypy/src/9d2b5856f4c3b3ff1f6b2098afd4de299aaaa9cf/pypy/objspace/std/dictmultiobject.py?at=default&fileviewer=file-view-default#dictmultiobject.py-676
@huguesb @josharian Why guess when you can measure? :)
If you're on linux and have a relatively modern Intel chip (or something with a decent PMU), you can run perf record -e cache-misses:pp -e cycles:pp and perf record -e branch-misses:pp -e cycles:pp on your executable (followed by perf report) and see which event a) occurs more frequently, and b) which one consumes more cycles. It may also be worth looking at dtlb_load_misses.
@philhofer if, say, you're typing on your phone or your only dev machine runs OS X. Or both, as it happens. :) But I'd love to hear the results!
Yes, this appears to be a dcache effect, not a branch misprediction. (Surprisingly, the branch in question doesn't show up at all when measuring mispredictions.)
Event counts for cache-misses:pp (you have to squint a little bit to compensate for the interrupt skid). I see very similar results for dTLB-load-misses.
│ for { ▒
│ for i := uintptr(0); i < bucketCnt; i++ { ▒
0.90 │ d4: cmp ip, #8 ▒
│ ↓ bcs 1a0 ▒
│ if b.tophash[i] != top { ▒
│ ldrsb fp, [r2] ▒
│ add lr, r2, ip ◆
2.23 │ ldrb r0, [lr] ▒
│ and r3, r5, #255 ▒
2.51 │ cmp r3, r0 ▒
│ ↓ beq 13c ▒
│ if b.tophash[i] == empty && inserti == nil { ▒
│ cmp r0, #0 ▒
26.86 │ ↓ bne 104 ▒
│ cmp r7, #0 ▒
│ ↓ beq 114 ▒
│ for i := uintptr(0); i < bucketCnt; i++ { ▒
│104: add ip, ip, #1 ▒
│ bucket := hash & (uintptr(1)<<h.B - 1) ▒
3.13 │ ldr r0, [sp, #28] ▒
│ mov r3, #1 ▒
│ for i := uintptr(0); i < bucketCnt; i++ { ▒
0.07 │ ↑ b d4
The hardware I'm testing on has a particularly anemic TLB (12 entries!), so it tends to exaggerate these sorts of effects.
On an unrelated note, the instruction sequence
ldr r0, [sp, #28]
mov r3, #1
shouldn't be in that loop body at all.
Relevant perf output for cache-misses on amd64 (Intel Core i7-3610QM from 2012). Not quite sure what to make of it.
│ var inserti *uint8 ▒
│ var insertk unsafe.Pointer ▒
│ var val unsafe.Pointer ▒
│ for { ◆
│ for i := uintptr(0); i < bucketCnt; i++ { ▒
2.02 │ e5: cmp $0x8,%r11 ▒
│ ↓ jae 192 ▒
│ if b.tophash[i] != top { ▒
0.26 │ test %al,(%rdx) ▒
24.06 │ movzbl (%rdx,%r11,1),%r12d ▒
1.16 │ lea (%rdx,%r11,1),%r13 ▒
2.43 │ cmp %r12b,%al ▒
│ ↓ je 126 ▒
│ if b.tophash[i] == empty && inserti == nil { ▒
0.90 │ test %r12b,%r12b ▒
│ ↓ jne 109 ▒
│ test %rcx,%rcx ▒
│ ↓ je 10e
That looks consistent with the theory that the load generated for b.tophash[i] != top is often expensive. Since tophash is only 8 bytes, I think it's fair to say that the cache miss occurs when walking from one bucket to its overflow bucket.
FWIW, I think this is more or less what you would expect: one mandatory cache miss per bucket scanned. Even worse is the fact that the buckets are far apart (for large maps), so you may take a TLB miss on top of your cache miss.
> the cache miss occurs when walking from one bucket to its overflow bucket.
If that's the case, it might be interesting to experiment with altering loadFactor in runtime/hashmap.go, and also to instrument a little to understand how much of the time the map spends in a getting-bigger growth state versus a same-size growth state. I doubt there are any silver bullets, but it might be worth looking into.
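For context, the knob being referred to was, as far as I recall of that era's runtime/hashmap.go, a single constant; the comment here is paraphrased, so treat this as a sketch rather than a quote:

// A map grows (roughly) once count > loadFactor * numberOfBuckets,
// i.e. more than about 6.5 entries per 8-cell bucket on average.
const loadFactor = 6.5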
Any solution or update to this issue?
Funnily enough, I ran into this while doing a simple LeetCode exercise: the thought came to me that it might be interesting to compare the results of Python and Go, and I was startled by the result.
I don't think there have been any significant changes since the mapassign_fast* functions went in (March 2017?).
Change https://golang.org/cl/115616 mentions this issue: runtime: use indexByte in mapassign_fast32
Change https://golang.org/cl/118358 mentions this issue: runtime: vectorized map bucket lookup
Change https://golang.org/cl/142437 mentions this issue: runtime: exit early when scanning map buckets
Any update? I ran into this problem while rewriting a NodeJS project in Go, and found that the performance of Go's map is much slower than the NodeJS one.
@NeoyeElf focusing on microbenchmarks can be misleading. Have you found the performance of the application as a whole, specifically request latency and throughput, to be worse than your existing NodeJS codebase?
@davecheney My project gets and puts data like a spreadsheet, which has rows and columns. In NodeJS, the data structure is like: dataTable: { row: { column: cellValue }}. In Golang, I use a map like this: map[int32]map[int32]*CellValue. When there are 100000 cells, the speed of getting cellValue in Golang is much slower than in NodeJS, and it matters.
btw: I changed the data structure to a slice, and the speed is as expected and acceptable. Maybe when the keys are consecutive numbers, the object in NodeJS is implemented by an array as well... just a guess :)
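For what it's worth, a minimal sketch of the slice-backed layout described above (type and function names are hypothetical; it assumes row/column indices are small and dense, which is exactly what makes the slice competitive here):

package main

// CellValue stands in for whatever the real cell type is.
type CellValue struct {
    text string
}

// sheet stores cells in a dense 2-D slice instead of nested maps.
type sheet struct {
    cells [][]*CellValue // cells[row][col]
}

func newSheet(rows, cols int) *sheet {
    s := &sheet{cells: make([][]*CellValue, rows)}
    for r := range s.cells {
        s.cells[r] = make([]*CellValue, cols)
    }
    return s
}

func (s *sheet) put(row, col int, v *CellValue) { s.cells[row][col] = v }
func (s *sheet) get(row, col int) *CellValue    { return s.cells[row][col] }

func main() {
    s := newSheet(1000, 100) // 100000 cells, as in the comment above
    s.put(3, 7, &CellValue{text: "hello"})
    _ = s.get(3, 7)
}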
> @davecheney My project will get&put data like spreadsheet which has row, columns. In NodeJS, the data structure is like: dataTable: { row: { column: cellValue }}. In Golang, I use map like this: map[int32]map[int32]*CellValue. When there are 100000 cells, the speed for getting cellValue in Golang is much slower than NodeJS, and it matters.
Can you provide a benchmark to support this?
@davecheney The Golang program:
package main

import (
    "fmt"
    "time"
)

func main() {
    m := make(map[int32]int32)
    for i := int32(0); i < 1000000; i++ {
        m[i] = i
    }
    tmp := int32(0)
    start := time.Now()
    for i := int32(0); i < 1000000; i++ {
        tmp = m[i]
    }
    fmt.Println("cost time:", time.Now().Sub(start).Milliseconds())
    fmt.Println(tmp)
}
it cost about 60ms.
The NodeJS program:
const m = {}
for (let index = 0; index < 1000000; index++) {
    m[index] = index
}
let tmp
console.time('11')
for (let index = 0; index < 1000000; index++) {
    tmp = m[index]
}
console.timeEnd('11')
console.log(tmp)
it costs about 2.6ms.
The programs above were run on my Mac with a 2.2 GHz quad-core Intel Core i7 and 16 GB 1600 MHz DDR3.
I'm not terribly surprised that a consecutive number optimization would help in cases like this. Instead of iterating to 1e6 by 1, try iterating to 1e9 by 1000. See if that changes anything.
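Concretely, the suggested variant would look like this on the Go side (same number of keys, but the keys are no longer consecutive); the interesting part is applying the same change to the NodeJS version. This mirrors the benchmark above and keeps its millisecond reporting:

package main

import (
    "fmt"
    "time"
)

func main() {
    m := make(map[int32]int32)
    for i := int32(0); i < 1000000000; i += 1000 {
        m[i] = i // still 1e6 entries, but the keys are not consecutive
    }
    tmp := int32(0)
    start := time.Now()
    for i := int32(0); i < 1000000000; i += 1000 {
        tmp = m[i]
    }
    fmt.Println("cost time:", time.Now().Sub(start).Milliseconds())
    fmt.Println(tmp)
}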
Anyone have the pointer to the nodeJS map implementation source?
> I'm not terribly surprised that a consecutive number optimization would help in cases like this. Instead of iterating to 1e6 by 1, try iterating to 1e9 by 1000. See if that changes anything.
In this case, Go is about 2 times faster than NodeJS (63ms vs 128ms).
> Anyone have the pointer to the nodeJS map implementation source?
I found the post "Fast properties in V8". It answered my question.
> Anyone have the pointer to the nodeJS map implementation source?
@randall77 I believe it is in the v8 project, in C++, in these files:
@indutny please correct me if I am wrong about, or missing, any references. Thank you.
> In this case, Go is about 2 times faster than NodeJS (63ms vs 128ms).
Ok, then I'm not really sure that there's anything to fix here. Go's maps are general purpose. For the case of known dense integer indexes, we could have a special case map implementation. But that's a lot of complexity to add for a speedup in a fairly special case (and one for which a slice would quite possibly work just as well).
V8 (in Node.js) has two separate dictionary modes for numeric and non-numeric keys in the map, i.e. if you store values by consecutive numerical keys in the {} (object), V8 will recognize it and back the object with a list in the heap.
Please answer these questions before submitting your issue. Thanks!
What did you do?
Hello up there. Map performance was already discussed some time ago in #3885 and improved a bit. It was also said there that the map algorithm is chosen to work very well with very, very large maps. However, maps are not always very large, and IMHO in many practical cases they are small or medium-sized.
So please consider the following 3 programs:
(https://play.golang.org/p/rPH1pSM1Xk)
The time it takes to run them on i7-6600U is as follows:
The Go version is 9.5x slower than the JavaScript one, and ~1.8x slower than the PyPy one.
If we reduce the actual map size from 64K elements to 128 elements, activating the a[i & 0x7f] = i case via e.g. the following patch, timings become:
JavaScript becomes only a bit faster here, while Go and PyPy improve ~2.3x / 2.2x respectively. Still, Go is 4x slower than JavaScript and 1.7x slower than PyPy.
We can also test how it works if we do not limit the map size and let it grow on every operation. Yes, JavaScript and PyPy are more memory hungry, and for the original niter=1E8 I'm getting out-of-memory in their cases on my small laptop, but let's test with e.g. niter=1E7 (diff to original program); timings become:
So go/js is ~6.5x slower and go/pypy is ~2.2x slower.
The profile for the original program (a[i & 0xffff] = i) is:
What did you expect to see?
Map operations for small / medium maps as fast as or faster than in nodejs.
What did you see instead?
Map operations are 4x-10x slower than in JavaScript for map sizes that are commonly present in many programs.
Does this issue reproduce with the latest release (go1.8)?
Yes.
System details
Thanks beforehand, Kirill
/cc @rsc, @randall77