Closed hchauvin closed 6 years ago
Hi @hchauvin, thanks for your report! The data race in Flow is by design and is safe (happy to expand on that, but the gist is that the updates to forks are done such that happens-before relationships are established outside of the use of data synchronization, and the writes that are racy are all references.)
However, the first panic is real! I've never seen this. I've looked at the code paths, and I might have some ideas. Can you share the context that you're running this in? Is it a single run? A batch? Is this reproducible or did you see it only once?
Thanks.
Hi,
Thank you for such a quick reply!
It is a single run, but quite a large one (if by batch you mean your batch mode, because this is a definitely a batch-like situation). I tried to reproduce it locally with a "curtailed optimization" mode where I did not optimize fully, but I couldn't get it, probably because there is some rate limiting. The only way I can get it is if I do the whole run remotely on 368 cores on an empty cache. There I get it every single time. Here is the run I just did:
ec2cluster: 8 instances: c5.18xlarge:5,r4.xlarge:1,t2.large:1,t2.medium:1 (<=$17.7/hr), total{mem:709.1GiB cpu:368 disk:2.0TiB intel_avx:368 intel_avx2:364 intel_avx512:360}, waiti
2fe6c367: elapsed: 9m30s, running:3, completed: 732/773
XXX: exec done 65.9KiB 7m50s
XXX: exec done 125.0KiB 7m19s
XXX: exec done 129.0KiB 7m12s
XXX: exec done 126.7KiB 7m12s
XXX: exec ..82de2c015bb10f63b87b8cc41247aecd337304 --output_file={{out}}.--input..op_name=WTC57.--dry_run=FALSE 6s
transfers: done: 4529 1.9GiB, transferring: 0 0B, waiting: 0 0B
ec2-XXX...->ec2-XXX...: done: 1 126.7KiB, transferring: 0 0B, waiting: 0 0B 7s
ec2-XXX...->s3:XXX: done: 1 126.7KiB, transferring: 0 0B, waiting: 0 0B 7s
ec2-XXX...->ec2-XXX...: done: 1 129.0KiB, transferring: 0 0B, waiting: 0 0B 6s
ec2-XXX...->s3:XXX: done: 1 129.0KiB, transferring: 0 0B, waiting: 0 0B 6s
ec2-XXX.eu..->ec2-XXX...: done: 1 65.9KiB, transferring: 0 0B, waiting: 0 0B 1s
ec2-XXX.eu..->s3:XXX: done: 1 65.9KiB, transferring: 0 0B, waiting: 0 0B 1s
ec2-XXX.eu..->ec2-XXX...: done: 1 125.0KiB, transferring: 0 0B, waiting: 0 0B 0s
fatal error: concurrent map read and map write
goroutine 52570 [running]:
runtime.throw(0xf38c2b, 0x21)
GOROOT/src/runtime/panic.go:619 +0x81 fp=0xc423707160 sp=0xc423707140 pc=0x42ceb1
runtime.mapaccess1(0xe30ec0, 0xc420f66ab0, 0xc423707370, 0xc4239e3b30)
GOROOT/src/runtime/hashmap.go:357 +0x216 fp=0xc4237071a8 sp=0xc423707160 pc=0x409906
github.com/grailbio/reflow/values.Map.Insert(0xc420f66ab0, 0x5, 0x6ccea30c1a89f402, 0x577dcc3c9f33ab05, 0xc2de70a08e9e1f64, 0x326543abb206dd76, 0x0, 0x0, 0x0, 0x0, ...)
values/values.go:81 +0x196 fp=0xc4237073f0 sp=0xc4237071a8 pc=0x829256
github.com/grailbio/reflow/syntax.Force.func4()
syntax/force.go:83 +0x13d fp=0xc4237074f0 sp=0xc4237073f0 pc=0x90861d
github.com/grailbio/reflow/syntax.(*resolver).Resolve.func1(0xc422556000, 0x120, 0x120, 0xdae440)
syntax/force.go:197 +0xc6 fp=0xc423707528 sp=0xc4237074f0 pc=0x908996
github.com/grailbio/reflow.(*Flow).canonicalize.func2(0xc422556000, 0x120, 0x120, 0xc422556000)
flow.go:839 +0x57 fp=0xc423707568 sp=0xc423707528 pc=0x85ae57
github.com/grailbio/reflow.(*Eval).eval(0xc4202b7200, 0xff7d60, 0xc420bf6780, 0xc420ea3c00, 0x0, 0x0)
eval.go:1067 +0x14e7 fp=0xc423707f70 sp=0xc423707568 pc=0x842187
github.com/grailbio/reflow.(*Eval).Do.func4(0xc4202b7200, 0xc42074e250, 0xc420ea3c00)
eval.go:459 +0x4f fp=0xc423707fc8 sp=0xc423707f70 pc=0x85a26f
runtime.goexit()
bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc423707fd0 sp=0xc423707fc8 pc=0x45c5f1
created by github.com/grailbio/reflow.(*Eval).Do
eval.go:458 +0x4d5
goroutine 1 [runnable]:
github.com/grailbio/reflow.(*FlowVisitor).Walk(0xc420d60848, 0xc423df8af8)
flow.go:875 +0xe4
github.com/grailbio/reflow.(*Eval).wait(0xc4202b7200, 0xff7d60, 0xc420bf6780, 0x0, 0x0)
eval.go:656 +0x233
github.com/grailbio/reflow.(*Eval).Do(0xc4202b7200, 0xff7d60, 0xc420bf66c0, 0x0, 0x0)
eval.go:484 +0xb46
github.com/grailbio/reflow/runner.(*Runner).Eval(0xc4201b0000, 0xff7e20, 0xc4201bd5c0, 0x60, 0x0, 0x0, 0xc420e8b67b)
runner/runner.go:287 +0x39e
github.com/grailbio/reflow/runner.(*Runner).Do(0xc4201b0000, 0xff7e20, 0xc4201bd5c0, 0x0)
runner/runner.go:204 +0x38f
github.com/grailbio/reflow/tool.(*Cmd).run(0xc42047a000, 0xff7e20, 0xc4201bd5c0, 0xc42001e120, 0x2, 0x2)
tool/run.go:310 +0x12c8
github.com/grailbio/reflow/tool.(*Cmd).Main(0xc42047a000)
tool/main.go:293 +0x9fb
main.main()
cmd/reflow/main.go:62 +0x3fc
goroutine 5 [syscall, 10 minutes]:
os/signal.signal_recv(0x0)
GOROOT/src/runtime/sigqueue.go:139 +0xa6
os/signal.loop()
GOROOT/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.0
GOROOT/src/os/signal/signal_unix.go:28 +0x41
goroutine 18 [select]:
github.com/grailbio/reflow/internal/status.Reporter.displayTerm(0xc4202a2060, 0xff0ca0, 0xc42000e020, 0xc420065be0, 0xc420295ea0)
internal/status/stream.go:118 +0x44b
github.com/grailbio/reflow/internal/status.Reporter.Go(0xc4202a2060, 0xff0ca0, 0xc42000e020, 0xc420295ea0)
internal/status/stream.go:81 +0xaf
created by github.com/grailbio/reflow/tool.(*Cmd).Main
tool/main.go:222 +0x322
goroutine 21 [select, 10 minutes, locked to thread]:
runtime.gopark(0xf52668, 0x0, 0xf1d188, 0x6, 0x18, 0x1)
GOROOT/src/runtime/proc.go:291 +0x11a
runtime.selectgo(0xc42004bf50, 0xc4202a2180)
GOROOT/src/runtime/select.go:392 +0xe50
runtime.ensureSigM.func1()
GOROOT/src/runtime/signal_unix.go:549 +0x1f4
runtime.goexit()
bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/src/runtime/asm_amd64.s:2361 +0x1
goroutine 53 [chan receive, 10 minutes]:
github.com/grailbio/reflow/tool.(*Cmd).Main.func2(0xc420280420, 0xc420287020, 0xc42047a000)
tool/main.go:283 +0x4c
created by github.com/grailbio/reflow/tool.(*Cmd).Main
tool/main.go:282 +0x983
goroutine 68 [select]:
github.com/grailbio/reflow/ec2cluster.(*Cluster).maintain(0xc420e9a1e0)
ec2cluster/ec2cluster.go:474 +0x189
created by github.com/grailbio/reflow/ec2cluster.(*Cluster).Init
ec2cluster/ec2cluster.go:183 +0x54c
goroutine 187 [IO wait]:
internal/poll.runtime_pollWait(0x7f3dbc702f00, 0x72, 0xc420fbc858)
GOROOT/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc420eb6f98, 0x72, 0xffffffffffffff00, 0xff1f20, 0x1542c90)
GOROOT/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc420eb6f98, 0xc422404000, 0x8000, 0x8000)
GOROOT/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc420eb6f80, 0xc422404000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
GOROOT/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc420eb6f80, 0xc422404000, 0x8000, 0x8000, 0xc420fbca38, 0xb23792, 0xe80720)
GOROOT/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc420f60220, 0xc422404000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
GOROOT/src/net/net.go:176 +0x6a
crypto/tls.(*block).readFromUntil(0xc420282f90, 0x7f3db80bb000, 0xc420f60220, 0x5, 0xc420f60220, 0x17)
GOROOT/src/crypto/tls/conn.go:493 +0x96
crypto/tls.(*Conn).readRecord(0xc4208b1500, 0xf52717, 0xc4208b1620, 0x4)
GOROOT/src/crypto/tls/conn.go:595 +0xe0
crypto/tls.(*Conn).Read(0xc4208b1500, 0xc420fb7000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
GOROOT/src/crypto/tls/conn.go:1156 +0x100
bufio.(*Reader).Read(0xc42099bc80, 0xc420689458, 0x9, 0x9, 0x2, 0xc420fbcc78, 0x405d75)
GOROOT/src/bufio/bufio.go:216 +0x238
io.ReadAtLeast(0xfeff60, 0xc42099bc80, 0xc420689458, 0x9, 0x9, 0x9, 0xc42087c9e0, 0xc420fbcd38, 0x405aa7)
GOROOT/src/io/io.go:309 +0x86
io.ReadFull(0xfeff60, 0xc42099bc80, 0xc420689458, 0x9, 0x9, 0x4701e3, 0xc420fbcd18, 0xc420fbcd48)
GOROOT/src/io/io.go:327 +0x58
golang.org/x/net/http2.readFrameHeader(0xc420689458, 0x9, 0x9, 0xfeff60, 0xc42099bc80, 0x0, 0x0, 0xc424288601, 0xc420fbce00)
external/org_golang_x_net/http2/frame.go:237 +0x7b
golang.org/x/net/http2.(*Framer).ReadFrame(0xc420689420, 0xc42a7c6510, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/frame.go:492 +0xa4
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc420fbcfb8, 0xf51a90, 0xc4204b37b8)
external/org_golang_x_net/http2/transport.go:1477 +0x8e
golang.org/x/net/http2.(*ClientConn).readLoop(0xc420ead500)
external/org_golang_x_net/http2/transport.go:1405 +0x68
created by golang.org/x/net/http2.(*Transport).newClientConn
external/org_golang_x_net/http2/transport.go:592 +0x65c
goroutine 69 [select]:
github.com/grailbio/reflow/ec2cluster.(*Cluster).loop(0xc420e9a1e0)
ec2cluster/ec2cluster.go:416 +0x15f7
created by github.com/grailbio/reflow/ec2cluster.(*Cluster).Init
ec2cluster/ec2cluster.go:184 +0x571
goroutine 31857 [IO wait]:
internal/poll.runtime_pollWait(0x7f3db857eea8, 0x72, 0xc421840858)
GOROOT/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4236ac718, 0x72, 0xffffffffffffff00, 0xff1f20, 0x1542c90)
GOROOT/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4236ac718, 0xc421694000, 0x2000, 0x2000)
GOROOT/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc4236ac700, 0xc421694000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
GOROOT/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc4236ac700, 0xc421694000, 0x2000, 0x2000, 0xc421840a38, 0xb23792, 0xe80720)
GOROOT/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc420ef82a0, 0xc421694000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
GOROOT/src/net/net.go:176 +0x6a
crypto/tls.(*block).readFromUntil(0xc4231f8c30, 0x7f3db80bb000, 0xc420ef82a0, 0x5, 0xc420ef82a0, 0x100000000f1a8de)
GOROOT/src/crypto/tls/conn.go:493 +0x96
crypto/tls.(*Conn).readRecord(0xc423d9bc00, 0xf52717, 0xc423d9bd20, 0x4)
GOROOT/src/crypto/tls/conn.go:595 +0xe0
crypto/tls.(*Conn).Read(0xc423d9bc00, 0xc421d8d000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
GOROOT/src/crypto/tls/conn.go:1156 +0x100
bufio.(*Reader).Read(0xc421fe74a0, 0xc420de8ff8, 0x9, 0x9, 0x2, 0xc421840c78, 0x405d75)
GOROOT/src/bufio/bufio.go:216 +0x238
io.ReadAtLeast(0xfeff60, 0xc421fe74a0, 0xc420de8ff8, 0x9, 0x9, 0x9, 0x413c87, 0xc421840d08, 0xc421840d08)
GOROOT/src/io/io.go:309 +0x86
io.ReadFull(0xfeff60, 0xc421fe74a0, 0xc420de8ff8, 0x9, 0x9, 0x4701e3, 0xc421840d18, 0xc421840d48)
GOROOT/src/io/io.go:327 +0x58
golang.org/x/net/http2.readFrameHeader(0xc420de8ff8, 0x9, 0x9, 0xfeff60, 0xc421fe74a0, 0x0, 0x100000000, 0xc421610500, 0xc421840e00)
external/org_golang_x_net/http2/frame.go:237 +0x7b
golang.org/x/net/http2.(*Framer).ReadFrame(0xc420de8fc0, 0xc42357e210, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/frame.go:492 +0xa4
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc421840fb8, 0xf51a90, 0xc4204b37b8)
external/org_golang_x_net/http2/transport.go:1477 +0x8e
golang.org/x/net/http2.(*ClientConn).readLoop(0xc423bf6e00)
external/org_golang_x_net/http2/transport.go:1405 +0x68
created by golang.org/x/net/http2.(*Transport).newClientConn
external/org_golang_x_net/http2/transport.go:592 +0x65c
goroutine 131 [IO wait]:
internal/poll.runtime_pollWait(0x7f3dbc702d60, 0x72, 0xc420fc3858)
GOROOT/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc420b5f818, 0x72, 0xffffffffffffff00, 0xff1f20, 0x1542c90)
GOROOT/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc420b5f818, 0xc420b50000, 0x8000, 0x8000)
GOROOT/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc420b5f800, 0xc420b50000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
GOROOT/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc420b5f800, 0xc420b50000, 0x8000, 0x8000, 0x8, 0x8, 0x7ff3)
GOROOT/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc420a805f0, 0xc420b50000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
GOROOT/src/net/net.go:176 +0x6a
crypto/tls.(*block).readFromUntil(0xc420dae870, 0x7f3db80bb000, 0xc420a805f0, 0x5, 0xc420a805f0, 0x0)
GOROOT/src/crypto/tls/conn.go:493 +0x96
crypto/tls.(*Conn).readRecord(0xc4208b0000, 0xf52717, 0xc4208b0120, 0xc423e9a5c0)
GOROOT/src/crypto/tls/conn.go:595 +0xe0
crypto/tls.(*Conn).Read(0xc4208b0000, 0xc420fdb000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
GOROOT/src/crypto/tls/conn.go:1156 +0x100
bufio.(*Reader).Read(0xc421090ea0, 0xc4206890d8, 0x9, 0x9, 0xc420fc3c50, 0x470183, 0xc420fc3c60)
GOROOT/src/bufio/bufio.go:216 +0x238
io.ReadAtLeast(0xfeff60, 0xc421090ea0, 0xc4206890d8, 0x9, 0x9, 0x9, 0xc420932101, 0xc420fc3d00, 0xb1e57b)
GOROOT/src/io/io.go:309 +0x86
io.ReadFull(0xfeff60, 0xc421090ea0, 0xc4206890d8, 0x9, 0x9, 0xc420064060, 0xc420fc3d30, 0xb1e48d)
GOROOT/src/io/io.go:327 +0x58
golang.org/x/net/http2.readFrameHeader(0xc4206890d8, 0x9, 0x9, 0xfeff60, 0xc421090ea0, 0x0, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/frame.go:237 +0x7b
golang.org/x/net/http2.(*Framer).ReadFrame(0xc4206890a0, 0xc422c6ddd0, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/frame.go:492 +0xa4
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc420fc3fb8, 0xf51a90, 0xc420fa1fb8)
external/org_golang_x_net/http2/transport.go:1477 +0x8e
golang.org/x/net/http2.(*ClientConn).readLoop(0xc42107b880)
external/org_golang_x_net/http2/transport.go:1405 +0x68
created by golang.org/x/net/http2.(*Transport).newClientConn
external/org_golang_x_net/http2/transport.go:592 +0x65c
goroutine 211 [select]:
github.com/grailbio/reflow/pool.Keepalive(0xff7d60, 0xc420bf66c0, 0xc420cb6cf0, 0xfff2c0, 0xc4208d6800, 0x0, 0x0)
pool/pool.go:148 +0x2c7
github.com/grailbio/reflow/runner.(*Runner).Eval.func1(0xc4209fdf20, 0xc4201b0000, 0xc42000f018, 0xc420cbb470)
runner/runner.go:266 +0x64
created by github.com/grailbio/reflow/runner.(*Runner).Eval
runner/runner.go:265 +0x16c
goroutine 212 [select]:
github.com/grailbio/reflow/runner.(*Stealer).Go(0xc420a237e0, 0xff7d60, 0xc420bf66c0, 0xc4202b7200)
runner/stealer.go:41 +0x14e
created by github.com/grailbio/reflow/runner.(*Runner).Eval
runner/runner.go:286 +0x374
goroutine 51963 [select]:
net/http.(*persistConn).writeLoop(0xc42af1e6c0)
GOROOT/src/net/http/transport.go:1822 +0x14b
created by net/http.(*Transport).dialConn
GOROOT/src/net/http/transport.go:1238 +0x97f
goroutine 46016 [select, 1 minutes]:
github.com/grailbio/reflow/pool.Keepalive(0xff7d60, 0xc420d9ab40, 0xc420cb6cf0, 0xfff2c0, 0xc423338a20, 0xc421bb3620, 0xc421bb3530)
pool/pool.go:148 +0x2c7
github.com/grailbio/reflow/runner.(*Stealer).Go.func1(0xff7d60, 0xc420d9ab40, 0xc420a237e0, 0xfff2c0, 0xc423338a20, 0xc4209fc7d0)
runner/stealer.go:60 +0x78
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:59 +0x460
goroutine 10127 [select]:
github.com/grailbio/reflow/pool.Keepalive(0xff7d60, 0xc422058a00, 0xc420cb6cf0, 0xfff2c0, 0xc4207c5560, 0xc42292cf00, 0x0)
pool/pool.go:148 +0x2c7
github.com/grailbio/reflow/runner.(*Stealer).Go.func1(0xff7d60, 0xc422058a00, 0xc420a237e0, 0xfff2c0, 0xc4207c5560, 0xc420c7b870)
runner/stealer.go:60 +0x78
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:59 +0x460
goroutine 52455 [select]:
net/http.(*persistConn).roundTrip(0xc42af1e6c0, 0xc424202360, 0x0, 0x0, 0x0)
GOROOT/src/net/http/transport.go:2033 +0x5a7
net/http.(*Transport).RoundTrip(0x1534400, 0xc420a4e200, 0x1534400, 0x0, 0x0)
GOROOT/src/net/http/transport.go:422 +0x8f2
net/http.send(0xc420a4e200, 0xff09a0, 0x1534400, 0x0, 0x0, 0x0, 0xc420a802a0, 0xc420772230, 0xc423709750, 0x1)
GOROOT/src/net/http/client.go:252 +0x185
net/http.(*Client).send(0x1586f80, 0xc420a4e200, 0x0, 0x0, 0x0, 0xc420a802a0, 0x0, 0x1, 0x280)
GOROOT/src/net/http/client.go:176 +0xfa
net/http.(*Client).Do(0x1586f80, 0xc420a4e200, 0xc423709930, 0x10, 0xc423709928)
GOROOT/src/net/http/client.go:615 +0x28d
github.com/aws/aws-sdk-go/aws/corehandlers.sendFollowRedirects(0xc4212e8800, 0xf51318, 0xc421ed5800, 0xc4212e8800)
external/com_github_aws_aws_sdk_go/aws/corehandlers/handlers.go:120 +0x3b
github.com/aws/aws-sdk-go/aws/corehandlers.glob..func3(0xc4212e8800)
external/com_github_aws_aws_sdk_go/aws/corehandlers/handlers.go:112 +0x87
github.com/aws/aws-sdk-go/aws/request.(*HandlerList).Run(0xc4212e8998, 0xc4212e8800)
external/com_github_aws_aws_sdk_go/aws/request/handlers.go:195 +0x9d
github.com/aws/aws-sdk-go/aws/request.(*Request).Send(0xc4212e8800, 0x0, 0x0)
external/com_github_aws_aws_sdk_go/aws/request/request.go:490 +0x17f
github.com/aws/aws-sdk-go/service/s3.(*S3).HeadObject(0xc420f60078, 0xc421032150, 0x3, 0xc420d8a4b0, 0x4f)
external/com_github_aws_aws_sdk_go/service/s3/api.go:3345 +0x4d
github.com/grailbio/reflow/repository/s3.(*Repository).Stat(0xc4201bd470, 0xff7de0, 0xc420f04600, 0x5, 0x7e92c80b750d5970, 0xe474d458975503bc, 0xd8be9b47929a3c92, 0x6285615487cbab51, 0x0, 0x0, ...)
repository/s3/s3.go:65 +0x235
github.com/grailbio/reflow/repository.(*Manager).NeedTransfer.func1(0xc400000008, 0xf527c8)
repository/manager.go:210 +0x113
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc421690700, 0xc421096000)
external/org_golang_x_sync/errgroup/errgroup.go:58 +0x57
created by golang.org/x/sync/errgroup.(*Group).Go
external/org_golang_x_sync/errgroup/errgroup.go:55 +0x66
goroutine 46053 [select]:
net/http.(*persistConn).writeLoop(0xc4214605a0)
GOROOT/src/net/http/transport.go:1822 +0x14b
created by net/http.(*Transport).dialConn
GOROOT/src/net/http/transport.go:1238 +0x97f
goroutine 52565 [semacquire]:
sync.runtime_Semacquire(0xc421690714)
GOROOT/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc421690708)
GOROOT/src/sync/waitgroup.go:129 +0x72
golang.org/x/sync/errgroup.(*Group).Wait(0xc421690700, 0xc421096000, 0xc4216903c0)
external/org_golang_x_sync/errgroup/errgroup.go:41 +0x31
github.com/grailbio/reflow/repository.(*Manager).NeedTransfer(0xc420b5f200, 0xff8260, 0xc421a50100, 0xffbe60, 0xc4201bd470, 0xc420d8a3c0, 0x1, 0x1, 0x0, 0x0, ...)
repository/manager.go:220 +0x3f9
github.com/grailbio/reflow/repository.(*Manager).Transfer(0xc420b5f200, 0xff8260, 0xc421a50100, 0xffbe60, 0xc4201bd470, 0xffbda0, 0xc421a50080, 0xc420d8a3c0, 0x1, 0x1, ...)
repository/manager.go:188 +0x8a
github.com/grailbio/reflow.(*Eval).CacheWrite(0xc4202b7200, 0xff8260, 0xc421a50100, 0xc42257d500, 0xffbda0, 0xc421a50080, 0xc423d94240, 0x20002)
eval.go:1138 +0x227
github.com/grailbio/reflow/runner.(*worker).do.func2(0xc42b21a2a0, 0xff7d60, 0xc423df4340, 0xff32a0, 0xc422e4a460, 0xc42257d500)
runner/worker.go:240 +0x10d
created by github.com/grailbio/reflow/runner.(*worker).do
runner/worker.go:239 +0x8bd
goroutine 50993 [select]:
net/http.(*persistConn).writeLoop(0xc42262ed80)
GOROOT/src/net/http/transport.go:1822 +0x14b
created by net/http.(*Transport).dialConn
GOROOT/src/net/http/transport.go:1238 +0x97f
goroutine 52491 [select]:
golang.org/x/net/http2.(*ClientConn).roundTrip(0xc420ead500, 0xc420a4f500, 0x0, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/transport.go:891 +0x844
golang.org/x/net/http2.(*Transport).RoundTripOpt(0xc42028b900, 0xc420a4f500, 0x0, 0xc423b6e8d8, 0x40c780, 0xf1b8e2)
external/org_golang_x_net/http2/transport.go:378 +0x150
golang.org/x/net/http2.(*Transport).RoundTrip(0xc42028b900, 0xc420a4f500, 0xc423b6e910, 0x0, 0xc420bf6780)
external/org_golang_x_net/http2/transport.go:340 +0x3a
golang.org/x/net/http2.noDialH2RoundTripper.RoundTrip(0xc42028b900, 0xc420a4f500, 0xc4208921e0, 0x5, 0xc42086caa8)
external/org_golang_x_net/http2/configure_transport.go:75 +0x39
net/http.(*Transport).RoundTrip(0xc4201d20f0, 0xc420a4f500, 0xc4201d20f0, 0x0, 0x0)
GOROOT/src/net/http/transport.go:380 +0xc36
net/http.send(0xc420a4f500, 0xff09a0, 0xc4201d20f0, 0x0, 0x0, 0x0, 0xc42b12f728, 0xc4210b4780, 0xc423b6ed00, 0x1)
GOROOT/src/net/http/client.go:252 +0x185
net/http.(*Client).send(0xc420cb7e60, 0xc420a4f500, 0x0, 0x0, 0x0, 0xc42b12f728, 0x0, 0x1, 0x78)
GOROOT/src/net/http/client.go:176 +0xfa
net/http.(*Client).Do(0xc420cb7e60, 0xc420a4f500, 0xc420bf6780, 0xc420a4f500, 0x78)
GOROOT/src/net/http/client.go:615 +0x28d
golang.org/x/net/context/ctxhttp.Do(0xff7d60, 0xc420bf6780, 0xc420cb7e60, 0xc420a4f200, 0x0, 0x0, 0xc420a4f200)
external/org_golang_x_net/context/ctxhttp/ctxhttp.go:30 +0x6e
github.com/grailbio/reflow/rest.(*ClientCall).Do(0xc421409450, 0xff7d60, 0xc420bf6780, 0x0, 0x0, 0xc423b6f088, 0x2, 0x2)
rest/client.go:120 +0x19a
github.com/grailbio/reflow/pool/client.(*clientExec).Wait(0xc42099b4a0, 0xff7d60, 0xc420bf6780, 0x0, 0x0)
pool/client/client.go:315 +0x181
github.com/grailbio/reflow.(*Eval).exec(0xc4202b7200, 0xff7d60, 0xc420bf6780, 0xc420fd7180, 0xbea44ec4d6df18f9, 0x950cce693a)
eval.go:1383 +0x3ec
github.com/grailbio/reflow.(*Eval).eval(0xc4202b7200, 0xff7d60, 0xc420bf6780, 0xc420fd7180, 0x0, 0x0)
eval.go:996 +0x956
github.com/grailbio/reflow.(*Eval).Do.func4(0xc4202b7200, 0xc42074e250, 0xc420fd7180)
eval.go:459 +0x4f
created by github.com/grailbio/reflow.(*Eval).Do
eval.go:458 +0x4d5
goroutine 45858 [IO wait, 1 minutes]:
internal/poll.runtime_pollWait(0x7f3dbc72ab90, 0x72, 0xc423841858)
GOROOT/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc422548398, 0x72, 0xffffffffffffff00, 0xff1f20, 0x1542c90)
GOROOT/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc422548398, 0xc420907000, 0x400, 0x400)
GOROOT/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc422548380, 0xc420907000, 0x400, 0x400, 0x0, 0x0, 0x0)
GOROOT/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc422548380, 0xc420907000, 0x400, 0x400, 0x8, 0x8, 0x3f3)
GOROOT/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc420ef8250, 0xc420907000, 0x400, 0x400, 0x0, 0x0, 0x0)
GOROOT/src/net/net.go:176 +0x6a
crypto/tls.(*block).readFromUntil(0xc421575020, 0x7f3db80bb000, 0xc420ef8250, 0x5, 0xc420ef8250, 0x0)
GOROOT/src/crypto/tls/conn.go:493 +0x96
crypto/tls.(*Conn).readRecord(0xc42b15e000, 0xf52717, 0xc42b15e120, 0xc42b02de00)
GOROOT/src/crypto/tls/conn.go:595 +0xe0
crypto/tls.(*Conn).Read(0xc42b15e000, 0xc422e46000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
GOROOT/src/crypto/tls/conn.go:1156 +0x100
bufio.(*Reader).Read(0xc421fe6c00, 0xc420918c78, 0x9, 0x9, 0xc423841c50, 0x470183, 0xc423841c60)
GOROOT/src/bufio/bufio.go:216 +0x238
io.ReadAtLeast(0xfeff60, 0xc421fe6c00, 0xc420918c78, 0x9, 0x9, 0x9, 0xc42acdcb01, 0xc423841d00, 0xb1e57b)
GOROOT/src/io/io.go:309 +0x86
io.ReadFull(0xfeff60, 0xc421fe6c00, 0xc420918c78, 0x9, 0x9, 0xc420064060, 0xc423841d30, 0xb1e48d)
GOROOT/src/io/io.go:327 +0x58
golang.org/x/net/http2.readFrameHeader(0xc420918c78, 0x9, 0x9, 0xfeff60, 0xc421fe6c00, 0x0, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/frame.go:237 +0x7b
golang.org/x/net/http2.(*Framer).ReadFrame(0xc420918c40, 0xc42291a810, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/frame.go:492 +0xa4
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc423841fb8, 0xf51a90, 0xc4200517b8)
external/org_golang_x_net/http2/transport.go:1477 +0x8e
golang.org/x/net/http2.(*ClientConn).readLoop(0xc420dc41c0)
external/org_golang_x_net/http2/transport.go:1405 +0x68
created by golang.org/x/net/http2.(*Transport).newClientConn
external/org_golang_x_net/http2/transport.go:592 +0x65c
goroutine 38017 [select]:
github.com/grailbio/reflow/pool.Keepalive(0xff7d60, 0xc423df4a80, 0xc420cb6cf0, 0xfff2c0, 0xc42228d220, 0xc422bcaba0, 0xb13bf2)
pool/pool.go:148 +0x2c7
github.com/grailbio/reflow/runner.(*Stealer).Go.func1(0xff7d60, 0xc423df4a80, 0xc420a237e0, 0xfff2c0, 0xc42228d220, 0xc4230e3b30)
runner/stealer.go:60 +0x78
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:59 +0x460
goroutine 28019 [select]:
github.com/grailbio/reflow/runner.(*worker).Go(0xc42ac9d1a0, 0xff7e20, 0xc42ac9d260)
runner/worker.go:66 +0x2dd
github.com/grailbio/reflow/runner.(*Stealer).Go.func2(0xff7d60, 0xc42418bc80, 0xc42ac9d1a0, 0xc420a237e0, 0xc422459e90, 0xfff2c0, 0xc42b0c3b80)
runner/stealer.go:69 +0xa7
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:66 +0x4ca
goroutine 46017 [select, 2 minutes]:
github.com/grailbio/reflow/runner.(*worker).Go(0xc422415380, 0xff7e20, 0xc423764630)
runner/worker.go:66 +0x2dd
github.com/grailbio/reflow/runner.(*Stealer).Go.func2(0xff7d60, 0xc420d9ab40, 0xc422415380, 0xc420a237e0, 0xc4209fc7d0, 0xfff2c0, 0xc423338a20)
runner/stealer.go:69 +0xa7
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:66 +0x4ca
goroutine 46052 [IO wait]:
internal/poll.runtime_pollWait(0x7f3dbc733340, 0x72, 0xc420fc0760)
GOROOT/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc421050718, 0x72, 0xffffffffffffff00, 0xff1f20, 0x1542c90)
GOROOT/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc421050718, 0xc422904400, 0x400, 0x400)
GOROOT/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc421050700, 0xc422904400, 0x400, 0x400, 0x0, 0x0, 0x0)
GOROOT/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc421050700, 0xc422904400, 0x400, 0x400, 0xc420fc0898, 0x411f26, 0x7f3dbc70dc98)
GOROOT/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc42b12f6b8, 0xc422904400, 0x400, 0x400, 0x0, 0x0, 0x0)
GOROOT/src/net/net.go:176 +0x6a
crypto/tls.(*block).readFromUntil(0xc422414540, 0x7f3db80bb000, 0xc42b12f6b8, 0x5, 0xc42b12f6b8, 0x5)
GOROOT/src/crypto/tls/conn.go:493 +0x96
crypto/tls.(*Conn).readRecord(0xc42220aa80, 0xf52717, 0xc42220aba0, 0x458500)
GOROOT/src/crypto/tls/conn.go:595 +0xe0
crypto/tls.(*Conn).Read(0xc42220aa80, 0xc421c19000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
GOROOT/src/crypto/tls/conn.go:1156 +0x100
net/http.(*persistConn).Read(0xc4214605a0, 0xc421c19000, 0x1000, 0x1000, 0xc420fc0b98, 0x405585, 0xc4215eeea0)
GOROOT/src/net/http/transport.go:1453 +0x136
bufio.(*Reader).fill(0xc42afaf4a0)
GOROOT/src/bufio/bufio.go:100 +0x11e
bufio.(*Reader).Peek(0xc42afaf4a0, 0x1, 0x0, 0x0, 0x0, 0xc421596540, 0x0)
GOROOT/src/bufio/bufio.go:132 +0x3a
net/http.(*persistConn).readLoop(0xc4214605a0)
GOROOT/src/net/http/transport.go:1601 +0x185
created by net/http.(*Transport).dialConn
GOROOT/src/net/http/transport.go:1237 +0x95a
goroutine 51962 [IO wait]:
internal/poll.runtime_pollWait(0x7f3dbc6c7a38, 0x72, 0xc42391d760)
GOROOT/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4239f0298, 0x72, 0xffffffffffffff00, 0xff1f20, 0x1542c90)
GOROOT/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4239f0298, 0xc423bc2400, 0x400, 0x400)
GOROOT/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc4239f0280, 0xc423bc2400, 0x400, 0x400, 0x0, 0x0, 0x0)
GOROOT/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc4239f0280, 0xc423bc2400, 0x400, 0x400, 0x60b7b0, 0xc4220a3320, 0xc42b1550ce)
GOROOT/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc423ab0020, 0xc423bc2400, 0x400, 0x400, 0x0, 0x0, 0x0)
GOROOT/src/net/net.go:176 +0x6a
crypto/tls.(*block).readFromUntil(0xc423a81e30, 0x7f3db80bb000, 0xc423ab0020, 0x5, 0xc423ab0020, 0x1)
GOROOT/src/crypto/tls/conn.go:493 +0x96
crypto/tls.(*Conn).readRecord(0xc42b00a380, 0xf52717, 0xc42b00a4a0, 0xc42391de58)
GOROOT/src/crypto/tls/conn.go:595 +0xe0
crypto/tls.(*Conn).Read(0xc42b00a380, 0xc42b155000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
GOROOT/src/crypto/tls/conn.go:1156 +0x100
net/http.(*persistConn).Read(0xc42af1e6c0, 0xc42b155000, 0x1000, 0x1000, 0x1534400, 0xc42af1e6c0, 0x0)
GOROOT/src/net/http/transport.go:1453 +0x136
bufio.(*Reader).fill(0xc4220a3320)
GOROOT/src/bufio/bufio.go:100 +0x11e
bufio.(*Reader).Peek(0xc4220a3320, 0x1, 0x0, 0x0, 0x0, 0xc421596360, 0x0)
GOROOT/src/bufio/bufio.go:132 +0x3a
net/http.(*persistConn).readLoop(0xc42af1e6c0)
GOROOT/src/net/http/transport.go:1601 +0x185
created by net/http.(*Transport).dialConn
GOROOT/src/net/http/transport.go:1237 +0x95a
goroutine 7059 [select]:
github.com/grailbio/reflow/pool.Keepalive(0xff7d60, 0xc422682d80, 0xc420cb6cf0, 0xfff2c0, 0xc420b59220, 0xc4204b2f00, 0x0)
pool/pool.go:148 +0x2c7
github.com/grailbio/reflow/runner.(*Stealer).Go.func1(0xff7d60, 0xc422682d80, 0xc420a237e0, 0xfff2c0, 0xc420b59220, 0xc42360bff0)
runner/stealer.go:60 +0x78
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:59 +0x460
goroutine 27757 [IO wait]:
internal/poll.runtime_pollWait(0x7f3db80fd668, 0x72, 0xc423845858)
GOROOT/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4215be098, 0x72, 0xffffffffffffff00, 0xff1f20, 0x1542c90)
GOROOT/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4215be098, 0xc423b84000, 0x2000, 0x2000)
GOROOT/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc4215be080, 0xc423b84000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
GOROOT/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc4215be080, 0xc423b84000, 0x2000, 0x2000, 0x8, 0x8, 0x1ff3)
GOROOT/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc420f60100, 0xc423b84000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
GOROOT/src/net/net.go:176 +0x6a
crypto/tls.(*block).readFromUntil(0xc42ad3e480, 0x7f3db80bb000, 0xc420f60100, 0x5, 0xc420f60100, 0x0)
GOROOT/src/crypto/tls/conn.go:493 +0x96
crypto/tls.(*Conn).readRecord(0xc422374e00, 0xf52717, 0xc422374f20, 0xc423e9a100)
GOROOT/src/crypto/tls/conn.go:595 +0xe0
crypto/tls.(*Conn).Read(0xc422374e00, 0xc421aa7000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
GOROOT/src/crypto/tls/conn.go:1156 +0x100
bufio.(*Reader).Read(0xc421738780, 0xc420b363b8, 0x9, 0x9, 0xc423845c50, 0x470183, 0xc423845c60)
GOROOT/src/bufio/bufio.go:216 +0x238
io.ReadAtLeast(0xfeff60, 0xc421738780, 0xc420b363b8, 0x9, 0x9, 0x9, 0xc420287101, 0xc423845d00, 0xb1e57b)
GOROOT/src/io/io.go:309 +0x86
io.ReadFull(0xfeff60, 0xc421738780, 0xc420b363b8, 0x9, 0x9, 0xc420064060, 0xc423845d30, 0xb1e48d)
GOROOT/src/io/io.go:327 +0x58
golang.org/x/net/http2.readFrameHeader(0xc420b363b8, 0x9, 0x9, 0xfeff60, 0xc421738780, 0x0, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/frame.go:237 +0x7b
golang.org/x/net/http2.(*Framer).ReadFrame(0xc420b36380, 0xc420f662d0, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/frame.go:492 +0xa4
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc423845fb8, 0xf51a90, 0xc420c6b7b8)
external/org_golang_x_net/http2/transport.go:1477 +0x8e
golang.org/x/net/http2.(*ClientConn).readLoop(0xc42107aa80)
external/org_golang_x_net/http2/transport.go:1405 +0x68
created by golang.org/x/net/http2.(*Transport).newClientConn
external/org_golang_x_net/http2/transport.go:592 +0x65c
goroutine 31957 [select]:
github.com/grailbio/reflow/runner.(*worker).Go(0xc423360e10, 0xff7e20, 0xc421dba000)
runner/worker.go:66 +0x2dd
github.com/grailbio/reflow/runner.(*Stealer).Go.func2(0xff7d60, 0xc421b18280, 0xc423360e10, 0xc420a237e0, 0xc424080490, 0xfff2c0, 0xc42b02c480)
runner/stealer.go:69 +0xa7
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:66 +0x4ca
goroutine 52572 [runnable]:
github.com/grailbio/reflow.(*Eval).Do.func4(0xc4202b7200, 0xc42074e250, 0xc420ecee00)
eval.go:458
created by github.com/grailbio/reflow.(*Eval).Do
eval.go:458 +0x4d5
goroutine 38052 [IO wait]:
internal/poll.runtime_pollWait(0x7f3dbc6a4760, 0x72, 0xc423844858)
GOROOT/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc42afa6418, 0x72, 0xffffffffffffff00, 0xff1f20, 0x1542c90)
GOROOT/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc42afa6418, 0xc420f5c000, 0x2000, 0x2000)
GOROOT/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc42afa6400, 0xc420f5c000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
GOROOT/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc42afa6400, 0xc420f5c000, 0x2000, 0x2000, 0xc423844a38, 0xb23792, 0xe80720)
GOROOT/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc420a80bf8, 0xc420f5c000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
GOROOT/src/net/net.go:176 +0x6a
crypto/tls.(*block).readFromUntil(0xc423461f80, 0x7f3db80bb000, 0xc420a80bf8, 0x5, 0xc420a80bf8, 0x100000000f1a8de)
GOROOT/src/crypto/tls/conn.go:493 +0x96
crypto/tls.(*Conn).readRecord(0xc420b69880, 0xf52717, 0xc420b699a0, 0x4)
GOROOT/src/crypto/tls/conn.go:595 +0xe0
crypto/tls.(*Conn).Read(0xc420b69880, 0xc4204d4000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
GOROOT/src/crypto/tls/conn.go:1156 +0x100
bufio.(*Reader).Read(0xc42209b980, 0xc420c82738, 0x9, 0x9, 0x2, 0xc423844c78, 0x405d75)
GOROOT/src/bufio/bufio.go:216 +0x238
io.ReadAtLeast(0xfeff60, 0xc42209b980, 0xc420c82738, 0x9, 0x9, 0x9, 0x413c87, 0xc423844d08, 0xc423844d08)
GOROOT/src/io/io.go:309 +0x86
io.ReadFull(0xfeff60, 0xc42209b980, 0xc420c82738, 0x9, 0x9, 0x4701e3, 0xc423844d18, 0xc423844d48)
GOROOT/src/io/io.go:327 +0x58
golang.org/x/net/http2.readFrameHeader(0xc420c82738, 0x9, 0x9, 0xfeff60, 0xc42209b980, 0x0, 0x100000000, 0xc422142280, 0xc423844e00)
external/org_golang_x_net/http2/frame.go:237 +0x7b
golang.org/x/net/http2.(*Framer).ReadFrame(0xc420c82700, 0xc42237f200, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/frame.go:492 +0xa4
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc423844fb8, 0xf51a90, 0xc4220637b8)
external/org_golang_x_net/http2/transport.go:1477 +0x8e
golang.org/x/net/http2.(*ClientConn).readLoop(0xc420dc4700)
external/org_golang_x_net/http2/transport.go:1405 +0x68
created by golang.org/x/net/http2.(*Transport).newClientConn
external/org_golang_x_net/http2/transport.go:592 +0x65c
goroutine 30009 [select]:
github.com/grailbio/reflow/runner.(*worker).Go(0xc420665230, 0xff7e20, 0xc422e2a000)
runner/worker.go:66 +0x2dd
github.com/grailbio/reflow/runner.(*Stealer).Go.func2(0xff7d60, 0xc422059180, 0xc420665230, 0xc420a237e0, 0xc4236f0520, 0xfff2c0, 0xc424177f40)
runner/stealer.go:69 +0xa7
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:66 +0x4ca
goroutine 30008 [select]:
github.com/grailbio/reflow/pool.Keepalive(0xff7d60, 0xc422059180, 0xc420cb6cf0, 0xfff2c0, 0xc424177f40, 0x86088f, 0x1)
pool/pool.go:148 +0x2c7
github.com/grailbio/reflow/runner.(*Stealer).Go.func1(0xff7d60, 0xc422059180, 0xc420a237e0, 0xfff2c0, 0xc424177f40, 0xc4236f0520)
runner/stealer.go:60 +0x78
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:59 +0x460
goroutine 34442 [select]:
github.com/grailbio/reflow/runner.(*worker).Go(0xc42b21a2a0, 0xff7e20, 0xc4212945a0)
runner/worker.go:66 +0x2dd
github.com/grailbio/reflow/runner.(*Stealer).Go.func2(0xff7d60, 0xc4220580c0, 0xc42b21a2a0, 0xc420a237e0, 0xc420986dd0, 0xfff2c0, 0xc422a0c280)
runner/stealer.go:69 +0xa7
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:66 +0x4ca
goroutine 28018 [select]:
github.com/grailbio/reflow/pool.Keepalive(0xff7d60, 0xc42418bc80, 0xc420cb6cf0, 0xfff2c0, 0xc42b0c3b80, 0xc420c687b0, 0xc420c68730)
pool/pool.go:148 +0x2c7
github.com/grailbio/reflow/runner.(*Stealer).Go.func1(0xff7d60, 0xc42418bc80, 0xc420a237e0, 0xfff2c0, 0xc42b0c3b80, 0xc422459e90)
runner/stealer.go:60 +0x78
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:59 +0x460
goroutine 7060 [select, 2 minutes]:
github.com/grailbio/reflow/runner.(*worker).Go(0xc423658d50, 0xff7e20, 0xc423658d80)
runner/worker.go:66 +0x2dd
github.com/grailbio/reflow/runner.(*Stealer).Go.func2(0xff7d60, 0xc422682d80, 0xc423658d50, 0xc420a237e0, 0xc42360bff0, 0xfff2c0, 0xc420b59220)
runner/stealer.go:69 +0xa7
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:66 +0x4ca
goroutine 34441 [select]:
github.com/grailbio/reflow/pool.Keepalive(0xff7d60, 0xc4220580c0, 0xc420cb6cf0, 0xfff2c0, 0xc422a0c280, 0x8bdc0cce503e8289, 0x2cc043410d29043b)
pool/pool.go:148 +0x2c7
github.com/grailbio/reflow/runner.(*Stealer).Go.func1(0xff7d60, 0xc4220580c0, 0xc420a237e0, 0xfff2c0, 0xc422a0c280, 0xc420986dd0)
runner/stealer.go:60 +0x78
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:59 +0x460
goroutine 10128 [select, 2 minutes]:
github.com/grailbio/reflow/runner.(*worker).Go(0xc422455ce0, 0xff7e20, 0xc422455d10)
runner/worker.go:66 +0x2dd
github.com/grailbio/reflow/runner.(*Stealer).Go.func2(0xff7d60, 0xc422058a00, 0xc422455ce0, 0xc420a237e0, 0xc420c7b870, 0xfff2c0, 0xc4207c5560)
runner/stealer.go:69 +0xa7
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:66 +0x4ca
goroutine 52569 [runnable]:
github.com/grailbio/reflow/values.Map.Insert(0xc420f66ab0, 0x5, 0xa48e782fdfd7ff0b, 0x2d1572f43c26a589, 0x9787c9358c8abbfd, 0xb6c3d22e4a0324f1, 0x0, 0x0, 0x0, 0x0, ...)
values/values.go:85 +0x468
github.com/grailbio/reflow/syntax.Force.func4()
syntax/force.go:83 +0x13d
github.com/grailbio/reflow/syntax.(*resolver).Resolve.func1(0xc420455300, 0x120, 0x120, 0xdae440)
syntax/force.go:197 +0xc6
github.com/grailbio/reflow.(*Flow).canonicalize.func2(0xc420455300, 0x120, 0x120, 0xc420455300)
flow.go:839 +0x57
github.com/grailbio/reflow.(*Eval).eval(0xc4202b7200, 0xff7d60, 0xc420bf6780, 0xc420945c00, 0x0, 0x0)
eval.go:1067 +0x14e7
github.com/grailbio/reflow.(*Eval).Do.func4(0xc4202b7200, 0xc42074e250, 0xc420945c00)
eval.go:459 +0x4f
created by github.com/grailbio/reflow.(*Eval).Do
eval.go:458 +0x4d5
goroutine 31956 [select, 1 minutes]:
github.com/grailbio/reflow/pool.Keepalive(0xff7d60, 0xc421b18280, 0xc420cb6cf0, 0xfff2c0, 0xc42b02c480, 0xc4204b6f78, 0x42b9c9)
pool/pool.go:148 +0x2c7
github.com/grailbio/reflow/runner.(*Stealer).Go.func1(0xff7d60, 0xc421b18280, 0xc420a237e0, 0xfff2c0, 0xc42b02c480, 0xc424080490)
runner/stealer.go:60 +0x78
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:59 +0x460
goroutine 50992 [IO wait]:
internal/poll.runtime_pollWait(0x7f3dbc6c9e58, 0x72, 0xc4217df760)
GOROOT/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc420fa3a18, 0x72, 0xffffffffffffff00, 0xff1f20, 0x1542c90)
GOROOT/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc420fa3a18, 0xc420f3cc00, 0x400, 0x400)
GOROOT/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc420fa3a00, 0xc420f3cc00, 0x400, 0x400, 0x0, 0x0, 0x0)
GOROOT/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc420fa3a00, 0xc420f3cc00, 0x400, 0x400, 0x2, 0xc4209ebe10, 0x1)
GOROOT/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc423540258, 0xc420f3cc00, 0x400, 0x400, 0x0, 0x0, 0x0)
GOROOT/src/net/net.go:176 +0x6a
crypto/tls.(*block).readFromUntil(0xc4217b6690, 0x7f3db80bb000, 0xc423540258, 0x5, 0xc423540258, 0x2)
GOROOT/src/crypto/tls/conn.go:493 +0x96
crypto/tls.(*Conn).readRecord(0xc42306a000, 0xf52717, 0xc42306a120, 0x458500)
GOROOT/src/crypto/tls/conn.go:595 +0xe0
crypto/tls.(*Conn).Read(0xc42306a000, 0xc42ae6a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
GOROOT/src/crypto/tls/conn.go:1156 +0x100
net/http.(*persistConn).Read(0xc42262ed80, 0xc42ae6a000, 0x1000, 0x1000, 0xc4217dfb98, 0x405585, 0xc4207705a0)
GOROOT/src/net/http/transport.go:1453 +0x136
bufio.(*Reader).fill(0xc42277ca80)
GOROOT/src/bufio/bufio.go:100 +0x11e
bufio.(*Reader).Peek(0xc42277ca80, 0x1, 0x0, 0x0, 0x0, 0xc42ab80060, 0x0)
GOROOT/src/bufio/bufio.go:132 +0x3a
net/http.(*persistConn).readLoop(0xc42262ed80)
GOROOT/src/net/http/transport.go:1601 +0x185
created by net/http.(*Transport).dialConn
GOROOT/src/net/http/transport.go:1237 +0x95a
goroutine 34426 [IO wait]:
internal/poll.runtime_pollWait(0x7f3dbc6a4900, 0x72, 0xc4217e0858)
GOROOT/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4239f0d18, 0x72, 0xffffffffffffff00, 0xff1f20, 0x1542c90)
GOROOT/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4239f0d18, 0xc4223fa000, 0x2000, 0x2000)
GOROOT/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc4239f0d00, 0xc4223fa000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
GOROOT/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc4239f0d00, 0xc4223fa000, 0x2000, 0x2000, 0x20, 0x1, 0x7f3dbc62a640)
GOROOT/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc423ab0660, 0xc4223fa000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
GOROOT/src/net/net.go:176 +0x6a
crypto/tls.(*block).readFromUntil(0xc4222340f0, 0x7f3db80bb000, 0xc423ab0660, 0x5, 0xc423ab0660, 0x0)
GOROOT/src/crypto/tls/conn.go:493 +0x96
crypto/tls.(*Conn).readRecord(0xc423d96700, 0xf52717, 0xc423d96820, 0xc420cdb320)
GOROOT/src/crypto/tls/conn.go:595 +0xe0
crypto/tls.(*Conn).Read(0xc423d96700, 0xc4237ec000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
GOROOT/src/crypto/tls/conn.go:1156 +0x100
bufio.(*Reader).Read(0xc420df2780, 0xc42194a498, 0x9, 0x9, 0xc4217e0c50, 0x470183, 0xc4217e0c60)
GOROOT/src/bufio/bufio.go:216 +0x238
io.ReadAtLeast(0xfeff60, 0xc420df2780, 0xc42194a498, 0x9, 0x9, 0x9, 0xc420b4ca01, 0xc4217e0d00, 0xb1e57b)
GOROOT/src/io/io.go:309 +0x86
io.ReadFull(0xfeff60, 0xc420df2780, 0xc42194a498, 0x9, 0x9, 0xc420064060, 0xc4217e0d30, 0xb1e48d)
GOROOT/src/io/io.go:327 +0x58
golang.org/x/net/http2.readFrameHeader(0xc42194a498, 0x9, 0x9, 0xfeff60, 0xc420df2780, 0x0, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/frame.go:237 +0x7b
golang.org/x/net/http2.(*Framer).ReadFrame(0xc42194a460, 0xc42422d5c0, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/frame.go:492 +0xa4
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc4217e0fb8, 0xf51a90, 0xc4228d37b8)
external/org_golang_x_net/http2/transport.go:1477 +0x8e
golang.org/x/net/http2.(*ClientConn).readLoop(0xc42107afc0)
external/org_golang_x_net/http2/transport.go:1405 +0x68
created by golang.org/x/net/http2.(*Transport).newClientConn
external/org_golang_x_net/http2/transport.go:592 +0x65c
goroutine 52454 [select]:
context.propagateCancel.func1(0xff8260, 0xc421a50100, 0xff2760, 0xc4216903c0)
GOROOT/src/context/context.go:260 +0xff
created by context.propagateCancel
GOROOT/src/context/context.go:259 +0x1a7
goroutine 38082 [select]:
github.com/grailbio/reflow/runner.(*worker).Go(0xc421efdc50, 0xff7e20, 0xc420851650)
runner/worker.go:66 +0x2dd
github.com/grailbio/reflow/runner.(*Stealer).Go.func2(0xff7d60, 0xc423df4a80, 0xc421efdc50, 0xc420a237e0, 0xc4230e3b30, 0xfff2c0, 0xc42228d220)
runner/stealer.go:69 +0xa7
created by github.com/grailbio/reflow/runner.(*Stealer).Go
runner/stealer.go:66 +0x4ca
goroutine 29935 [IO wait]:
internal/poll.runtime_pollWait(0x7f3dbc733680, 0x72, 0xc423846858)
GOROOT/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc423e23218, 0x72, 0xffffffffffffff00, 0xff1f20, 0x1542c90)
GOROOT/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc423e23218, 0xc422568000, 0x2000, 0x2000)
GOROOT/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc423e23200, 0xc422568000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
GOROOT/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc423e23200, 0xc422568000, 0x2000, 0x2000, 0xc423846a38, 0xb23792, 0xe80720)
GOROOT/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc420ef8570, 0xc422568000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
GOROOT/src/net/net.go:176 +0x6a
crypto/tls.(*block).readFromUntil(0xc423759f80, 0x7f3db80bb000, 0xc420ef8570, 0x5, 0xc420ef8570, 0x100000000f1a8de)
GOROOT/src/crypto/tls/conn.go:493 +0x96
crypto/tls.(*Conn).readRecord(0xc422375500, 0xf52717, 0xc422375620, 0x4)
GOROOT/src/crypto/tls/conn.go:595 +0xe0
crypto/tls.(*Conn).Read(0xc422375500, 0xc420baa000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
GOROOT/src/crypto/tls/conn.go:1156 +0x100
bufio.(*Reader).Read(0xc4225243c0, 0xc420de82d8, 0x9, 0x9, 0x2, 0xc423846c78, 0x405d75)
GOROOT/src/bufio/bufio.go:216 +0x238
io.ReadAtLeast(0xfeff60, 0xc4225243c0, 0xc420de82d8, 0x9, 0x9, 0x9, 0x413c87, 0xc423846d08, 0xc423846d08)
GOROOT/src/io/io.go:309 +0x86
io.ReadFull(0xfeff60, 0xc4225243c0, 0xc420de82d8, 0x9, 0x9, 0x4701e3, 0xc423846d18, 0xc423846d48)
GOROOT/src/io/io.go:327 +0x58
golang.org/x/net/http2.readFrameHeader(0xc420de82d8, 0x9, 0x9, 0xfeff60, 0xc4225243c0, 0x0, 0x100000000, 0xc422975e00, 0xc423846e00)
external/org_golang_x_net/http2/frame.go:237 +0x7b
golang.org/x/net/http2.(*Framer).ReadFrame(0xc420de82a0, 0xc42b1fa450, 0x0, 0x0, 0x0)
external/org_golang_x_net/http2/frame.go:492 +0xa4
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc423846fb8, 0xf51a90, 0xc4220b57b8)
external/org_golang_x_net/http2/transport.go:1477 +0x8e
golang.org/x/net/http2.(*ClientConn).readLoop(0xc420dc4000)
external/org_golang_x_net/http2/transport.go:1405 +0x68
created by golang.org/x/net/http2.(*Transport).newClientConn
external/org_golang_x_net/http2/transport.go:592 +0x65c
goroutine 52571 [runnable]:
github.com/grailbio/reflow/syntax.Force.func4()
syntax/force.go:81 +0x185
github.com/grailbio/reflow/syntax.(*resolver).Resolve.func1(0xc423a06000, 0x120, 0x120, 0xdae440)
syntax/force.go:197 +0xc6
github.com/grailbio/reflow.(*Flow).canonicalize.func2(0xc423a06000, 0x120, 0x120, 0xc423a06000)
flow.go:839 +0x57
github.com/grailbio/reflow.(*Eval).eval(0xc4202b7200, 0xff7d60, 0xc420bf6780, 0xc420fcaa80, 0x0, 0x0)
eval.go:1067 +0x14e7
github.com/grailbio/reflow.(*Eval).Do.func4(0xc4202b7200, 0xc42074e250, 0xc420fcaa80)
eval.go:459 +0x4f
created by github.com/grailbio/reflow.(*Eval).Do
eval.go:458 +0x4d5
I don't mind sharing the context at all, it is for an article after all, but I don't think it is relevant without the whole code. In terms of the mechanism, the problematic part is a map-reduce with around 400 items: I do the same operation around 400 times in parallel and collect that into a list that I turn into a directory for reduction. Basically, the problematic part is in the spirit of (and I've tried to explicit all the actual types used):
val dirs = make("$/dirs")
func job(pop_file file, pop_name string, spec file) file = exec(...) # 1GiB, 1 CPU
func reduce(pop, comb dir) file = exec(...) # 5GiB, 1 CPU
val populations [string, file] = ... // size: 4, around 600 MiB
val populations__dir dir = dirs.Make(populations)
val specs dir = ... // size: around 100 files, 100 KiB
val Main file = {
val comb [string, file] = [(pop_name + "/" + spec_id, job(pop_file, pop_name, spec_file)) |
(pop_name, pop_file) <- populations,
(spec_id, spec_file) <- map(specs)]
reduce(populations__dir, dirs.Make(map(comb)))
}
Thank you again for your time.
Thanks @hchauvin for all the details! I think I've been able to induce a condition that results in this, when the evaluator has a lot of CPUs. Working on it now:
/*
val dirs = make("$/dirs")
func job(pop_file file, pop_name string, spec file) file = exec(...) # 1GiB, 1 CPU
func reduce(pop, comb dir) file = exec(...) # 5GiB, 1 CPU
val populations [string, file] = ... // size: 4, around 600 MiB
val populations__dir dir = dirs.Make(populations)
val specs dir = ... // size: around 100 files, 100 KiB
val Main file = {
val comb [string, file] = [(pop_name + "/" + spec_id, job(pop_file, pop_name, spec_file)) |
(pop_name, pop_file) <- populations,
(spec_id, spec_file) <- map(specs)]
reduce(populations__dir, dirs.Make(map(comb)))
}
*/
val strings = make("$/strings")
val N = 1000
func makeSpecs(n int) = map([delay((key, strings.FromInt(key))) | key <- range(0, n)])
val specs = makeSpecs(N)
@requires(cpu := 1)
val Main = [makeSpecs(400) | (key, value) <- delay(specs)]
I believe I understand the issue now, and the fix is easy.
The problem is when there's multiple levels of continuations (OpK), and the outer K resolves to the same inner K, then the inner K can be called multiple times. This is usually not a problem, but in the case of maps, it tries to re-resolve them, which causes concurrent reads and writes.
We could attempt to de-dup the call to the inner K, but this doesn't save much in reality and would introduce some additional complexity. Instead, we'll fix the implementation of the map force continuation to do the right thing.
Fix should be coming shortly!
Thanks so much for your help.
fyi, the fix is really simple:
diff --git a/go/src/github.com/grailbio/reflow/syntax/force.go b/go/src/github.com/grailbio/reflow/syntax/force.go
index fcf1f609f3..249edb010c 100644
--- a/go/src/github.com/grailbio/reflow/syntax/force.go
+++ b/go/src/github.com/grailbio/reflow/syntax/force.go
@@ -7,6 +7,7 @@ package syntax
import (
"log"
"sort"
+ "sync"
"github.com/grailbio/base/digest"
"github.com/grailbio/reflow"
@@ -184,18 +185,26 @@ func (r *resolver) Resolve(proc func()) values.T {
}
return r.v
}
+ var once sync.Once
writeN(r.dw, int(r.t.Kind))
return &reflow.Flow{
Op: reflow.OpK,
Deps: r.deps,
FlowDigest: r.dw.Digest(),
K: func(vs []values.T) *reflow.Flow {
- for i := range vs {
- *r.vps[i] = vs[i]
- }
- if proc != nil {
- proc()
- }
+ // Initialize the underlying datastructure only once.
+ // This makes it safe to invoke the K multiple times.
+ // Since the underlying computation is deterministic,
+ // we are free to ignore the vs from subsequent
+ // computations.
+ once.Do(func() {
+ for i := range vs {
+ *r.vps[i] = vs[i]
+ }
+ if proc != nil {
+ proc()
+ }
+ })
return flow(r.v, r.t)
},
}
It's currently pending code review at GRAIL; will show up here in short order.
Amazing! I can confirm it now works with the conditions listed above.
Thanks for confirming. I'll probably make a maintenance release today.
Running on HEAD (with go 1.10), I have a race condition that makes reflow panic:
Unfortunately I did not compile reflow for this run with '-race' to have more info on the race condition, but when I do, I find such race conditions in other parts of the code. There is no panic here, as no map is involved. For instance:
In either case, it might be because there is no lock on the
reflow.Flow
structure, but I might be mistaken.