NeoTeo closed this issue 7 years ago
@NeoTeo this happens every time you restart the daemon?
Did you build from source? Or use a downloaded binary from dists?
Hey @whyrusleeping :) It happens every time I restart. The binary is downloaded from dists.
@NeoTeo what version of OSX are you running? I can't reproduce this issue...
Did you start with a new repo, or is this an older repo that has a bunch of stuff in it still?
It first happened on El Capitan 10.11.6 (15G31), so I tried it on a different machine running Sierra 10.12 beta (16A270f). On Sierra I also deleted the .ipfs and did an `ipfs init`. Just tried the El Cap version (which has an existing v4 repo) again and now it doesn't crash but alternates between hanging and returning `Error: merkledag: not found`. I now worry it's just my machine having a fit, but it worked fine on 0.4.2 :/
@NeoTeo alright, i'll look into it more. Do you have a repeatable way of reproducing the issue? Or is it just more or less random?
On Sierra the crash is repeatable, on El Cap the alternating behaviour is consistent (and never actually cats the hash). I have just tried doing a
http://localhost:8080/ipfs/QmPZ9gcCEpqKTo6aq61g2nXGUhM4iCL3ewB6LDXZCtioEB
from Safari on the El Cap machine and it comes up immediately so the node's able to get the hash ok.
Let me know if you need me to try something.
@NeoTeo This is really bizarre, it feels like a bug in go to me.
@NeoTeo If it's not too much trouble, on the system where the crash is reproducible, could you try installing the latest release candidate of go (1.7) and try building ipfs from source? I'm thinking this might be an issue caused by changes in OSX syscalls.
ok gonna try that.
Good hunch! Not crashing on Sierra now. Gonna try on El Cap to see if it solves the hanging issue there as well.
If that fixes it, then i'll just have to make sure the release binaries for 0.4.3 get built with go1.7 (which i was hoping to do anyways)
Alas, El Cap still gives the error: merkledag: not found.
The Sierra binary wasn't rebuilt. Upgrading go was enough.
Aaand, it crashed after successfully cat'ing the file. I'll try a build from source.
Trouble with `make install`. I overrode check_go_version to accept the "1.7rc5" version number, but gx doesn't seem to like it either, as its strconv.ParseInt chokes on the "rc" part...
@NeoTeo just do:

```
go get -u github.com/whyrusleeping/gx
go get -u github.com/whyrusleeping/gx-go
gx install --global
go install ./cmd/ipfs
```
I need to fix the go version parsing code there; it's been biting me too.
Hm... odd. The first time it gave me this:

```
teo@Vger-6 ~> ipfs cat QmPZ9gcCEpqKTo6aq61g2nXGUhM4iCL3ewB6LDXZCtioEB
fish: 'ipfs cat QmPZ9gcCEpqKTo6aq61g2n…' terminated by signal SIGSEGV (Address boundary error)
```
Ran it again straight after and it worked just fine. The node's been rock solid since first run. And any subsequent cat'ing works as expected \o/
Is it worth trying this on the El Cap version too?
Just built 0.4.3-rc2 from source on El Cap with go 1.7rc5 but unfortunately it still hangs on cat. Different problem? At least the Sierra problem seems fixed :)
Hm, spoke a bit too soon. An `ipfs add` of a tiny text file causes the following crash:
https://gist.github.com/NeoTeo/fbaf0357ef5cf7ae1576f928d438d3e0
Anytime you see 'unexpected fault address', it's a go compiler bug 99% of the time.
You've rebuilt go-ipfs with go1.7-rc5 (there's an rc6 now with a bunch of OSX Sierra changes) and have restarted the daemon?
Yep. I'm updating Sierra right now to b5, so I'll try go1.7-rc6 just after.
Looking good so far. Using rc6 the daemon has survived both an add and a cat.
Same setup on El Capitan still gives
Error: merkledag: not found
/me gently sobs
@NeoTeo on El Capitan, can you try with a freshly inited repo? It's very odd that it's still failing.
I have run an `ipfs repo verify` (there's a lot there I'd be sad to lose). Can I safely squirrel away the .ipfs directory, init, and then hope to somehow restore the repo after?
@NeoTeo yeah, you can just move the directory somewhere safe and move it back once we figure out the problem
:/ Seems it was the repo somehow. Weird it'd pass the repo verify though.
@NeoTeo that's no good! Is it just catting the one object that fails? Or do other things break?
Can you add a new file and cat it?
@whyrusleeping with the fresh init it adds and cats as it should. Moving the old repo back it hangs (or returns with the merkledag error) when cat'ing anything.
@NeoTeo on the old repo, when it hangs, can you give me the output of pressing Ctrl + \?
@whyrusleeping
```
^\SIGQUIT: quit
PC=0x6cc43 m=2

goroutine 0 [idle]:
runtime.mach_semaphore_timedwait(0x3c00002303, 0x0, 0x700000080cf4, 0x3c, 0x0, 0xc4200009c0, 0x700000080d28, 0x62cc3, 0xdf8475800, 0x0, ...) /usr/local/go/src/runtime/sys_darwin_amd64.s:428 +0x13
runtime.semasleep1(0xdf8475800, 0x0) /usr/local/go/src/runtime/os_darwin.go:424 +0xda
runtime.semasleep.func1() /usr/local/go/src/runtime/os_darwin.go:451 +0x33
runtime.systemstack(0x700000080d50) /usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.semasleep(0xdf8475800, 0x0) /usr/local/go/src/runtime/os_darwin.go:452 +0x44
runtime.notetsleep_internal(0xdbf598, 0xdf8475800, 0xc4200009c0, 0x146943a38ecb61c7, 0xc4200009c0) /usr/local/go/src/runtime/lock_sema.go:198 +0x92
runtime.notetsleep(0xdbf598, 0xdf8475800, 0x146943946c6fea01) /usr/local/go/src/runtime/lock_sema.go:246 +0x75
runtime.sysmon() /usr/local/go/src/runtime/proc.go:3618 +0x14c
runtime.mstart1() /usr/local/go/src/runtime/proc.go:1126 +0x11e
runtime.mstart() /usr/local/go/src/runtime/proc.go:1096 +0x64

goroutine 1 [select]:
net/http.(*persistConn).roundTrip(0xc420073900, 0xc4202931c0, 0x0, 0x0, 0x0) /usr/local/go/src/net/http/transport.go:1835 +0x93b
net/http.(*Transport).RoundTrip(0xc4200160f0, 0xc4201542d0, 0xc4200160f0, 0x0, 0xc400000000) /usr/local/go/src/net/http/transport.go:380 +0x4ee
net/http.send(0xc4201542d0, 0xcdb120, 0xc4200160f0, 0x0, 0x0, 0x0, 0x8, 0xc4201659e0, 0xc420128150) /usr/local/go/src/net/http/client.go:256 +0x15f
net/http.(*Client).send(0xdbe4c0, 0xc4201542d0, 0x0, 0x0, 0x0, 0xc420128150, 0x0, 0x1) /usr/local/go/src/net/http/client.go:146 +0x102
net/http.(*Client).doFollowingRedirects(0xdbe4c0, 0xc4201542d0, 0x93fb20, 0x4, 0xfb601, 0xc420273f80) /usr/local/go/src/net/http/client.go:528 +0x5e5
net/http.(*Client).Do(0xdbe4c0, 0xc4201542d0, 0xc42026f740, 0x8c9e6d, 0x13) /usr/local/go/src/net/http/client.go:187 +0x11d
github.com/ipfs/go-ipfs/commands/http.(*client).Send(0xc420293080, 0xcebc60, 0xc42011e180, 0x1, 0xc420233f40, 0xdbf3c0, 0x0) /Users/teo/source/go/src/github.com/ipfs/go-ipfs/commands/http/client.go:100 +0x639
main.callCommand(0xce5be0, 0xc420019000, 0xcebc60, 0xc42011e180, 0xdbf3c0, 0xd9c520, 0x0, 0x0, 0xc420165e60, 0x8206) /Users/teo/source/go/src/github.com/ipfs/go-ipfs/cmd/ipfs/main.go:329 +0x666
main.(*cmdInvocation).Run(0xc420018f40, 0xce5be0, 0xc420019000, 0xcdac60, 0xc4200d8e00, 0xce5be0, 0xc420019000) /Users/teo/source/go/src/github.com/ipfs/go-ipfs/cmd/ipfs/main.go:189 +0x116
main.main() /Users/teo/source/go/src/github.com/ipfs/go-ipfs/cmd/ipfs/main.go:154 +0x349

goroutine 17 [syscall, locked to thread]:
runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1

goroutine 18 [chan receive]:
gx/ipfs/QmV3NSS3A1kX5s28r7yLczhDsXzkgo65cqRgKFXYunWZmD/metrics.init.1.func2() /Users/teo/source/go/src/gx/ipfs/QmV3NSS3A1kX5s28r7yLczhDsXzkgo65cqRgKFXYunWZmD/metrics/metrics.go:321 +0x88
created by gx/ipfs/QmV3NSS3A1kX5s28r7yLczhDsXzkgo65cqRgKFXYunWZmD/metrics.init.1 /Users/teo/source/go/src/gx/ipfs/QmV3NSS3A1kX5s28r7yLczhDsXzkgo65cqRgKFXYunWZmD/metrics/metrics.go:328 +0x66

goroutine 19 [syscall]:
os/signal.signal_recv(0x0) /usr/local/go/src/runtime/sigqueue.go:116 +0x157
os/signal.loop() /usr/local/go/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.1 /usr/local/go/src/os/signal/signal_unix.go:28 +0x41

goroutine 20 [select]:
gx/ipfs/QmNQynaz7qfriSUJkiEZUrm2Wen1u3Kj9goZzWtrPyu7XR/go-log.(*MirrorWriter).logRoutine(0xc42012c340) /Users/teo/source/go/src/gx/ipfs/QmNQynaz7qfriSUJkiEZUrm2Wen1u3Kj9goZzWtrPyu7XR/go-log/writer.go:71 +0x32e
created by gx/ipfs/QmNQynaz7qfriSUJkiEZUrm2Wen1u3Kj9goZzWtrPyu7XR/go-log.NewMirrorWriter /Users/teo/source/go/src/gx/ipfs/QmNQynaz7qfriSUJkiEZUrm2Wen1u3Kj9goZzWtrPyu7XR/go-log/writer.go:38 +0xf1

goroutine 10 [select, locked to thread]:
runtime.gopark(0x93ffe0, 0x0, 0x8bce81, 0x6, 0x18, 0x2) /usr/local/go/src/runtime/proc.go:259 +0x13a
runtime.selectgoImpl(0xc42002b730, 0x0, 0x18) /usr/local/go/src/runtime/select.go:423 +0x11d9
runtime.selectgo(0xc42002b730) /usr/local/go/src/runtime/select.go:238 +0x1c
runtime.ensureSigM.func1() /usr/local/go/src/runtime/signal1_unix.go:304 +0x2d1
runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1

goroutine 23 [chan receive]:
main.(*IntrHandler).Handle.func1(0xc4200d8e00, 0xc4200138a0) /Users/teo/source/go/src/github.com/ipfs/go-ipfs/cmd/ipfs/main.go:542 +0x96
created by main.(*IntrHandler).Handle /Users/teo/source/go/src/github.com/ipfs/go-ipfs/cmd/ipfs/main.go:547 +0x97

goroutine 27 [IO wait]:
net.runtime_pollWait(0x1369488, 0x72, 0x6) /usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc42027b720, 0x72, 0xc4203629d0, 0xc4200121d0) /usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc42027b720, 0xcdd9a0, 0xc4200121d0) /usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc42027b6c0, 0xc42017f000, 0x1000, 0x1000, 0x0, 0xcdd9a0, 0xc4200121d0) /usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420128158, 0xc42017f000, 0x1000, 0x1000, 0x0, 0x0, 0x0) /usr/local/go/src/net/net.go:173 +0x70
net/http.(*persistConn).Read(0xc420073900, 0xc42017f000, 0x1000, 0x1000, 0x11d1a3, 0xc42002d378, 0xc42002d388) /usr/local/go/src/net/http/transport.go:1256 +0x154
bufio.(*Reader).fill(0xc420131c80) /usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Peek(0xc420131c80, 0x1, 0xc420340060, 0xc42002d460, 0xc42002d3c8, 0x1efd8, 0x50) /usr/local/go/src/bufio/bufio.go:129 +0x62
net/http.(*persistConn).readLoop(0xc420073900) /usr/local/go/src/net/http/transport.go:1413 +0x1a1
created by net/http.(*Transport).dialConn /usr/local/go/src/net/http/transport.go:1057 +0x4e9

goroutine 28 [select]:
net/http.(*persistConn).writeLoop(0xc420073900) /usr/local/go/src/net/http/transport.go:1641 +0x3bd
created by net/http.(*Transport).dialConn /usr/local/go/src/net/http/transport.go:1058 +0x50e

rax 0xe rbx 0x0 rcx 0x700000080cc8 rdx 0x0 rdi 0x2303 rsi 0x3c rbp 0x700000080d00 rsp 0x700000080cc8 r8 0xffffffffffffffff r9 0x34 r10 0x75731cbc r11 0x202 r12 0x1443f2c804aaa r13 0x144c2f4554db0 r14 0x1469439556e1bc00 r15 0x711b10 rip 0x6cc43 rflags 0x202 cs 0x7 fs 0x0 gs 0x0
```
@NeoTeo could I get the output from the daemon while that command is hanging? (press Ctrl + \ on the daemon while the client command is hung)
@whyrusleeping That's what that was. I did an `ipfs add acme.dump` then hit Ctrl + \.
No, wait. Maybe I misunderstood. On the daemon. Coming up...
@whyrusleeping Wow, that was a big one:
https://gist.github.com/NeoTeo/752acc221e42889c3c3a8a9fc58b1585
@NeoTeo that is really strange... can you try deleting the following from line 45 of core/commands/cat.go and recompiling?

```go
if err := corerepo.ConditionalGC(req.Context(), node, length); err != nil {
	res.SetError(err, cmds.ErrNormal)
	return
}
```
For some reason your stack is telling me it's hanging while running a garbage collection... but not actually getting to the GC part, just trying to take the lock?
Yep, that works. But presumably you put that code there for a reason... some other part locking the GC?
...and, since a freshly inited repo doesn't hang, it's somehow related to dealing with a repo that has been migrated from the pre-0.4 days and yet passes the repo verify.
@NeoTeo how large is this repo? (the one that you're having troubles with)
I think this might be related to another issue where checking if we need a GC is too expensive
@whyrusleeping ~12 GB
@NeoTeo in your .ipfs config, can you change Datastore.StorageMax to something like 1000GB, and then try catting something with the normal binary (not the build with the GC check removed)?
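The field lives under `Datastore` in the repo's config file (`~/.ipfs/config`); the excerpt below is a sketch of the relevant fragment with other keys omitted:

```json
{
  "Datastore": {
    "StorageMax": "1000GB"
  }
}
```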
Yep, that's it! So a warning or a friendly failure message to tell you you've exceeded your max would do it. Nice investigative work though @whyrusleeping 👍 :)
@whyrusleeping it might be the same issue. I had the default GC setting in my config; we might want to be more explicit about running GC because the size limit was hit.
Also, the softGC shouldn't lock if there is a real GC already running.
We should not have that 'feature' enabled by default. If you want to set a size limit, you should have to do it manually
@NeoTeo gonna go ahead and close this as i think we solved all the problems :)
Please open a new issue if anything else comes up
Version/Platform/Processor information (from `ipfs version --all`):

```
go-ipfs version: 0.4.3-rc2-
Repo version: 4
System version: amd64/darwin
Golang version: go1.6.2
```

Type (bug, feature, meta, test failure, question): bug
Area (api, commands, daemon, fuse, etc): daemon
Priority (from P0: functioning, to P4: operations on fire): P3: frequent crashing
Description: I can reproduce the following crash by starting the daemon and then calling
`ipfs cat QmPZ9gcCEpqKTo6aq61g2nXGUhM4iCL3ewB6LDXZCtioEB`
The daemon crashes out with the output included below. Subsequent runs of the daemon crash out immediately.
Crash log: