Closed shenyaqi9527 closed 4 years ago
ok
It's the end of my day here so you may not hear from me until tomorrow.
@shenyaqi9527 is there any chance you are running out of disk space?
my disk is 1TB
Ok, probably not running out of disk space, but `df -h` will tell you for sure.
Since starting to investigate this I have encountered 5 instances of it. From `kern.log`:
# grep cardano-node /var/log/kern.log
Jan 20 14:01:36 nix kernel: [6124226.626857] cardano-node:w[13755]: segfault at 840554dfc0 ip 00007fa4abdd9d6e sp 00007fa49bffaa98 error 6 in libc-2.27.so[7fa4abc88000+1aa000]
Jan 20 18:00:47 nix kernel: [6138577.260721] cardano-node:w[24086]: segfault at 84079fd1d0 ip 00007f5a12efeb24 sp 00007f5a0a7f7a98 error 6 in libc-2.27.so[7f5a12dad000+1aa000]
Jan 20 18:19:44 nix kernel: [6139714.043880] cardano-node:w[24190]: segfault at 84009cd480 ip 00007fb987390d6e sp 00007fb97e7f7a98 error 6 in libc-2.27.so[7fb98723f000+1aa000]
Jan 20 19:17:49 nix kernel: [ 2608.843672] traps: cardano-node:w[1150] general protection fault ip:7fae31864d6e sp:7fadfcff4a98 error:0 in libc-2.27.so[7fae31713000+1aa000]
Jan 21 07:34:36 nix kernel: [46815.966810] cardano-node:w[5470]: segfault at 1dbf3fffff8 ip 0000000000413bd5 sp 00007fbe14ff88a0 error 4 in cardano-node[400000+1b00000]
Jan 21 11:03:19 nix kernel: [59339.472653] cardano-node:w[6310]: segfault at 84109bc549 ip 00007fb3d2482b24 sp 00007fb3a27f7a98 error 6 in libc-2.27.so[7fb3d2331000+1aa000]
The last instance happened after a complete `apt update && apt upgrade` followed by a reboot.
I do notice that the `libc` version is still `2.27`, which seems to come from Nix.
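One way to check whether the libc crashes above hit the same instruction is to subtract the mapping base (the number before `+` in the square brackets) from the instruction pointer. The sketch below does that for three of the quoted log lines. The regular expression and the helper names `fault_offset` and `summarize` are illustrative assumptions, not part of any cardano tooling.

```python
import re
from collections import Counter

# Matches the kernel's "segfault at ADDR ip IP sp SP error N in OBJECT[BASE+SIZE]" text.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<error>\d+) in (?P<object>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (object, offset of ip inside the object, error code), or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    offset = int(m["ip"], 16) - int(m["base"], 16)
    return m["object"], offset, int(m["error"])


def summarize(lines):
    """Count how often each (object, hex offset) pair occurs in the given log lines."""
    counts = Counter()
    for line in lines:
        parsed = fault_offset(line)
        if parsed is not None:
            obj, offset, _error = parsed
            counts[(obj, hex(offset))] += 1
    return counts


# Three of the kern.log lines quoted above, without the syslog timestamp prefix.
SAMPLE_LINES = [
    "cardano-node:w[13755]: segfault at 840554dfc0 ip 00007fa4abdd9d6e sp 00007fa49bffaa98 error 6 in libc-2.27.so[7fa4abc88000+1aa000]",
    "cardano-node:w[24086]: segfault at 84079fd1d0 ip 00007f5a12efeb24 sp 00007f5a0a7f7a98 error 6 in libc-2.27.so[7f5a12dad000+1aa000]",
    "cardano-node:w[24190]: segfault at 84009cd480 ip 00007fb987390d6e sp 00007fb97e7f7a98 error 6 in libc-2.27.so[7fb98723f000+1aa000]",
]

if __name__ == "__main__":
    for (obj, offset), n in summarize(SAMPLE_LINES).most_common():
        print(f"{obj} + {offset}: {n} crash(es)")
```

For these three lines, two crashes land at offset `0x151d6e` inside `libc-2.27.so` and one at `0x151b24`, so at least some of the faults occur at the same instruction. With matching debug symbols, a tool such as `addr2line` or `gdb` could map such an offset to a function name.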
IOHK devops have checked all the production/staging/testing instances under their control and they do not see any cardano-node segfaults in the kernel logs. One thing to notice is that the devops machines are all running version `3.1.0` rather than `3.2.0`.
I am going to try `3.1.0`.
And I got a segfault on the `3.1.0` tag.
Jan 21 12:01:33 nix kernel: [62833.180323] cardano-node:w[7063]: segfault at 8410e19440 ip 00007f3228456d6e sp 00007f3217ffaa98 error 6 in libc-2.27.so[7f3228305000+1aa000]
Is there a solution to this?
I do not even know what is causing this, so obviously there is not yet a solution.
I also cannot recreate this reliably enough. How reliably can you recreate it?
Can't IOHK developers fix it?
I am an IOHK developer!
Please help me think of something to solve this problem. Do I need to change to another OS?
You do not need to change OS. The cardano node is developed on and for Linux.
This is an obscure and difficult to reproduce problem. You are going to need to show patience.
I am currently running the node under `valgrind` to see if that can help me find the cause.
I'm running 3.1.0 and so far this has not been a problem.
What epoch/slot are you up to?
14651st slot of 54th epoch
root 20 0 1.001t 959396 21924 S 120.6 23.8 31:46.79 cardano-node
17635th slot of 67th epoch. This is where the process is killed.
[74788.914053] cardano-node:w[8914]: segfault at 840ccae7a0 ip 00007f4de11d9c55 sp 00007f4dc59cea98 error 6 in libc-2.27.so[7f4de1088000+1aa000]
The node has been running for over 20 hours under `valgrind` and has not yet failed.
Unfortunately, `valgrind` runs things at least 10 times slower, so it has only managed to sync from zero to epoch 59 in that time.
I am also a bit concerned that `valgrind` itself may reduce the chance of the bug being triggered.
I gave up on that `valgrind` run so I could try running it under `gdb`. Unfortunately the heavy use of exceptions within the node app means `gdb` cannot be used.
Trying without `valgrind` again.
Another segfault (without valgrind):
Jan 22 12:30:26 nix kernel: [150967.689171] cardano-node:w[26733]: segfault at 841eca1560 ip 00007f47db188d6e sp 00007f47a6ff8a98 error 6 in libc-2.27.so[7f47db037000+1aa000]
Jan 22 12:30:26 nix kernel: [150967.689177] Code: ff ff 0f 18 89 40 fe ff ff c5 fe 6f 01 c5 fe 6f 49 e0 c5 fe 6f 51 c0 c5 fe 6f 59 a0 48 81 e9 80 00 00 00 48 81 ea 80 00 00 00 <c4> c1 7d e7 01 c4 c1 7d e7 49 e0 c4 c1 7d e7 51 c0 c4 c1 7d e7 59
According to this, `error 6` means "(data) write to an unmapped area".
This is probably due to some C code accessed via the C FFI.
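The `error N` field in these kernel messages is the x86 page-fault error code, a small bit mask. As a rough sketch of how to read it, the snippet below decodes the low five bits. The constant names follow the usual kernel naming, but `decode_fault` is a hypothetical helper written for this thread.

```python
# x86 page-fault error code bits, as printed in the kernel's "error N" field.
PF_PROT = 1 << 0   # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1  # 0: read access, 1: write access
PF_USER = 1 << 2   # 0: fault raised in kernel mode, 1: in user mode
PF_RSVD = 1 << 3   # a reserved bit was set in a page-table entry
PF_INSTR = 1 << 4  # the fault was caused by an instruction fetch


def decode_fault(code):
    """Turn a page-fault error code into a short human-readable description."""
    parts = ["user-mode" if code & PF_USER else "kernel-mode"]
    if code & PF_INSTR:
        parts.append("instruction fetch")
    else:
        parts.append("write" if code & PF_WRITE else "read")
    parts.append("protection violation" if code & PF_PROT else "page not present")
    if code & PF_RSVD:
        parts.append("reserved bit set")
    return ", ".join(parts)


if __name__ == "__main__":
    for code in (4, 6):
        print(code, "->", decode_fault(code))
```

Under this reading, `error 6` is a user-mode write to a page that is not present, which matches the description quoted above, and the single `error 4` crash inside `cardano-node` itself is a user-mode read of a page that is not present.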
I noticed that the Nix build that we are using is linking to version 5.11 of RocksDB whereas Debian has version 5.17 of that library.
Now trying a native Debian build of `cardano-node` rather than the Nix build.
And I almost immediately got a segfault with the version I built without Nix. :cry:
Currently building a profiled version of `cardano-node` under Debian. Hoping that gives a proper traceback. :crossed_fingers:
Ok, profiling got me my first traceback:
*** Exception (reporting due to +RTS -xc): (THUNK), stack trace:
Pos.DB.Block.Internal.putSerializedBlunds.\,
called from Pos.DB.Block.Internal.putSerializedBlunds,
called from Pos.DB.Block.Internal.dbPutSerBlundsRealDefault,
called from Cardano.Wallet.Kernel.Mode.dbPutSerBlunds,
called from Pos.DB.Block.Load.putBlunds,
called from Pos.DB.Block.Slog.Logic.slogApplyBlocks,
called from Pos.DB.Block.Logic.Internal.applyBlocksDbUnsafeDo,
called from Pos.DB.Block.Logic.Internal.applyBlocksUnsafe.app,
called from Pos.DB.Block.Logic.Internal.applyBlocksUnsafe,
called from Pos.DB.Block.Logic.VAR.rollingVerifyAndApply.\,
called from Pos.DB.Block.Logic.VAR.rollingVerifyAndApply,
called from Pos.DB.Block.Logic.VAR.verifyAndApplyBlocks,
called from Pos.Network.Block.Logic.applyWithoutRollback.applyWithoutRollbackDo,
called from Pos.DB.GState.Lock.stateLockHelper.\,
Going to run it a couple of more times to make sure it crashes the same way each time.
Have run this a number of times and always get a traceback starting with `putSerializedBlunds`.
@dcoutts asked why there is no C code in that traceback. I think that is because of incompatible debug formats. GHC's profiling uses DWARF debugging symbols and the debug information for the C code in the backtrace either may not be enabled or may be in an incompatible format.
Has the problem been dealt with? @erikd
It has not. I have been assigned to other higher priority work.
No one is dealing with this problem right now? @erikd
No one that I am aware of. It seems that you are one of the few people who has been hitting this problem, and hence the priority has been downgraded.
My advice is to set the node up as a `systemd` service or something and have it restart automatically.
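Where `systemd` is not available, the same restart-on-crash idea can be approximated with a small supervisor loop. The sketch below is not the `systemd` approach itself, only a stand-in for it. `run_with_restart` and its parameters are hypothetical, and the demo uses a dummy command that always fails rather than the real node start script.

```python
import subprocess
import sys
import time


def run_with_restart(cmd, max_restarts=5, delay_seconds=5.0):
    """Run cmd and restart it after every abnormal exit.

    Returns the list of exit codes seen, one per run. Stops early on a
    clean exit (code 0) or after max_restarts restarts.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        result = subprocess.run(cmd)
        exit_codes.append(result.returncode)
        if result.returncode == 0:
            break  # clean shutdown: do not restart
        # On POSIX a negative return code -N means the child was killed by
        # signal N, so a segfault (SIGSEGV) shows up as -11.
        print(f"run {attempt + 1} exited with {result.returncode}", file=sys.stderr)
        if attempt < max_restarts:
            time.sleep(delay_seconds)
    return exit_codes


if __name__ == "__main__":
    # Dummy command standing in for the node start script (an assumption).
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_with_restart(failing, max_restarts=2, delay_seconds=0))
```

Capping the number of restarts and sleeping between runs keeps a node that crashes immediately on startup from spinning in a tight loop.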
We are also running into the same issue with all of our 3 nodes. This is the Dockerfile we are using to run our node:
FROM nixos/nix:2.3
ENV CARDANO_VERSION 3.2.0
RUN apk update && \
    apk add git curl bzip2 bash
ADD nix.conf /etc/nix/nix.conf
WORKDIR /opt
RUN git clone https://github.com/input-output-hk/cardano-sl.git
WORKDIR /opt/cardano-sl
RUN git checkout $CARDANO_VERSION
RUN nix-build -A connectScripts.mainnet.wallet -o connect-to-mainnet && \
    nix-build -A connectScripts.mainnet.explorer -o connect-explorer-to-mainnet && \
    nix-build -A connectScripts.testnet.wallet -o connect-to-testnet && \
    nix-build -A connectScripts.testnet.explorer -o connect-explorer-to-testnet
ADD entrypoint.sh .
ENTRYPOINT ["./entrypoint.sh"]
entrypoint.sh:
#!/usr/bin/env bash
if [[ "$1" == "explorer" ]]; then
  if [[ -n "$TESTNET" ]]; then
    exec ./connect-explorer-to-testnet --runtime-args --web-port="$RPC_PORT"
  else
    exec ./connect-explorer-to-mainnet --runtime-args --web-port="$RPC_PORT"
  fi
fi
if [[ "$1" == "node" ]]; then
  if [[ -n "$TESTNET" ]]; then
    sed -i "s/--wallet-address 127.0.0.1:8090/--wallet-address 0.0.0.0:$RPC_PORT/g" connect-to-testnet
    exec ./connect-to-testnet --runtime-args --no-tls
  else
    sed -i "s/--wallet-address 127.0.0.1:8090/--wallet-address 0.0.0.0:$RPC_PORT/g" connect-to-mainnet
    exec ./connect-to-mainnet --runtime-args --no-tls
  fi
fi
nix.conf:
cores = 0
max-jobs = auto
sandbox = false
substituters = https://hydra.iohk.io https://cache.nixos.org
trusted-substituters =
trusted-public-keys = hydra.iohk.io:f/Ea+s+dFdN+3Y/G+FDgSq+a5NEWhJGzdjvKNGv0/EQ= cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY=
@erikd
I can confirm that it happens in both v3.2.0 and v3.1.0. Any estimate of when this bug is going to be taken care of? @erikd
The code base in this repo has been in maintenance mode for close to a year. This bug is difficult to reproduce, suggesting it is machine specific (insufficient memory?). As such it almost certainly will never be fixed.
However, the new code base in the `cardano-node` repository is nearing completion and can actually connect to mainnet and operate as a full node. ~It does not yet work with Daedalus.~ I am not sure if it currently connects to Daedalus.
To provide the best advice on your best path forward, it would be useful if you could tell me your goals.
Thanks @erikd. So, is `cardano-node` expected to be a de-facto replacement for `cardano-sl`?
The goal is to be able to operate a full and reliable node for various operations on the mainnet (create transactions, verify existing ones, etc.).
> So, is cardano-node expected to be a de-facto replacement for cardano-sl?
Yes!
What caused the error? After running for a while, the error is reported and the process is killed. Then I ran it again and was able to synchronize, but then the error occurred again. Is this a memory limit? The error occurs when the memory reaches 1GB.