Segmentation fault - Githubissues

shenyaqi9527 commented 4 years ago

What caused this error? After running for a while, the error is reported and the process is killed. I then ran it again and it was able to synchronize, but the error occurred again. Is this a memory limit? The error occurs when memory usage reaches 1GB.

shenyaqi9527 commented 4 years ago

ok

erikd commented 4 years ago

It's the end of my day here so you may not hear from me until tomorrow.

erikd commented 4 years ago

@shenyaqi9527 is there any chance you are running out of disk space?

shenyaqi9527 commented 4 years ago

my disk is 1TB

erikd commented 4 years ago

Ok, probably not running out of disk space, but df -h will tell you for sure.

erikd commented 4 years ago

Since starting investigation of this I have encountered 5 instances of this. From kern.log:

# grep cardano-node /var/log/kern.log
Jan 20 14:01:36 nix kernel: [6124226.626857] cardano-node:w[13755]: segfault at 840554dfc0 ip 00007fa4abdd9d6e sp 00007fa49bffaa98 error 6 in libc-2.27.so[7fa4abc88000+1aa000]
Jan 20 18:00:47 nix kernel: [6138577.260721] cardano-node:w[24086]: segfault at 84079fd1d0 ip 00007f5a12efeb24 sp 00007f5a0a7f7a98 error 6 in libc-2.27.so[7f5a12dad000+1aa000]
Jan 20 18:19:44 nix kernel: [6139714.043880] cardano-node:w[24190]: segfault at 84009cd480 ip 00007fb987390d6e sp 00007fb97e7f7a98 error 6 in libc-2.27.so[7fb98723f000+1aa000]
Jan 20 19:17:49 nix kernel: [ 2608.843672] traps: cardano-node:w[1150] general protection fault ip:7fae31864d6e sp:7fadfcff4a98 error:0 in libc-2.27.so[7fae31713000+1aa000]
Jan 21 07:34:36 nix kernel: [46815.966810] cardano-node:w[5470]: segfault at 1dbf3fffff8 ip 0000000000413bd5 sp 00007fbe14ff88a0 error 4 in cardano-node[400000+1b00000]
Jan 21 11:03:19 nix kernel: [59339.472653] cardano-node:w[6310]: segfault at 84109bc549 ip 00007fb3d2482b24 sp 00007fb3a27f7a98 error 6 in libc-2.27.so[7fb3d2331000+1aa000]

The last instance happened after a complete apt update && apt upgrade followed by a reboot.

I do notice that the libc version is still 2.27 which seems to come from Nix.
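One way to compare these crashes is to subtract the mapping base (the first number inside the square brackets) from the instruction pointer. That gives the offset of the faulting instruction inside the library, regardless of where the library was loaded. A minimal Python sketch, assuming the `segfault at ...` message layout shown in the log lines above; the `fault_offset` helper and the regular expression are illustrative and not part of any Cardano tooling:

```python
import re

# Matches the "segfault at ... ip ... sp ... error N in obj[base+size]" kernel message.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (object name, offset of the faulting instruction), or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return m.group("obj"), int(m.group("ip"), 16) - int(m.group("base"), 16)


# Three of the kern.log entries quoted above, copied verbatim.
LOG_LINES = [
    "Jan 20 14:01:36 nix kernel: [6124226.626857] cardano-node:w[13755]: segfault at 840554dfc0 ip 00007fa4abdd9d6e sp 00007fa49bffaa98 error 6 in libc-2.27.so[7fa4abc88000+1aa000]",
    "Jan 20 18:00:47 nix kernel: [6138577.260721] cardano-node:w[24086]: segfault at 84079fd1d0 ip 00007f5a12efeb24 sp 00007f5a0a7f7a98 error 6 in libc-2.27.so[7f5a12dad000+1aa000]",
    "Jan 20 18:19:44 nix kernel: [6139714.043880] cardano-node:w[24190]: segfault at 84009cd480 ip 00007fb987390d6e sp 00007fb97e7f7a98 error 6 in libc-2.27.so[7fb98723f000+1aa000]",
]

for line in LOG_LINES:
    result = fault_offset(line)
    if result is not None:
        obj, offset = result
        print(obj, hex(offset))
```

Applied to these entries, the first and third resolve to the same offset inside libc, which suggests the same libc routine is faulting in both cases. The `general protection fault` entry uses a different message layout and is not matched by this pattern.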

erikd commented 4 years ago

IOHK devops have checked all the production/staging/testing instances under their control and they do not see any cardano-node segfaults in the kernel logs. One thing to notice is that the devops machines are all running version 3.1.0 rather than 3.2.0.

I am going to try 3.1.0.

erikd commented 4 years ago

And I got a segfault on the 3.1.0 tag.

Jan 21 12:01:33 nix kernel: [62833.180323] cardano-node:w[7063]: segfault at 8410e19440 ip 00007f3228456d6e sp 00007f3217ffaa98 error 6 in libc-2.27.so[7f3228305000+1aa000]

shenyaqi9527 commented 4 years ago

Is there a solution to this?

erikd commented 4 years ago

I do not even know what is causing this, so obviously there is not yet a solution.

I also cannot recreate this reliably enough. How reliably can you recreate it?

shenyaqi9527 commented 4 years ago

Can't IOHK developers fix it?

erikd commented 4 years ago

I am an IOHK developer!

shenyaqi9527 commented 4 years ago

Please help me think of something to solve this problem. Do I need to change to an OS?

erikd commented 4 years ago

You do not need to change OS. The cardano node is developed on and for Linux.

This is an obscure and difficult to reproduce problem. You are going to need to show patience.

erikd commented 4 years ago

I am currently running the node under valgrind to see if that can help me find the cause.

shenyaqi9527 commented 4 years ago

I'm running 3.1.0 and so far this has not been a problem.

erikd commented 4 years ago

What epoch/slot are you up to?

shenyaqi9527 commented 4 years ago

14651st slot of 54th epoch

root 20 0 1.001t 959396 21924 S 120.6 23.8 31:46.79 cardano-node

shenyaqi9527 commented 4 years ago

17635th slot of 67th epoch. This is where the process is killed.

[74788.914053] cardano-node:w[8914]: segfault at 840ccae7a0 ip 00007f4de11d9c55 sp 00007f4dc59cea98 error 6 in libc-2.27.so[7f4de1088000+1aa000]

erikd commented 4 years ago

The node has been running for over 20 hours under valgrind and has not yet failed.

Unfortunately, valgrind runs things at least 10 times slower, so it has only managed to sync from zero to epoch 59 in that time.

I am also a bit concerned that valgrind itself may reduce the chance of the bug being triggered.

erikd commented 4 years ago

I gave up on that valgrind run so I could try running it under gdb. Unfortunately the heavy use of exceptions within the node app means gdb cannot be used.

Trying without valgrind again.

erikd commented 4 years ago

Another segfault (without valgrind):

Jan 22 12:30:26 nix kernel: [150967.689171] cardano-node:w[26733]: segfault at 841eca1560 ip 00007f47db188d6e sp 00007f47a6ff8a98 error 6 in libc-2.27.so[7f47db037000+1aa000]
Jan 22 12:30:26 nix kernel: [150967.689177] Code: ff ff 0f 18 89 40 fe ff ff c5 fe 6f 01 c5 fe 6f 49 e0 c5 fe 6f 51 c0 c5 fe 6f 59 a0 48 81 e9 80 00 00 00 48 81 ea 80 00 00 00 <c4> c1 7d e7 01 c4 c1 7d e7 49 e0 c4 c1 7d e7 51 c0 c4 c1 7d e7 59

According to this, error 6 means "(data) write to an unmapped area".

This is probably due to some C code accessed via the C FFI.
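The number after `error` in the `segfault at` messages is the x86 page-fault error code, which is a bit field. A minimal sketch of decoding it, assuming the standard x86 bit layout (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode, bit 3: reserved bit set, bit 4: instruction fetch); `decode_page_fault` is a hypothetical helper name:

```python
def decode_page_fault(code):
    """Decode an x86 page-fault error code into its individual flags."""
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
        "reserved_bit": bool(code & 8),
        "instruction_fetch": bool(code & 16),
    }


# The two codes that appear in the segfault lines in this thread.
for code in (6, 4):
    print(code, decode_page_fault(code))
```

Under this layout, error 6 is a user-mode write to a page that is not mapped, and error 4 is a user-mode read of a page that is not mapped.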

erikd commented 4 years ago

I noticed that the Nix build that we are using is linking to version 5.11 of RocksDB whereas Debian has version 5.17 of that library.

Now trying a native Debian build of cardano-node rather than the Nix build.

erikd commented 4 years ago

And I almost immediately got a segfault with the version I built without Nix. :cry:

erikd commented 4 years ago

Currently building a profiled version of cardano-node under Debian. Hoping that gives a proper traceback. :crossed_fingers:

erikd commented 4 years ago

Ok, profiling got me my first traceback:

*** Exception (reporting due to +RTS -xc): (THUNK), stack trace:
  Pos.DB.Block.Internal.putSerializedBlunds.\,
  called from Pos.DB.Block.Internal.putSerializedBlunds,
  called from Pos.DB.Block.Internal.dbPutSerBlundsRealDefault,
  called from Cardano.Wallet.Kernel.Mode.dbPutSerBlunds,
  called from Pos.DB.Block.Load.putBlunds,
  called from Pos.DB.Block.Slog.Logic.slogApplyBlocks,
  called from Pos.DB.Block.Logic.Internal.applyBlocksDbUnsafeDo,
  called from Pos.DB.Block.Logic.Internal.applyBlocksUnsafe.app,
  called from Pos.DB.Block.Logic.Internal.applyBlocksUnsafe,
  called from Pos.DB.Block.Logic.VAR.rollingVerifyAndApply.\,
  called from Pos.DB.Block.Logic.VAR.rollingVerifyAndApply,
  called from Pos.DB.Block.Logic.VAR.verifyAndApplyBlocks,
  called from Pos.Network.Block.Logic.applyWithoutRollback.applyWithoutRollbackDo,
  called from Pos.DB.GState.Lock.stateLockHelper.\,

Going to run it a couple more times to make sure it crashes the same way each time.

erikd commented 4 years ago

Have run this a number of times and always get a traceback starting with putSerializedBlunds.

@dcoutts asked why there is no C code in that traceback. I think that is because of incompatible debug formats. GHC's profiling uses DWARF debugging symbols, and debug information for the C code either may not be enabled or may be in an incompatible format.

shenyaqi9527 commented 4 years ago

Has the problem been dealt with? @erikd

erikd commented 4 years ago

It has not. I have been assigned to other higher priority work.

shenyaqi9527 commented 4 years ago

No one is dealing with this problem right now? @erikd

erikd commented 4 years ago

No one that I am aware of. It seems that you are one of the few people who has been hitting this problem, and hence the priority has been downgraded.

My advice is to set the node up as a systemd service or something and have it restart automatically.
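Where systemd is not available (for example inside a container), the same restart-on-crash idea can be approximated with a small wrapper. A minimal Python sketch; `run_with_restart`, its parameters, and the retry limit are assumptions made for illustration, and a real deployment would more likely rely on systemd's `Restart=` setting or a container restart policy:

```python
import subprocess
import sys
import time


def run_with_restart(cmd, max_restarts=5, delay_seconds=10.0):
    """Run cmd and restart it after each abnormal exit.

    Returns the number of restarts once cmd exits cleanly (status 0).
    Raises RuntimeError if cmd still fails after max_restarts restarts.
    """
    restarts = 0
    while True:
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            return restarts
        # On POSIX a negative status means the process was killed by a signal,
        # e.g. -11 for SIGSEGV.
        print(f"{cmd[0]} exited with status {returncode}", file=sys.stderr)
        if restarts >= max_restarts:
            raise RuntimeError(f"giving up after {restarts} restarts")
        restarts += 1
        time.sleep(delay_seconds)
```

Calling `run_with_restart(cmd)`, where `cmd` is the argument list normally used to launch the node, keeps it running across crashes. This only works around the segfault; it does not address the cause.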

DZDomi commented 4 years ago

We are also running into the same issue with all 3 of our nodes. This is the Dockerfile we are using to run our node:

FROM nixos/nix:2.3

ENV CARDANO_VERSION 3.2.0

RUN apk update && \
    apk add git curl bzip2 bash

ADD nix.conf /etc/nix/nix.conf

WORKDIR /opt

RUN git clone https://github.com/input-output-hk/cardano-sl.git

WORKDIR /opt/cardano-sl

RUN git checkout $CARDANO_VERSION
RUN nix-build -A connectScripts.mainnet.wallet -o connect-to-mainnet && \
    nix-build -A connectScripts.mainnet.explorer -o connect-explorer-to-mainnet && \
    nix-build -A connectScripts.testnet.wallet -o connect-to-testnet && \
    nix-build -A connectScripts.testnet.explorer -o connect-explorer-to-testnet

ADD entrypoint.sh .

ENTRYPOINT ["./entrypoint.sh"]

entrypoint.sh:

#!/usr/bin/env bash

if [[ "$1" == "explorer" ]]; then
  if [[ -n "$TESTNET" ]]; then
    exec ./connect-explorer-to-testnet --runtime-args --web-port="$RPC_PORT"
  else
    exec ./connect-explorer-to-mainnet --runtime-args --web-port="$RPC_PORT"
  fi
fi

if [[ "$1" == "node" ]]; then
  if [[ -n "$TESTNET" ]]; then
    sed -i "s/--wallet-address 127.0.0.1:8090/--wallet-address 0.0.0.0:$RPC_PORT/g" connect-to-testnet
    exec ./connect-to-testnet --runtime-args --no-tls
  else
    sed -i "s/--wallet-address 127.0.0.1:8090/--wallet-address 0.0.0.0:$RPC_PORT/g" connect-to-mainnet
    exec ./connect-to-mainnet --runtime-args --no-tls
  fi
fi

nix.conf:

cores = 0
max-jobs = auto
sandbox = false
substituters = https://hydra.iohk.io https://cache.nixos.org
trusted-substituters =
trusted-public-keys  = hydra.iohk.io:f/Ea+s+dFdN+3Y/G+FDgSq+a5NEWhJGzdjvKNGv0/EQ= cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY=

@erikd

michdr commented 4 years ago

I can confirm that it happens in both v3.2.0 and v3.1.0. Any estimate of when this bug is going to be taken care of? @erikd

erikd commented 4 years ago

The code base in this repo has been in maintenance mode for close to a year. This bug is difficult to reproduce, suggesting it is machine specific (insufficient memory?). As such, it almost certainly will never be fixed.

However, the new code base in the cardano-node repository is nearing completion and can actually connect to mainnet and operate as a full node. ~It does not yet work with Daedalus.~ I am not sure if it currently connects to Daedalus.

To provide the best advice on your best path forward, it would be useful if you could tell me your goals.

michdr commented 4 years ago

Thanks @erikd. So, is cardano-node expected to be a de-facto replacement for cardano-sl?

The goal is to be able to operate a full and reliable node for various operations on the mainnet (create transactions, verify existing ones, etc.).

erikd commented 4 years ago

> So, is cardano-node expected to be a de-facto replacement for cardano-sl?

Yes!

input-output-hk / cardano-sl

Segmentation fault #4268