input-output-hk / cardano-sl

Cryptographic currency implementing Ouroboros PoS protocol
Apache License 2.0

Segmentation fault #4268

Closed shenyaqi9527 closed 4 years ago

shenyaqi9527 commented 4 years ago

What caused this error? After running for a while, the error is reported and the process is killed. When I ran it again it was able to synchronize, but then the error occurred again. Is this a memory limit? The error occurs when memory usage reaches 1 GB. (screenshot attached)

erikd commented 4 years ago

Please cut and paste the actual log output rather than including a screenshot. Screenshots are unreadable for people with high DPI monitors.

shenyaqi9527 commented 4 years ago

[cardano-sl.*production*:Info:ThreadId 382] [2020-01-17 07:31:07.48 UTC] Trying to apply blocks w/o rollback. First 3: [MainBlockHeader:
    hash: 19b1f1eec6f9abb145114bfeda8cad76f2ff9fda4ff3e4ccdaf369f4052f9c3b
    previous block: 1b3c31eb41b0d38af18d9cf908a7c1848911d081271fb29b88d5b010932b2eba
    slot: 8364th slot of 152nd epoch
    difficulty: 3290025
    leader: pub:993a8f05
    signature: BlockPSignatureHeavy: Proxy signature { psk = ProxySk { w = #0, iPk = pub:993a8f05, dPk = pub:89c29f8c } }
    block: v0.2.0
    software: cardano-sl:1
, MainBlockHeader:
    hash: a7b1b58758880db395796ce8b8cba290de717097a3a1de5b50e0a9923a2941f0
    previous block: 19b1f1eec6f9abb145114bfeda8cad76f2ff9fda4ff3e4ccdaf369f4052f9c3b
    slot: 8365th slot of 152nd epoch
    difficulty: 3290026
    leader: pub:0bdb1f5e
    signature: BlockPSignatureHeavy: Proxy signature { psk = ProxySk { w = #0, iPk = pub:0bdb1f5e, dPk = pub:5fddeeda } }
    block: v0.2.0
    software: cardano-sl:1
, MainBlockHeader:
    hash: f25b190e0f961f05b111952b72a8cba6b30cffa4caac4c60eda64697471dc606
    previous block: a7b1b58758880db395796ce8b8cba290de717097a3a1de5b50e0a9923a2941f0
    slot: 8366th slot of 152nd epoch
    difficulty: 3290027
    leader: pub:1bc97a2f
    signature: BlockPSignatureHeavy: Proxy signature { psk = ProxySk { w = #0, iPk = pub:1bc97a2f, dPk = pub:61261a95 } }
    block: v0.2.0
    software: cardano-sl:1
]
Last 3: [MainBlockHeader:
    hash: 197d5cfea25e990f6893e1250ea248ac25c0465db43abef95677b32c0d3ebbff
    previous block: 14b798e52f1b215d77b8be0ff1315ca25885f37dc27cacb7ac6db897d877f8a1
    slot: 8425th slot of 152nd epoch
    difficulty: 3290086
    leader: pub:9a6fa343
    signature: BlockPSignatureHeavy: Proxy signature { psk = ProxySk { w = #0, iPk = pub:9a6fa343, dPk = pub:8b532076 } }
    block: v0.2.0
    software: cardano-sl:1
, MainBlockHeader:
    hash: 5a0795022c4786191d90eaf83f0a58c927d399a4779b1a61b78a0188de439c3a
    previous block: 197d5cfea25e990f6893e1250ea248ac25c0465db43abef95677b32c0d3ebbff
    slot: 8426th slot of 152nd epoch
    difficulty: 3290087
    leader: pub:0bdb1f5e
    signature: BlockPSignatureHeavy: Proxy signature { psk = ProxySk { w = #0, iPk = pub:0bdb1f5e, dPk = pub:5fddeeda } }
    block: v0.2.0
    software: cardano-sl:1
, MainBlockHeader:
    hash: d571750aee77c352ae4a3be20b1f229d4e3d6c549668a06b2a19b3b8bc301843
    previous block: 5a0795022c4786191d90eaf83f0a58c927d399a4779b1a61b78a0188de439c3a
    slot: 8427th slot of 152nd epoch
    difficulty: 3290088
    leader: pub:0bdb1f5e
    signature: BlockPSignatureHeavy: Proxy signature { psk = ProxySk { w = #0, iPk = pub:0bdb1f5e, dPk = pub:5fddeeda } }
    block: v0.2.0
    software: cardano-sl:1
]
[cardano-sl.*production*:Debug:ThreadId 382] [2020-01-17 07:31:07.48 UTC] MemPool metrics wait: ApplyBlock queue length is 1
[cardano-sl.*production*:Debug:ThreadId 382] [2020-01-17 07:31:07.48 UTC] MemPool metrics acquire: ApplyBlock wait time was 12mcs
[cardano-sl.*production*:Info:ThreadId 382] [2020-01-17 07:31:07.48 UTC] Verifying and applying blocks...
[cardano-sl.*production*:Debug:ThreadId 382] [2020-01-17 07:31:07.48 UTC] Rolling: verifying
[cardano-sl.*production*:Debug:ThreadId 382] [2020-01-17 07:31:07.48 UTC] verifyBlocksPrefix: 64
[cardano-sl.*production*:Info:ThreadId 382] [2020-01-17 07:31:07.48 UTC] slogVerifyBlocks: Consensus era is Original
[cardano-sl.*production*:Debug:ThreadId 382] [2020-01-17 07:31:07.71 UTC] Rolling: Verification done, applying unsafe block
[cardano-sl.node:Debug:ThreadId 316] [2020-01-17 07:31:07.72 UTC] applying some blocks (non-rollback)
[cardano-sl.*production*:Info:ThreadId 382] [2020-01-17 07:31:07.74 UTC] Verifying and applying blocks done
[cardano-sl.*production*:Debug:ThreadId 382] [2020-01-17 07:31:07.74 UTC] MemPool metrics release: ApplyBlock modify time was 258859mcs size is 0
[cardano-sl.*production*:Debug:ThreadId 382] [2020-01-17 07:31:07.74 UTC] Not relaying block in recovery mode
[cardano-sl.*production*:Info:ThreadId 382] [2020-01-17 07:31:07.74 UTC] Blocks have been adopted: [19b1f1eec6f9abb1, a7b1b58758880db3, f25b190e0f961f05, 0e907e6cf3f6ca4d, d4f74c5d312b0549, e4d06fea6f988ce9, ee279270b61135e9, 6c7998c9ba54d17e, 94853e29ce7200e6, f08ed74a4fcfd8e3, 834b772bd7901ffb, fb514730332114d2, c8a4b3a5697044ee, 5c9a27718384810e, 98573eeb0e53dbd4, a3be980a5027b156, 3713b244551e732b, ff9fa80350026e9e, 1704bdb34a16c6be, 229b40d03c219294, dfcbec3a426b4ca3, 860e4a48c46e67fa, 40b75fe17d58e8bf, faddbd98928a5527, 7b1d29bc099b5b19, ab08ff7a02399e92, ae699dcf30f5ef08, a6d99576b5002d9a, 62b679c913e1c9ee, 4d01d5340c42fdb3, 196d33de29e35ae9, 98ee850aac9a1b9d, 4f3037f91ccd71fd, 24fbc16a4479d02f, 53ac78eb4d6ab7f2, d48b6feb41138da5, 676d449b1ca1fb74, 3327651659423dee, 1a3c9a6c2f095cd7, 16b36ed4da3e05ff, 2223acc65da5f5db, e8d4e12a616f4227, 443ff6e4fffeac0b, 22db605e0059c4d8, 643d9645b54e09b5, fdf7bfe14b34aed0, 6a1650622dec4d07, 1af30b82a5ee01a8, 1e585328b2eda019, db55785d3ea21869, 0ac37f9d03244062, cc9d0a297ea9af53, ebffb3fc15d7b62b, e7b81541b23ace3a, a48604d65cef0180, 7af2bd4c948d283f, 013dc0ea380f67a1, de88e7e0c354e93a, a762148ba5fbf0a6, 432652c976cbbdc4, 14b798e52f1b215d, 197d5cfea25e990f, 5a0795022c478619, d571750aee77c352]
^Z[9]   Segmentation fault      nohup sudo ./mainnet.sh > log.log 2>&1  (wd: /data/ada/cardano-sl)
(wd now: /data/ada/cardano-sl/state-wallet-mainnet/logs/pub)

erikd commented 4 years ago

What version (use the git hash) is this?

Is this repeatable? I.e., if you run it again, do you get the same error?

shenyaqi9527 commented 4 years ago

master 1a792d7cd [origin/master] Merge #4242

Yes, this is repeatable @erikd

erikd commented 4 years ago

That git hash is from Sep 2019.

Try checking out tag 3.2.0 which is from Nov 22, 2019.

If that does not fix it, I will look at this on Monday morning. It's currently 9pm Friday here.

shenyaqi9527 commented 4 years ago

thank you @erikd

shenyaqi9527 commented 4 years ago

@erikd I use "nix-build -A connectScripts.mainnet.wallet -o mainnet.sh" command to build. mainnet.sh:

#!/nix/store/vlb7kcc1k035vpyrgsj9kk7380yh68wd-bash-4.4-p23/bin/bash

set -euo pipefail

if [[ "${1-}" == "--delete-state" ]]; then
  echo "Deleting state-wallet-mainnet ... "
  rm -Rf state-wallet-mainnet
  shift
fi
if [[ "${1-}" == "--runtime-args" ]]; then
  RUNTIME_ARGS="${2-}"
  shift 2
else
  RUNTIME_ARGS=""
fi

echo "Keeping state in state-wallet-mainnet"
mkdir -p state-wallet-mainnet/logs

echo "Launching a node connected to 'mainnet' ..."
export LC_ALL=en_GB.UTF-8
export LANG=en_GB.UTF-8

if [ ! -d state-wallet-mainnet/tls ]; then
  mkdir -p state-wallet-mainnet/tls/server && mkdir -p state-wallet-mainnet/tls/client
  /nix/store/ra45xgy1ngy9bpn12h5fib7m81925i80-cardano-sl-tools-3.2.0-exe-cardano-x509-certificates/bin/cardano-x509-certificates \
    --server-out-dir state-wallet-mainnet/tls/server \
    --clients-out-dir state-wallet-mainnet/tls/client \
    --configuration-file /nix/store/r02jsbcld1cmy47y1cxr8c9l6y9z7a8n-tls-config-mainnet.yaml \
    --configuration-key mainnet_full
fi
ln -sf /nix/store/0gzajk6rskv7xigvwhgly1zrn3m75d4r-curl-wallet-mainnet state-wallet-mainnet/curl

exec /nix/store/ya8iqz0l34w9mszd06ir3pchasryqz4a-cardano-wallet-3.2.0-exe-cardano-node/bin/cardano-node \
  --configuration-file /nix/store/j4rz117v3paa3ys3abfkxacvghyd7chn-cardano-sl-config/lib/configuration.yaml --configuration-key mainnet_full \
  --tlscert state-wallet-mainnet/tls/server/server.crt \
  --tlskey state-wallet-mainnet/tls/server/server.key \
  --tlsca state-wallet-mainnet/tls/server/ca.crt \
  --log-config /nix/store/j4rz117v3paa3ys3abfkxacvghyd7chn-cardano-sl-config/log-configs/connect-to-cluster.yaml \
  --topology "/nix/store/kiwxslk8q90j8rrjj4vqnnc9np5a9bhy-topology-mainnet" \
  --logs-prefix "state-wallet-mainnet/logs" \
  --db-path "state-wallet-mainnet/db" \
  --wallet-db-path 'state-wallet-mainnet/wallet-db' \
  --no-client-auth \
  \
  --keyfile state-wallet-mainnet/secret.key \
  --wallet-address 0.0.0.0:8090 \
  --wallet-doc-address 127.0.0.1:8091 \
  --ekg-server 127.0.0.1:8000 --metrics \
  +RTS -N2 -qg -A1m -I0 -T -RTS \
  \
  $RUNTIME_ARGS

erikd commented 4 years ago

I use "nix - build - A connectScripts.mainnet.wallet - o mainnet.sh" command to build.

There seem to be some extra spaces around the - character in that. It should be nix-build -A and -o.

I just did:

> git checkout 3.2.0 -b tag-3.2.0
> nix-build -A connectScripts.mainnet.wallet -o mainnet.sh

and it worked as expected.

shenyaqi9527 commented 4 years ago

@erikd That's probably because I copied it. The spaces were added for me automatically.

The process ended abruptly again with the same error:

[cardano-sl.*production*:Debug:1689] [2020-01-20 02:54:16.92 UTC] Rolling: verifying
[cardano-sl.*production*:Debug:1689] [2020-01-20 02:54:16.92 UTC] verifyBlocksPrefix: 64
[cardano-sl.*production*:Info:1689] [2020-01-20 02:54:16.92 UTC] slogVerifyBlocks: Consensus era is Original
^Z[5]   Segmentation fault      sudo nohup ./mainnet.sh > log.log 2>&1

[6]+  Stopped                 tail -200f log.log

Do I need to upgrade my server? It currently has 2 cores and 4 GB RAM.

erikd commented 4 years ago

4G should be enough RAM, especially if nothing else is happening on that machine.

I am currently running the ./mainnet.sh script. Any idea what epoch you are getting up to when it segfaults?

erikd commented 4 years ago

And now it segfaults for me too! At 5046th slot of 3rd epoch.

shenyaqi9527 commented 4 years ago

I think this error occurs when the process reaches 1 GB of memory, because after I restart the process it can continue to synchronize. I restarted once, and it is now at the 10365th slot of the 13th epoch.

erikd commented 4 years ago

I am running on a 16G VM, and I was able to recreate that problem so that is not it.

Oh, hang on, you are running on a 64 bit CPU aren't you?

erikd commented 4 years ago

I also checked out the HEAD of the develop branch and that synced to epoch 4 without a problem.

Then I switched back to the 3.2.0 tag and synced from scratch to epoch 5, again without a problem.

I wonder if there is a peer somewhere on the network that is serving up corrupted blocks.

shenyaqi9527 commented 4 years ago

admin@ada:/data/ada$ getconf LONG_BIT
64

On Friday, I restarted the node many times and it eventually finished synchronizing. On Monday, the process was killed.

erikd commented 4 years ago

What base OS and OS version are you running this on?

shenyaqi9527 commented 4 years ago

Ubuntu 18.04

erikd commented 4 years ago

Ubuntu 18.04 should be fine.

I would try deleting the ./state-wallet-mainnet directory, and then try resyncing. Each time it segfaults, record the slot and epoch number and restart it. When you have about 10 entries, post the list here.
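The record-and-restart procedure described above could be scripted roughly as follows. This is only a sketch: it assumes the node's output is redirected to a single log file (as in the `./mainnet.sh > log.log 2>&1` invocation earlier in the thread) and that the last `slot:` entry in that log is a fair approximation of where the node got to before it died. The function names `last_slot` and `record_crashes` and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and file names are illustrative assumptions.

# Print the last "<N>th slot of <M>th epoch" phrase found in a log file.
last_slot() {
  grep -o '[0-9][0-9]*[a-z][a-z] slot of [0-9][0-9]*[a-z][a-z] epoch' "$1" | tail -n 1
}

# Run the node repeatedly; after each exit, append the last slot seen to a list.
#   $1 = log file, $2 = output list, $3 = number of runs to record
record_crashes() {
  log=$1
  out=$2
  runs=$3
  i=0
  while [ "$i" -lt "$runs" ]; do
    # The node is expected to exit abnormally, so its exit status is ignored.
    ./mainnet.sh > "$log" 2>&1 || true
    last_slot "$log" >> "$out"
    i=$((i + 1))
  done
}

# Example (not run here): start from a clean state, then record 10 failures.
#   rm -rf ./state-wallet-mainnet
#   record_crashes log.log crash-slots.txt 10
```

Note that each iteration blocks until the node exits, so a run that never crashes will simply keep syncing.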

shenyaqi9527 commented 4 years ago

8280th slot of 8th epoch
18274th slot of 18th epoch
20293rd slot of 19th epoch
7566th slot of 31st epoch
1618th slot of 33rd epoch
4089th slot of 36th epoch
15837th slot of 55th epoch

erikd commented 4 years ago

OK, please delete the ./state-wallet-mainnet directory, run it again, and list the first 10 entries here.

shenyaqi9527 commented 4 years ago

What do you mean?

erikd commented 4 years ago

Delete ./state-wallet-mainnet directory and do the same test again. Would be useful to know if we get the same results.

shenyaqi9527 commented 4 years ago

Do I need to delete this directory when I run it again?

erikd commented 4 years ago

Yes. That is the state directory where the node stores blocks.

shenyaqi9527 commented 4 years ago

[cardano-sl.*production*:Debug:1689] [2020-01-20 04:56:29.75 UTC] Handling block w/ LCA, which is a07d3104
[cardano-sl.*production*:Info:1689] [2020-01-20 04:56:29.75 UTC] Trying to apply blocks w/o rollback. First 3: [MainBlockHeader:
    hash: 92d68c2ba61d115b3c53f1c857c77355062c2ee76680e8afed94980e7fbba239
    previous block: a07d310498f2417e1d6ade2dd2e3da8f802995407845d164616053dfc683a44d
    slot: 18033rd slot of 26th epoch
    difficulty: 579559
    leader: pub:9a6fa343
    signature: BlockPSignatureHeavy: Proxy signature { psk = ProxySk { w = #0, iPk = pub:9a6fa343, dPk = pub:8b532076 } }
    block: v0.1.0
    software: cardano-sl:0
, MainBlockHeader:
    hash: c5342a48f472640576a177d2992b77f7939e406e463d00bc480ddb339747e85f
    previous block: 92d68c2ba61d115b3c53f1c857c77355062c2ee76680e8afed94980e7fbba239
    slot: 18034th slot of 26th epoch
    difficulty: 579560
    leader: pub:0bdb1f5e
    signature: BlockPSignatureHeavy: Proxy signature { psk = ProxySk { w = #0, iPk = pub:0bdb1f5e, dPk = pub:5fddeeda } }
    block: v0.1.0
    software: cardano-sl:0
, MainBlockHeader:
    hash: d8462f486a46f0688786e3880c7982a6f69ce8f0adcf18c1e1089bb9480b7490
    previous block: c5342a48f472640576a177d2992b77f7939e406e463d00bc480ddb339747e85f
    slot: 18035th slot of 26th epoch
    difficulty: 579561
    leader: pub:9a6fa343
    signature: BlockPSignatureHeavy: Proxy signature { psk = ProxySk { w = #0, iPk = pub:9a6fa343, dPk = pub:8b532076 } }
    block: v0.1.0
    software: cardano-sl:0

The log ends here, unfinished. The process was killed, but this time no error was reported.

erikd commented 4 years ago

@shenyaqi9527 I need a list of where (i.e., epoch and slot number) the process gets killed, starting from scratch. Please run it again so I can compare it with the last list.

shenyaqi9527 commented 4 years ago

Do I need to delete "./state-wallet-mainnet"?

shenyaqi9527 commented 4 years ago

[cardano-sl.*production*:Debug:1686] [2020-01-20 05:11:51.96 UTC] verifyBlocksPrefix: 64
[cardano-sl.*production*:Info:1686] [2020-01-20 05:11:51.96 UTC] slogVerifyBlocks: Consensus era is Original
cardano-node: internal error: evacuate: strange closure type 0
    (GHC version 8.4.4 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

erikd commented 4 years ago

I am really beginning to suspect that your machine is having hardware issues. Can you run some form of diagnostic on it?

shenyaqi9527 commented 4 years ago

How do I diagnose it? This is an Aliyun server.

erikd commented 4 years ago

I would try memtest86+ first and then contact your provider.

erikd commented 4 years ago

As a point of reference, I have seen this issue exactly once. I have since restarted and synced to the 135th epoch (and it's still going) without a recurrence of the segfault.

shenyaqi9527 commented 4 years ago

I changed the server and got the same error

erikd commented 4 years ago

What is the git hash?

shenyaqi9527 commented 4 years ago

tag-3.2.0 5d0a227fb Merge #4252

Are there any commands that need to be executed before creation?

erikd commented 4 years ago

No commands are required other than what you have been running.

Mine is running quite happily using up to about 15% of my RAM on a 16G VM.

Maybe try running it on an 8G or 16G machine.

shenyaqi9527 commented 4 years ago

Does nix limit memory usage? I used a machine with 8 GB of memory and got the same result, at the 7832nd slot of the 8th epoch.

erikd commented 4 years ago

Does nix limit memory usage?

I don't think so.

I would still like a list of the epoch/slot info for the first 10 failures.
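On the question of whether a memory limit is involved, a rough way to check from the shell is sketched below. It assumes a Linux host with `/proc` mounted; the helper names `rss_mb` and `show_limits` are hypothetical. For context, the `+RTS -N2 -qg -A1m -I0 -T -RTS` options in the posted `mainnet.sh` do not include `-M`, the GHC runtime flag that caps the heap size, so the script itself does not appear to request a heap limit.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative assumptions; Linux /proc is required.

# Print the resident set size of a process in MiB, read from /proc/<pid>/status.
rss_mb() {
  awk '/^VmRSS:/ { printf "%d\n", $2 / 1024 }' "/proc/$1/status"
}

# Print the shell's own resource limits that could cap a child process's memory.
show_limits() {
  echo "virtual memory limit (ulimit -v): $(ulimit -v)"
  echo "max resident set   (ulimit -m): $(ulimit -m)"
}

# Example (not run here): watch the node's memory while it syncs, and check
# whether the kernel's OOM killer was involved in earlier terminations.
#   rss_mb "$(pgrep -o cardano-node)"
#   show_limits
#   sudo dmesg | grep -i -E 'out of memory|oom-killer'
```

If the limits print `unlimited` and the kernel log shows no OOM-killer entries, a memory cap is a less likely explanation than a crash inside the process.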

shenyaqi9527 commented 4 years ago

I tried a few times:

18035th slot of 26th epoch
6841st slot of 3rd epoch
16250th slot of 3rd epoch
15769th slot of 4th epoch
20023rd slot of 1st epoch
7832nd slot of 8th epoch

erikd commented 4 years ago

So it's not deterministic.

When you moved machines, did you keep the same disk image or reinstall from scratch?

shenyaqi9527 commented 4 years ago

reinstall from scratch

erikd commented 4 years ago

I have just about run out of ideas :cry: .

shenyaqi9527 commented 4 years ago

I can't do anything about it now.

erikd commented 4 years ago

@shenyaqi9527 Does the dmesg output on your machine list any segfaults?

shenyaqi9527 commented 4 years ago

How do I use this command?

erikd commented 4 years ago

sudo dmesg | grep segfault

shenyaqi9527 commented 4 years ago

admin@ada:/$ sudo dmesg | grep segfault
[ 1547.424449] cardano-node:w[5554]: segfault at 840bffe2a0 ip 00007fae70921c55 sp 00007fae6d919a98 error 6 in libc-2.27.so[7fae707d0000+1aa000]
[ 1910.998425] cardano-node:w[5786]: segfault at 8402c16940 ip 00007fca29cefc55 sp 00007fca05bd4a98 error 6 in libc-2.27.so[7fca29b9e000+1aa000]
[ 1991.020628] cardano-node:w[5883]: segfault at 84044bfac0 ip 00007fc7aee9ec55 sp 00007fc77e7f7a98 error 6 in libc-2.27.so[7fc7aed4d000+1aa000]
[ 2427.408873] cardano-node:w[5942]: segfault at 840c2d55c0 ip 00007ff7f4bcfc55 sp 00007ff7dbffaa98 error 6 in libc-2.27.so[7ff7f4a7e000+1aa000]

erikd commented 4 years ago

I got two more in the last 30 minutes. I am seeing something similar to you:

[6124226.626857] cardano-node:w[13755]: segfault at 840554dfc0 ip 00007fa4abdd9d6e sp 00007fa49bffaa98 error 6 in libc-2.27.so[7fa4abc88000+1aa000]
[6138577.260721] cardano-node:w[24086]: segfault at 84079fd1d0 ip 00007f5a12efeb24 sp 00007f5a0a7f7a98 error 6 in libc-2.27.so[7f5a12dad000+1aa000]
[6139714.043880] cardano-node:w[24190]: segfault at 84009cd480 ip 00007fb987390d6e sp 00007fb97e7f7a98 error 6 in libc-2.27.so[7fb98723f000+1aa000]

I'm running this on Debian and I have just noticed there is a libc-2.29 available, so I am going to try upgrading to that.
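To compare such kernel log lines across machines, it can help to convert each faulting instruction pointer into an offset within the library, since the library's load address changes from run to run. The sketch below subtracts the mapping base (the number before `+` in the square brackets) from the `ip` value. The function name `segfault_offsets` is hypothetical, and the parsing assumes the exact line layout shown in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: the function name is an illustrative assumption, and the sed
# pattern is written against the dmesg line format quoted in this thread.

# Read dmesg-style segfault lines on stdin; print "<library>+0x<offset>" per line.
segfault_offsets() {
  sed -n 's/.*segfault at [0-9a-f]* ip \([0-9a-f]*\) .* in \([^[]*\)\[\([0-9a-f]*\)+[0-9a-f]*].*/\1 \2 \3/p' |
  while read -r ip lib base; do
    # Offset of the faulting instruction relative to the library's load address.
    printf '%s+0x%x\n' "$lib" $(( 0x$ip - 0x$base ))
  done
}

# Example (not run here): count how often each offset occurs.
#   sudo dmesg | segfault_offsets | sort | uniq -c
```

Applied by hand to the lines quoted in this thread, the four Ubuntu entries all work out to `libc-2.27.so+0x151c55`, and the three Debian entries to offsets `0x151d6e`, `0x151b24` and `0x151d6e`. These lie within a few hundred bytes of each other, which may point to the same libc routine, though that is only a guess without symbol information. The `error 6` field is the x86 page-fault error code, which here indicates a user-mode write to a page that is not mapped.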

shenyaqi9527 commented 4 years ago

What am I going to do?

erikd commented 4 years ago

Wait for me to report back after I do a complete upgrade of my system, reboot and retest?