rhmaxdotorg / pulsechain-validator

PulseChain Validator Automated Setup Scripts
https://launchpad.pulsechain.com

Out of Disk Space! #37

Closed dkim777 closed 6 months ago

dkim777 commented 8 months ago

So... my validator machine with a 2TB drive ran out of space 3 days ago. I tried to run the prune-geth.sh script but got the following error.

Fatal: Could not open database: write /opt/geth/data/geth/chaindata/12748549.ldb: no space left on device

Any ideas? Maybe I can manually delete some large files and restart the pruning script? Thanks

rhmaxdotorg commented 8 months ago

Sounds like you have indeed run out of disk space... interesting, yep, that's what pruning is supposed to prevent.

Once you run out of space, it's hard to run any apps or scripts. Here's what GPT says, and there are a few things that might make sense in this situation (haven't run into it myself for a long time):

Running out of disk space on a machine running a blockchain node, such as Geth for Ethereum, can indeed cause operational issues including failures to execute maintenance scripts like pruning. The error you're encountering suggests that the database operations required by the prune-geth.sh script cannot proceed due to insufficient disk space. Here are some steps you can take to address this issue:

  1. Backup Important Data: Before you proceed with any file deletion or modification, ensure you have backups of all critical data. This includes wallet files, configuration files, and anything else you cannot afford to lose.

  2. Clear Non-Essential Files: Start with non-blockchain related files. Check your system for large files or directories that are no longer needed. This might include logs, temporary files, or old data backups.

To manually clear some space by removing unnecessary files or directories, you might want to start by identifying large files or directories within your Geth data directory or elsewhere on your device. Here are some commands you can use on a Linux-based system:

  1. Navigate to the Geth Data Directory: Assuming /opt/geth/data is your Geth data directory, you would use:

    cd /opt/geth/data
  2. Check Disk Usage: To find out which files or directories are using the most space, you can use the du command. To see the top directories/files by size in your current directory (and sort them in human-readable format), you can use:

    du -ah . | sort -rh | head -20
  3. Identify Non-essential Large Files: Look for large files or directories that are known to be non-essential. This could include old chain data, logs, or temporary files that are not currently needed.

  4. Remove Unnecessary Files: Once you've identified large, non-essential files, you can remove them using the rm command. For example, if you've identified a large log file that is not needed, you can remove it like so:

    rm /path/to/large/log/file.log

    Warning: Be very careful with the rm command, especially if using wildcard characters (*). Accidentally deleting important files could result in data loss or node malfunction.

  5. Clear Old Logs: If your Geth logs are taking up a lot of space and you're sure you don't need them, you can clear them by simply deleting or compressing them. For instance, to delete all .log files in a logs directory:

    rm /opt/geth/data/logs/*.log
  6. Restart the Pruning Script: After freeing up some space, try running the pruning script again to see if it proceeds successfully.

  7. Monitor Disk Space: Keep an eye on disk space usage to prevent similar issues in the future. You can use df -h to monitor disk space usage.

Note: Always ensure you have backups of critical data before deleting any files. Deleting the wrong file could potentially corrupt your node's data or cause other unforeseen issues.

  1. Reduce Geth Log Size: Geth logs can become quite large over time. If you have logging enabled with verbosity, consider deleting old log files or compressing them if you need to retain the logs for future analysis.

  2. Manually Delete Old Chaindata Files: This is more risky and should be done with caution. Geth stores blockchain data in the chaindata directory. Over time, especially if you're running a full node, this can grow significantly. You can manually delete some of the older ldb files, but this could potentially corrupt your blockchain database, requiring a resynchronization from scratch. If you decide to take this route, ensure you have backups and are prepared for a potential full resync.

  3. Prune the Blockchain: Since your initial attempt to prune the blockchain using prune-geth.sh failed due to disk space issues, after freeing up some space by other means, try running the pruning script again. Pruning removes old state trie data that is no longer necessary for operation, significantly reducing disk space usage.

  4. Increase Disk Space: If possible, consider adding more disk space to your system. This could be through physically adding another drive, resizing your partition (if using a virtual machine and space is available), or migrating to a larger disk.

  5. Monitor Disk Usage: After resolving this immediate issue, it's a good practice to monitor disk usage regularly to avoid similar problems. Tools like du, df, and graphical interfaces like gparted can help manage and monitor disk space usage.

If you are not comfortable performing these steps, especially manually deleting files from the chaindata directory, seek assistance from someone with experience in managing Ethereum nodes or consider asking for help on relevant forums or communities. Safety and data integrity should be your top priorities.
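
As an aside, if you want a simple automated check going forward, here's a minimal cron-able sketch (the threshold and data path are just examples, adjust for your setup):

    #!/bin/bash
    # warn when the filesystem holding the chain data passes 90% used
    THRESHOLD=90
    USAGE=$(df --output=pcent /opt/geth/data | tail -1 | tr -dc '0-9')
    if [ "$USAGE" -ge "$THRESHOLD" ]; then
        echo "WARNING: disk usage at ${USAGE}% on $(hostname)"
    fi

Dropping that into cron (crontab -e) gives you a periodic warning before you hit 100% again.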

dkim777 commented 8 months ago

The easiest way seems to be imaging my full drive to a larger external drive and resizing the LVM partition. It cost $300 more to buy a 4TB drive.
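
Roughly, the plan after cloning onto the bigger disk is to grow the partition, the LVM volume, and the filesystem. A sketch of the commands (device, volume group, and LV names below are just the Ubuntu defaults, so adjust for your own layout):

    # after cloning to the 4TB disk and booting from it
    sudo growpart /dev/sda 3                              # grow the partition holding the LVM PV (growpart is in cloud-guest-utils)
    sudo pvresize /dev/sda3                               # let LVM see the larger partition
    sudo lvextend -l +100%FREE /dev/ubuntu-vg/ubuntu-lv   # grow the logical volume
    sudo resize2fs /dev/ubuntu-vg/ubuntu-lv               # grow the ext4 filesystem
    df -h                                                 # confirm the new size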

rhmaxdotorg commented 8 months ago

Awesome, if you have more notes that you think will help people that may run into the same issue, feel free to comment.

Closing for now.

dkim777 commented 7 months ago

I've finally managed to migrate my full 2TB drive to a 4TB drive and expand the LVM. It's been down roughly 4 weeks, and I restarted my validators a week ago, but they're still not working. Last time, when it went down for 24 hours, it needed about 3 days to recover.

Here are some error messages.

journalctl -u lighthouse-beacon.service -f

info: chain not fully verified, block and attestation production disabled until execution engine syncs, service: slot_notifier
WARN Execution endpoint is not synced

journalctl -u lighthouse-validator.service -f

CRIT Error during attestation routine slot: 2819713, committee_index: 5, error: "Some endpoints failed, num_failed: 1 http://localhost:5052/ => RequestFailed(\"Failed to produce attestation data: ServerMessage(ErrorMessage { code: 500, message: \\"UNHANDLED_ERROR: HeadBlockNotFullyVerified
ERRO No synced beacon nodes

Any idea where I should start troubleshooting? Note that I haven't updated the client or the OS (Ubuntu 22.04.2); it's the original from a year ago.

Thanks!!

rhmaxdotorg commented 7 months ago

Hmm, I would probably use the reset script (which keeps synced blockchain data by default) and install the clients (which will be the latest versions) again.

Unless there's some reason you'd prefer not to do so. That can often solve some subtle bugs, though there is of course some downtime (as discussed on the wiki).
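
Before resetting, it may also be worth confirming whether geth is actually still syncing. A quick check, assuming geth's HTTP RPC is enabled on the default port 8545 and the beacon API is on 5052:

    # execution client (geth) sync status -- returns "false" once fully synced
    curl -s -X POST -H "Content-Type: application/json" \
      --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
      http://localhost:8545

    # beacon node sync status (is_syncing / sync_distance)
    curl -s http://localhost:5052/eth/v1/node/syncing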

dkim777 commented 7 months ago

Are you talking about reset-rpc or reset-validator? If it's reset-validator, what information should I save before I run it and reset everything, any cert keys? Thanks.

rhmaxdotorg commented 7 months ago

reset-validator.sh

It just wipes and reinstalls the clients, saves blockchain data (by default; you can modify the script to remove blockchain data too if it becomes necessary), and you just import the keys and everything again like the usual process.

You can check the wiki for more info / examples
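
Re-importing the keys afterwards is the usual Lighthouse import flow; roughly something like this (the user, paths, and datadir here are assumptions based on the service config, so check the wiki/setup script for the exact command it uses):

    sudo -u node /opt/lighthouse/lighthouse/lh account validator import \
      --network pulsechain \
      --datadir /opt/lighthouse/data \
      --directory ~/validator_keys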

dkim777 commented 7 months ago

Since mine is still the original, should I download a new clone and use the updated reset-validator.sh? Will this remove and install new clients in one fell swoop? Or should I use the old reset-validator.sh and then run update-client.sh? Sorry for the many questions.

rhmaxdotorg commented 7 months ago

no worries.

If you want to reset the validator, you'd use the reset script. But if you want to update the clients only, you'd use the update script (however, check the wiki, because if you used the scripts before July 2023, if I remember correctly, you'd need to do a soft reset anyway).

With your situation, I'd probably do a reset instead of a client update. Again, you can weigh the options and choose accordingly.

dkim777 commented 7 months ago

So, I looked at the reset-validator.sh script and it deletes geth and lighthouse only; it doesn't look like it re-installs them. Does that mean I should follow up by running pulsechain-validator-setup.sh and reset-monitoring.sh? My scripts are from before July 2023. Thanks.

rhmaxdotorg commented 7 months ago

Yes, reset-validator resets the system by removing the validator clients, and afterwards you can run the setup script again.

In addition, if you have monitoring installed, I would probably do this:

1) run the reset-validator script
2) run the reset monitoring script
3) run the setup script
4) run the monitoring setup script

And since you installed prior to July 2023, that should get everything back up to the latest and let you upgrade clients in the future with the upgrade clients script.
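
So from a fresh clone it would look roughly like this (the setup script takes your usual arguments, and the monitoring setup script name is whatever it's called in the current repo, so double-check the README):

    git clone https://github.com/rhmaxdotorg/pulsechain-validator.git
    cd pulsechain-validator
    ./reset-validator.sh
    ./reset-monitoring.sh
    ./pulsechain-validator-setup.sh   # plus your usual arguments (fee recipient, etc.)
    # then run the monitoring setup script from the repo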

dkim777 commented 6 months ago

I reset the validator and monitoring, then ran the setup script and monitoring script. It took 7 days to sync the blockchain, but now I'm getting these error messages.

journalctl -u geth.service -f

Apr 21 16:38:22 validator3 geth[867]: WARN [04-21|16:38:22.908] Post-merge network, but no beacon client seen. Please launch one to follow the chain!
Apr 21 16:43:22 validator3 geth[867]: WARN [04-21|16:43:22.950] Post-merge network, but no beacon client seen. Please launch one to follow the chain!

journalctl -u lighthouse-beacon.service -f

Apr 21 16:45:03 validator3 systemd[1]: lighthouse-beacon.service: Main process exited, code=exited, status=1/FAILURE
Apr 21 16:45:03 validator3 systemd[1]: lighthouse-beacon.service: Failed with result 'exit-code'.
Apr 21 16:45:09 validator3 systemd[1]: lighthouse-beacon.service: Scheduled restart job, restart counter is at 87.
Apr 21 16:45:09 validator3 systemd[1]: Stopped Lighthouse Beacon.
Apr 21 16:45:09 validator3 systemd[1]: Started Lighthouse Beacon.
Apr 21 16:45:09 validator3 lh[2114]: error: The argument '--metrics' was provided more than once, but cannot be used multiple times
Apr 21 16:45:09 validator3 lh[2114]: USAGE:
Apr 21 16:45:09 validator3 lh[2114]: lh beacon_node --auto-compact-db --builder-fallback-epochs-since-finalization --builder-fallback-skips --builder-fallback-skips-per-epoch --builder-profit-threshold --checkpoint-sync-url --checkpoint-sync-url-timeout --datadir --debug-level --enr-address ... --enr-tcp-port --enr-udp-port --eth1-blocks-per-log-query --execution-endpoint --execution-jwt --execution-timeout-multiplier --fork-choice-before-proposal-timeout --http --http-address --http-port --listen-address ... --logfile-debug-level --logfile-max-number --logfile-max-size --metrics --metrics-address --metrics-port --network --port --port6 --prune-payloads --slasher-broadcast --suggested-fee-recipient --validator-monitor-auto
Apr 21 16:45:09 validator3 lh[2114]: For more information try --help
Apr 21 16:45:09 validator3 systemd[1]: lighthouse-beacon.service: Main process exited, code=exited, status=1/FAILURE
Apr 21 16:45:09 validator3 systemd[1]: lighthouse-beacon.service: Failed with result 'exit-code'.

journalctl -u lighthouse-validator.service -f

Apr 21 16:45:24 validator3 systemd[1]: lighthouse-validator.service: Failed with result 'exit-code'.
Apr 21 16:45:29 validator3 systemd[1]: lighthouse-validator.service: Scheduled restart job, restart counter is at 91.
Apr 21 16:45:29 validator3 systemd[1]: Stopped Lighthouse Validator.
Apr 21 16:45:29 validator3 systemd[1]: Started Lighthouse Validator.
Apr 21 16:45:29 validator3 lh[2131]: error: The argument '--metrics' was provided more than once, but cannot be used multiple times
Apr 21 16:45:29 validator3 lh[2131]: USAGE:
Apr 21 16:45:29 validator3 lh[2131]: lh validator_client --debug-level --http-port --latency-measurement-service --logfile-debug-level --logfile-max-number --logfile-max-size --metrics --metrics-address --metrics-port --network --suggested-fee-recipient --validator-registration-batch-size
Apr 21 16:45:29 validator3 lh[2131]: For more information try --help
Apr 21 16:45:29 validator3 systemd[1]: lighthouse-validator.service: Main process exited, code=exited, status=1/FAILURE
Apr 21 16:45:29 validator3 systemd[1]: lighthouse-validator.service: Failed with result 'exit-code'.

rhmaxdotorg commented 6 months ago

Looks like the service config parameters got messed up, which can happen when the monitoring/setup scripts are run at different times or things get confused.

You can manually fix it by editing the service file for the failing client (looks like lighthouse-beacon.service) and removing the extra command line args (there appear to be multiple --metrics parameters, whatever the monitoring script keeps adding; there should only be one set of them).

A correct config looks something like this...

[Unit]
Description=Lighthouse Beacon
After=network.target
Wants=network.target

[Service]
User=node
Group=node
Type=simple
Restart=always
RestartSec=5
ExecStart=/opt/lighthouse/lighthouse/lh bn --network pulsechain --datadir=/opt/lighthouse/data/beacon --execution-endpoint=http://localhost:8551 --execution-jwt=/var/lib/jwt/secret --enr-address=XXXXX --enr-tcp-port=9000 --enr-udp-port=9000 --suggested-fee-recipient=XXXXX --checkpoint-sync-url=https://checkpoint.pulsechain.com --http --metrics --validator-monitor-auto

[Install]
WantedBy=multi-user.target

Then I think the service commands are...

$ sudo systemctl daemon-reload
$ sudo systemctl restart lighthouse-beacon.service

And check logs again to see if it worked.
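
If it helps, you can also view and edit the unit file directly through systemd (just a convenience, same effect as editing the file by hand):

    systemctl cat lighthouse-beacon.service                # show the unit as systemd sees it
    sudo systemctl edit --full lighthouse-beacon.service   # edit it in place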

dkim777 commented 6 months ago

I ran the monitoring install script twice, so that's probably why. I've edited all three service files (geth, lighthouse-beacon, and lighthouse-validator) and restarted the daemon and services. No more errors, but it doesn't seem to generate any yield (withdrawals). Any ideas?

journalctl -u geth.service -f

Apr 22 12:59:55 validator3 geth[34723]: INFO [04-22|12:59:55.727] Imported new potential chain segment number=20,182,633 hash=b4ddc7..f8db01 blocks=1 txs=130 mgas=21.089 elapsed=97.927ms mgasps=215.350 snapdiffs=6.30MiB triedirty=1018.51MiB
Apr 22 12:59:55 validator3 geth[34723]: INFO [04-22|12:59:55.755] Chain head was updated number=20,182,633 hash=b4ddc7..f8db01 root=aad3c0..1be789 elapsed=2.006112ms
Apr 22 13:00:05 validator3 geth[34723]: INFO [04-22|13:00:05.396] Imported new potential chain segment number=20,182,634 hash=acc733..77f62e blocks=1 txs=82 mgas=12.247 elapsed=67.798ms mgasps=180.636 snapdiffs=6.33MiB triedirty=1018.53MiB
Apr 22 13:00:05 validator3 geth[34723]: INFO [04-22|13:00:05.422] Chain head was updated number=20,182,634 hash=acc733..77f62e root=2c823a..cf4a1b elapsed=1.283782ms

journalctl -u lighthouse-beacon.service -f

Apr 22 13:03:20 validator3 lh[34618]: Apr 22 17:03:20.000 INFO Synced slot: 3001984, block: 0x8e5e…7951, epoch: 93812, finalized_epoch: 93810, finalized_root: 0x738b…9270, exec_hash: 0x9072…42c2 (verified), peers: 79, service: slot_notifier
Apr 22 13:03:25 validator3 lh[34618]: Apr 22 17:03:25.404 INFO New block received root: 0x78c0b4ab5a2699170ab63949f64f7ccf0802f49bc39af1b0faec5b64c6e2a3f9, slot: 3001985
Apr 22 13:03:30 validator3 lh[34618]: Apr 22 17:03:30.001 INFO Synced slot: 3001985, block: 0x78c0…a3f9, epoch: 93812, finalized_epoch: 93810, finalized_root: 0x738b…9270, exec_hash: 0x72cb…17ff (verified), peers: 80, service: slot_notifier
Apr 22 13:03:35 validator3 lh[34618]: Apr 22 17:03:35.286 INFO New block received root: 0x7c58b82c812104900793218dd509e026f5b50d0455eef1eed860a0182d3cb824, slot: 3001986
Apr 22 13:03:40 validator3 lh[34618]: Apr 22 17:03:40.001 INFO Synced slot: 3001986, block: 0x7c58…b824, epoch: 93812, finalized_epoch: 93810, finalized_root: 0x738b…9270, exec_hash: 0x7712…79d7 (verified), peers: 80, service: slot_notifier
Apr 22 13:03:45 validator3 lh[34618]: Apr 22 17:03:45.272 INFO New block received root: 0xaa8be21764bd700febc758f15d931b53c176f2af1b2cbc74c67e918d78bc2101, slot: 3001987

journalctl -u lighthouse-validator.service -f

Apr 22 13:05:00 validator3 lh[34606]: Apr 22 17:05:00.000 INFO Connected to beacon node(s) synced: 1, available: 1, total: 1, service: notifier
Apr 22 13:05:00 validator3 lh[34606]: Apr 22 17:05:00.002 INFO All validators active slot: 3001994, epoch: 93812, total_validators: 3, active_validators: 3, current_epoch_proposers: 0, service: notifier
Apr 22 13:05:10 validator3 lh[34606]: Apr 22 17:05:10.001 INFO Connected to beacon node(s) synced: 1, available: 1, total: 1, service: notifier
Apr 22 13:05:10 validator3 lh[34606]: Apr 22 17:05:10.001 INFO All validators active slot: 3001995, epoch: 93812, total_validators: 3, active_validators: 3, current_epoch_proposers: 0, service: notifier
Apr 22 13:05:20 validator3 lh[34606]: Apr 22 17:05:20.000 INFO Connected to beacon node(s) synced: 1, available: 1, total: 1, service: notifier
Apr 22 13:05:20 validator3 lh[34606]: Apr 22 17:05:20.001 INFO All validators active slot: 3001996, epoch: 93812, total_validators: 3, active_validators: 3, current_epoch_proposers: 0, service: notifier

rhmaxdotorg commented 6 months ago

Where are you checking for yield?

You can check using Beacon Explorer to see: https://twitter.com/rhmaximalist/status/1781324036217454683
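
You can also ask your own beacon node directly for the validator's balance and status (the last path segment is a placeholder for your validator's index or public key):

    curl -s http://localhost:5052/eth/v1/beacon/states/head/validators/<validator-index-or-pubkey>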

dkim777 commented 6 months ago

You're right about the Beacon Explorer. [screenshot]

I assumed the validator was not working because I'm not getting any deposits, but it looks like it's working. Does it still take time for deposits to reach my account because I was down for 2 months and there are some sort of penalties?

rhmaxdotorg commented 6 months ago

Great!

Hmm, I believe the balance should build back up over time to 32m and then the excess PLS gets auto-withdrawn as determined by the network.

dkim777 commented 6 months ago

Thanks for all your help! Finally, I'm assuming that I can just update the OS (apt update && apt upgrade) and run update-client.sh? Or should I update the OS AND the cloned directory first (git pull) and then run the updated update-client.sh? Thanks again; without your help, this recovery would have been really difficult.


rhmaxdotorg commented 6 months ago

Sure thing.

I think updating the OS is fine, but there's always a chance something breaks (so I don't do it often, unless there are critical security or usability updates, which are rare).

If you installed the clients from scratch recently, they should already be on the latest version, so no need to update them.

So if you just rebuilt the validator and want to update the OS, I don't have any concerns. However, I don't personally do it often and usually just update my clients when there's something important to address and pull updates for.
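
If you do go ahead with the OS update, it's just the standard Ubuntu flow:

    sudo apt update && sudo apt upgrade
    sudo reboot   # only needed if a kernel or core library update was applied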