rhmaxdotorg / pulsechain-validator

PulseChain Validator Automated Setup Scripts
https://launchpad.pulsechain.com

Suggestion: do-snapshot.sh #24

Closed wishbonesr closed 1 year ago

wishbonesr commented 1 year ago

Use multi-threading when taking the snapshot. This is for Ubuntu only, and there are downsides...not

Add: sudo apt install pixz

Add (in general) comments at each step - otherwise there is no indication of progress, even at service stop. Additionally, I would suggest monitoring such a lengthy process by opening a second tmux pane (CTRL+B "), then running htop in one pane and do-snapshot.sh in the other. Unfortunately, tmux panes/windows can't be scripted to my knowledge - I would love to learn otherwise (see the sketch below).

As well, with tmux or screen, the script/commands will complete even if you get disconnected from your remote server.
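
For what it's worth, tmux can be driven non-interactively; a minimal sketch (the session name "snapshot" and the script path are just examples):

# start a detached session, split it, and run htop next to the snapshot script
tmux new-session -d -s snapshot
tmux split-window -h -t snapshot
tmux send-keys -t snapshot:0.0 'htop' C-m
tmux send-keys -t snapshot:0.1 './do-snapshot.sh' C-m
tmux attach -t snapshot    # detach with CTRL+B d; both panes keep running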

Change Line 60: sudo -u node bash -c "time tar -I pixz -cf $GETH_SNAPSHOT $GETH_DATA &>/dev/null"

Change Line 63: sudo -u node bash -c "time tar -I pixz -cf $LIGHTHOUSE_SNAPSHOT $LIGHTHOUSE_BEACON_DATA &>/dev/null"


rhmaxdotorg commented 1 year ago

Good suggestions!

Just to clarify, these are the proposed changes:

  1. Add pixz to the package install list @ https://github.com/rhmaxdotorg/pulsechain-validator/blob/main/do-snapshot.sh#L54
  2. Add echo statements for progress/steps prior to running the operations on lines 60 & 63
  3. Add documentation to the snapshot section of the README recommending running the script in a tmux terminal (in case of disconnect, to prevent losing progress and having to re-run the script), with htop in another terminal to monitor processes

Does that sound right?

A couple of questions:

  1. As I'm not familiar with using pixz for parallelization here, have you tested the command line yourself (or modified the script and have a pull request)?
  2. Did you use time to measure the difference between running with and without pixz? I'd be interested to know those stats too.

wishbonesr commented 1 year ago

Clarified proposal: You nailed it. You might want to add that if you need to get back to a disconnected session, you can list your running sessions with tmux ls and attach to a running session with tmux attach -t {sessionNumber}. Or see this wonderful article breaking down tmux: https://www.sitepoint.com/tmux-a-simple-start
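
For example, with a named session like the one sketched earlier:

$ tmux ls                  # list running sessions
$ tmux attach -t snapshot  # reattach by name (or use the session number shown by tmux ls)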

Edit the comments section of do-snapshot.sh to include the modified extraction steps:

# - Then you can run the following commands ON THE NEW SERVER
#
# $ sudo systemctl stop geth lighthouse-beacon lighthouse-validator
# $ pixz -d lighthouse.tar.xz
# $ tar -xf lighthouse.tar
# $ pixz -d geth.tar.xz
# $ tar -xf geth.tar
# $ sudo cp -Rf opt /
# $ sudo chown -R node:node /opt
# $ sudo systemctl start geth lighthouse-beacon lighthouse-validator
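
Alternatively, GNU tar can hand the decompression off to pixz in one step, skipping the intermediate .tar on disk - a sketch, assuming pixz is installed on the new server:

$ tar -I pixz -xf geth.tar.xz
$ tar -I pixz -xf lighthouse.tar.xz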

Answers:

  1. It's my first time using multi-threaded tar compression as well. I stopped the single-threaded instance after 8 hours. My validator is taking enough hits being down, and I need to get off of AWS as quickly as possible - moving to Hetzner dedicated hardware for $65/month. That is all to say that it's still running, which is why I didn't make a pull request. It would be desirable for someone else to test, or at least for my run to finish, before requesting a pull.
  2. I used time to prevent the script from moving on to the next step, as the parallelization happens in the background. We still need the tar to finish its creation before moving on to the next step. time wraps the parallel step and waits for it to complete, with the added benefit of stats.

geth is still compressing (1st tar) and has been going for 12 hours so far on my 8-core EC2 instance. I will post the output of the multi-threading method when finished. I won't be going back to test without multi-threading (because I need to shut down the EC2 instance, which is forecasting $1200 this month), but the expectation is roughly 1/3 of the single-threaded time. https://www.peterdavehello.org/2015/02/use-multi-threads-to-compress-files-when-taring-something/

FYI - Here's the status of the tar so far. I pruned the chain before starting, so I'm not certain what size to expect in the end.

ubuntu@________:~$ ls -lh /tmp/*.xz
-rw-rw-r-- 1 node node 463G Jul  9 20:17 /tmp/geth.tar.xz

It's worth noting that after 8 hours of running single-threaded, the geth tar only made it to 6.4G before I killed it. The above shows that after 12 hours the multi-threaded tar is about 72x further along (463G / 6.4G), or roughly 48x faster hour for hour ((463G / 12h) / (6.4G / 8h)).

Update: Took 18 hours.

-rw-rw-r-- 1 ubuntu ubuntu 585G Jul  9 23:38 geth.tar.xz
-rw-rw-r-- 1 ubuntu ubuntu 770M Jul  9 23:40 lighthouse.tar.xz

Note: Even with the service restart on the source machine, I was falling behind in slots, but geth was synced in 5 min. So I rebooted, and everything came back to a synced state.

Now copying across the internet. ETA 10hr. The estimate was accurate - 10hr to copy.

Currently extracting on the destination machine. I am somewhat cautious about the geth nodekey /opt/geth/data/geth/nodekey, as well as /opt/geth/data/geth/nodes/.

Update: 2.8 hours to decompress to tar, 2.5 hours to untar.

Update: Delay in testing the server migration. Discovered that the new server is a bare-bones Ubuntu 22.04 which did not have snap installed. It took me way too long to identify that go was never installed. I scp'd the archives to the node user's home... so the reset script wiped out the archives :( Lesson: read all of the scripts to know exactly what they are doing. lol.

Update: I don't know, man. tar -tf geth.tar shows the last sst file in the tarball is 3779264.sst. After extracting, the last file in ~/opt/geth/data/geth/chaindata/ is 3750082.sst. I'm just confused at this point.

I mean... it's further along than starting from scratch, but it still looks like a minimum of a few days to achieve sync. Current block 0x10eace6, highest block 0x10f4ba7.

Lighthouse sync'd in 2 minutes.

rhmaxdotorg commented 1 year ago

Let me know how it goes!

OK to upgrade the snapshot script at some point, just hoping you can test things out while you're in the middle of things as I have limited time/resources to spend on bigger changes right now.

wishbonesr commented 1 year ago

Understood. I'm going to keep updating the post above as I go. The scp copy finished about an hour ago, and I am decompressing now. It should be four more hours, so I probably won't finish the test today. Although the comments and README do say to look for errors in the setup script, I might also suggest adding a check for snap (sudo apt-get install snapd -y) just before the call to install go with snap.
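
Something along these lines just before the go install (a sketch; the setup script may want it structured differently):

# make sure snapd is present before the script calls 'snap install go'
if ! command -v snap >/dev/null 2>&1; then
    sudo apt-get install -y snapd
fi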

My motivation to finish this is the $1200 bill that AWS is building up to. The AWS cost tool is up to $210 now, so the clock is ticking. :)

rhmaxdotorg commented 1 year ago

Well, I want to assume everyone is using Ubuntu (where Snap is default) as I'm avoiding supporting other Linux distros for now.

And yes, I'm evaluating Digital Ocean as a cheaper alternative to AWS for validators. AWS is great, but a cheaper cloud should be able to run validators just fine.

wishbonesr commented 1 year ago

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu

$ hostnamectl
 Static hostname: hetz-8core-caddog-val-1
       Icon name: computer-desktop
         Chassis: desktop
      Machine ID: 26835c80ce964b03ae0366b18c8ae6b9
         Boot ID: c50bba03bdda4a0db9805d5f6a60b5d4
Operating System: Ubuntu 22.04.2 LTS
          Kernel: Linux 5.15.0-76-generic
    Architecture: x86-64
 Hardware Vendor: FUJITSU

I believe this was an Ubuntu bare-bones install. I can't explain any other reason for snap not to be there. I'm going to do a local manual install of Ubuntu Server bare-bones and see if snap is also missing. Otherwise it could be the provider that removed it in recovery mode.

Not to turn this into an ad, but Hetzner server auctions repurpose servers in their data center, and host them. They're not located in the US, but there are some fantastic specs available for <$80/month with unlimited bandwidth. Notice above, it is a dedicated machine.

Previous comment post has the updates as I move along testing this. Note the tarball mention. Any thoughts on that?

rhmaxdotorg commented 1 year ago

Do you know if it was an Ubuntu Server install vs Ubuntu Desktop? I could believe the server edition wouldn't come with it by default, but I would think both would have it... either way, I've added snap to the APT_PACKAGES list.

https://github.com/rhmaxdotorg/pulsechain-validator/commit/bd21d37a0ae366d09763cb83574ade8f15b35cfb

For the snapshot script, I've updated the README with notes on tmux.

I'll work on this and attach some new code to test.

rhmaxdotorg commented 1 year ago

Ok @wishbonesr can you try this one out?

I'm testing the script itself now, but if that works well, would like some help testing the modified migration instructions to make sure they work exactly as the original.

#!/usr/bin/env bash
#
# PulseChain Snapshot helper script to backup blockchain data for transferring from one server to another
#
# Description
# - Takes a snapshot of blockchain data on a fully synced validator so it can be copied over and
# used to bootstrap a new validator -- clients must be stopped until the snapshot completes,
# afterwards they will be restarted so the validator can resume normal operation
#
# Environment
# - Tested on Ubuntu 22.04 (validator server) running Geth and Lighthouse clients
#
# What to do after running this script
# - Copy the geth.tar.xz and lighthouse.tar.xz (compressed like ZIP files) over to the new validator
# server (see scp demo below OR use a USB stick)
#
# $ scp geth.tar.xz ubuntu@new-validator-server-ip:/home/ubuntu
# $ scp lighthouse.tar.xz ubuntu@new-validator-server-ip:/home/ubuntu
#
# - Then you can run the following commands ON THE NEW SERVER
#
# $ sudo systemctl stop geth lighthouse-beacon lighthouse-validator
# $ pixz -d geth.tar.xz
# $ tar -xf geth.tar
# $ pixz -d lighthouse.tar.xz
# $ tar -xf lighthouse.tar
# $ sudo cp -Rf opt /
# $ sudo chown -R node:node /opt
# $ sudo systemctl start geth lighthouse-beacon lighthouse-validator
#
# Note: this should work fine for Ethereum too as it's just copying the blockchain data directories
# for Geth and Lighthouse, but the scenario is technically untested; also, this relies on the new
# validator setup (which you are copying the snapshot to) to be setup with this repo's setup script
#

GETH_DATA="/opt/geth/data"
LIGHTHOUSE_BEACON_DATA="/opt/lighthouse/data/beacon"

LANDING_DIR=$HOME # default (change as needed)
TMP_DIR="/tmp/"

GETH_SNAPSHOT=$TMP_DIR"geth.tar.xz"
LIGHTHOUSE_SNAPSHOT=$TMP_DIR"lighthouse.tar.xz"

trap sigint INT

function sigint() {
    exit 1
}

echo -e "ARE YOU SURE YOU WANT TO TEMPORARILY STOP CLIENTS TO TAKE A SNAPSHOT ON THE VALIDATOR?\n"
echo -e "* it could take anywhere from a few hours to a couple days to complete -- depending mostly on blockchain data size and server specs *\n"
read -p "Press [Enter] to Continue"

# install xz and pixz (if not already installed)
sudo apt install -y xz-utils pixz

# stop client services
#sudo systemctl stop geth lighthouse-beacon lighthouse-validator

echo -e "\nstep 1: taking snapshot of geth data\n"

# compress geth directory
#sudo -u node bash -c "tar -cJf $GETH_SNAPSHOT $GETH_DATA &>/dev/null"
sudo -u node bash -c "time tar -Ipixz -cf $GETH_SNAPSHOT $GETH_DATA &>/dev/null"

echo -e "\nstep 2: taking snapshot of lighthouse data"

# compress lighthouse directory
#sudo -u node bash -c "tar -cJf $LIGHTHOUSE_SNAPSHOT $LIGHTHOUSE_BEACON_DATA &>/dev/null"
sudo -u node bash -c "time tar -Ipixz -cf $LIGHTHOUSE_SNAPSHOT $LIGHTHOUSE_BEACON_DATA &>/dev/null"

# fix perms
sudo chown -R $USER:$USER $GETH_SNAPSHOT
sudo chown -R $USER:$USER $LIGHTHOUSE_SNAPSHOT

# move snapshots to landing directory
mv $GETH_SNAPSHOT $LANDING_DIR
mv $LIGHTHOUSE_SNAPSHOT $LANDING_DIR

# start client services
#sudo systemctl start geth lighthouse-beacon lighthouse-validator

echo -e "\nProcess is complete"
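
Side note (not part of the script): before copying the archives off the server, a quick sanity check is to list each tarball through pixz and make sure it reads to the end, e.g.

# a truncated or corrupt archive should error out here
tar -I pixz -tf "$HOME/geth.tar.xz" > /dev/null && echo "geth archive OK"
tar -I pixz -tf "$HOME/lighthouse.tar.xz" > /dev/null && echo "lighthouse archive OK"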

wishbonesr commented 1 year ago

snap is present on an Ubuntu minimized install in the VBox (that I just stood up). So it must be something that Hetzner either removes or replaces with their own customized build. But I appreciate the addition (something I won't have to keep track of in the future).

NOTE: you chose snap as the package name; it should be snapd.

I will start the test today, but also starting from a scratch build, as I'm done halting my active validator :). I should have something to report tomorrow or late today, as I won't let geth get all the way to a synced state - just a few hours of sync past the fork epoch.

It will work as is, but I would change the file extension for geth.tar.xz to geth.tar.pixz (as well for lighthouse), so that it's clear you unpack this differently from xz. While pixz is based on xz compression, it will not decompress with xz.

rhmaxdotorg commented 1 year ago

Oh, good catch (although I think snap refers to the same package set or a similar one); changed it to snapd.

Also, I've been running the script for 12+ hours and it's still going (on geth data).

A big change in this new version of the script is that it doesn't stop the client services as I'm curious if it can work/complete the process without downtime.

Let me know how it goes!

wishbonesr commented 1 year ago

snap: apt install snap fails; it suggests snapd. You changed it already, so my confirming is redundant.

Snapshot w/o stopping: Here's why most suggest stopping. Since geth 1.8.x, to reduce the number of disk writes (2MB database files), geth caches the db in memory (the size depends on available memory). Take note of logs mentioning "dirty". If you don't shut geth down gracefully, the service does not write several days of chain data to disk. On one machine that's ~400MB (~4 days); on another it's 1GB (~1 week).

If you're going to pursue this method, then I would suggest gracefully shutting down geth and restarting it, to get the in-memory portion written to disk. That way you'll only be missing the geth db files from the window when the tarball is being created; otherwise you could be out the dirty cache plus the tar generation time. And you'll only be down for one attestation.
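
Something like this right before kicking off the tar (just a sketch, not something I've tested end to end):

$ sudo systemctl stop geth     # graceful shutdown flushes the dirty cache to disk
$ sudo systemctl start geth    # bring it right back up; roughly one missed attestation
# ...then run do-snapshot.sh against the freshly flushed chaindata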

btw - I've been reading up on the /ancient folder. Thinking there might be some advantage play here.

Not quite a conclusion: There are three distinct scenarios:

  1. You are migrating a server to better specs or a different provider.
  2. You are standing up a new machine.
  3. You are running a backup server to come online if the active machine has a failure.

My goal was the first.

Snapshot: While the tarball sounded like a good idea, the validator was offline for around 2 days. The copy to the new validator took around 1 day, extraction took a couple of days, and achieving sync has now taken 3 days. At this point, waiting 7 days for a full sync would not have been so bad after all.

This will cross the line eventually as the chain history of Pulse grows beyond the checkpoint-sync point. At that point, I believe a snapshot will be worth its weight in gold; but right now it is not beneficial (imo).

New idea: rsync. I'm running this test now, and doing it while the source validator is active. It's not necessarily saving time, but geth will not be halted until a second or even third rsync is needed. The first rsync has been running for two whole days, and I expect it to finish during the night. The second rsync should only take a few hours. Then I will shut down geth and perform the last rsync. I'm not using checksums on the first two: sudo -u node bash -c "rsync -rlEAgtzP --size-only /opt/lighthouse/ /home/node/mnt/val2/lighthouse" So I cheated a little bit by using sshfs to mount the remote to a local node folder. It could work with rsync straight over SSH; I just chose to do it this way. Notice I'm not using the archive switch, because of all the switches that -a implies, like security and symlinks. The switches I am using (see the sketch after this list):

  • -r recursive
  • -l copy symlinks as symlinks
  • -E preserve executability
  • -A preserve ACLs
  • -g preserve group
  • -t preserve modification times
  • -z compress file data during the transfer
  • -P same as --partial --progress
  • --size-only don't use checksum or date; use the size of the file to decide whether to overwrite. The third run will probably use checksum (-c)
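
Roughly, the staged plan looks like this (the geth destination path mirrors my lighthouse one above, and the final checksum pass is planned, not yet run):

# passes 1-2: copy while geth is still running, size-only comparison
sudo -u node bash -c "rsync -rlEAgtzP --size-only /opt/geth/ /home/node/mnt/val2/geth"
sudo -u node bash -c "rsync -rlEAgtzP --size-only /opt/lighthouse/ /home/node/mnt/val2/lighthouse"

# final pass: stop the clients so geth flushes its cache, then re-run with checksums
sudo systemctl stop geth lighthouse-beacon lighthouse-validator
sudo -u node bash -c "rsync -rlEAgtzP -c /opt/geth/ /home/node/mnt/val2/geth"
sudo -u node bash -c "rsync -rlEAgtzP -c /opt/lighthouse/ /home/node/mnt/val2/lighthouse"
sudo systemctl start geth lighthouse-beacon lighthouse-validator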

Finally: questioning snapshot in its entirety... By every account, using checkpoint sync means you can start validating within minutes, while the beacon continues to direct geth to backfill. https://checkpoint.pulsechain.com is updated. If not that, is it for some other reason? I realize a full sync can take several days (mine took 7), but is a full sync needed with checkpoint sync? What are the mechanics?!

I don't have another 32M PLS to test this theory... but I will do one more machine with checkpoint sync, disable the active node (migration), and immediately activate the new one. If in a few hours the new validator isn't performing validations, then I've misunderstood something. If it doesn't work, then my tandem test with rsync may prove to be of use.

Reference:

wishbonesr commented 1 year ago

Please be patient with me, as I'm a noob, and am only just now discovering.

With my brain telling me that using checkpoint sync implied getting the node to an active validator in a few minutes (see the URLs above), I started parsing the logs and noticed lighthouse-beacon reporting:

WARN Remote BN does not support EIP-4881 fast deposit sync, error: fetching deposit snapshot from remote: ServerMessage(ErrorMessage { code: 415, message: "unsupported content-type: application/octet-stream", stacktraces: [] }), service: beacon
ERRO Error updating deposit contract cache                  error: Failed to get remote head and new block ranges: EndpointError(FarBehind), retry millis: 60000, service: deposit_contract_rpc
INFO Beacon chain initialized                               head_slot: 579488, head_block: 0xd71b...38ed, head_state: 0x63b4...b3fd, service: beacon

and the key is that the service state appears to revert to a sync mode/state, which defeats the purpose of a checkpoint: INFO Sync state updated new_state: Syncing Head Chain, old_state: Stalled, service: sync

https://rustrepo.com/repo/sigp-lighthouse-rust-concurrency - search the page for "Deposit Snapshot Sync". It doesn't appear that the checkpoint URL supports the protocol required for the original intent of the checkpoint method.

A different project discussed implementing the new checkpoint feature around the time the Lighthouse checkpoint feature was added: https://github.com/ethpandaops/checkpointz/issues/74 We should be seeing 15 minutes to attestation - at least that's what you could expect on Ethereum.
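
For context, the consensus-side flag being discussed is Lighthouse's checkpoint sync option; roughly like this (the execution endpoint and JWT path are placeholders - the repo's lighthouse-beacon service definition is the real source of truth):

# illustrative only: pointing a Lighthouse beacon node at a checkpoint server
lighthouse bn \
  --checkpoint-sync-url https://checkpoint.pulsechain.com \
  --execution-endpoint http://localhost:8551 \
  --execution-jwt /path/to/jwt.hex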

Who runs https://checkpoint.pulsechain.com ?

rhmaxdotorg commented 1 year ago

Yeah, I wondered if there would be any db corruption from copying live, but that's why I wanted folks to test it and find out before calling it good, as the migration steps take time and spare cycles to go through the whole process.

However, during my timing testing it's still taking "forever" to snapshot geth.

step 1: taking snapshot of geth data

real    2951m34.764s
user    10113m32.863s
sys 107m54.793s
step 2: taking snapshot of lighthouse data

real    6m6.080s
user    20m48.026s
sys 0m14.959s

Process is complete

2951 minutes = ~49 hours

I'm not sure what it was taking before, but I think around the same. So for some reason I didn't notice a shortened compression time using pixz...

Anyways, I think I'm going to pause on testing/researching this one as I'm getting occupied with other work that's taking some of my free time.

rhmaxdotorg commented 1 year ago

Hmm interesting. As far as who runs checkpoint.pulsechain.com, I'd say the PLS devs.

Are you in t.me/PulseDev or want to reach out there?

wishbonesr commented 1 year ago

I am not. It seemed like an invite-only kind of thing. I've stood up seven different machines (one active), and all are showing this at the head of the log. I'd be curious what the start of the geth and lighthouse-beacon logs show for others - whether this is actually the same for everybody and simply no one has complained.

Other two experiments still have a few days to complete.

rhmaxdotorg commented 1 year ago

@wishbonesr any other updates on this one?

I'd like to close it out and keep it as reference in case either I or someone else has bandwidth to work on or test a new version of the snapshot script, great notes along the way.

wishbonesr commented 1 year ago

Still running tests. As you've seen, it can take a significant amount of time to get a new machine synced traditionally, and then stop services, archive, transfer, unarchive, and resync.

1st round - the second machine failed with so many errors indicating the geth data was corrupted. The only resolution was to delete /opt/geth/data/geth/chaindata and start over from the archive... but the archive data appeared to be corrupt as well. This process took many days to accomplish.

I don't think using an archive is useful unless you have NVMe drives directly attached and 16+ cores. I don't have the $$ to test that kind of hardware.

2nd test: a fully synced machine running rsync (or osync) to a second machine until synced. Stop the primary validator, and rsync one last time.

1st round - interrupted by the overwhelming cost of the EC2 instance (>$1200). I'm restarting this test on capped machines that won't exceed $100/month and still have 8 cores & 64 GB RAM.

3rd test: personal checkpoint validator. The intended behavior of a checkpoint server is for the node to begin validating within an hour. This is simply not the case with the official Pulse checkpoint server, as it returns an invalid response to the geth service. This could be a bug in the forked geth project, or it could be an issue with the checkpoint implementation.

Waiting on this machine to sync. Once synced, I will install the checkpoint package and test a new machine against my own checkpoint server.

wishbonesr commented 1 year ago

@rhmaxdotorg FYI - Once this fix is merged, it's going to change everything. https://gitlab.com/pulsechaincom/lighthouse-pulse/-/issues/4

It's insane how successful Pulse has been even while validators take a week or more to start attesting, when folks could have been up and running in less than an hour.

rhmaxdotorg commented 1 year ago

Woah! Nice, that certainly looks helpful.