Closed: Sipheren closed this issue 1 year ago
Same here - I didn't get the line 17 syntax error, but did get the rest of the update errors you had. It stopped and nothing else was updated.
ARE YOU SURE YOU WANT TO GO OFFLINE TO STOP, UPDATE AND RESTART PULSECHAIN CLIENTS ON THE VALIDATOR?
- it could take 30 - 60 minutes to complete -- depending mostly on bandwidth and server specs *
Hit [Enter] to Continue OR Ctrl+C to Cancel
Step 1: Stop PulseChain clients (Geth and Lighthouse) [sudo] password for xxxxxxxx: sudo: node: command not found
Step 2: Pull updates and rebuild clients
fatal: not a git repository (or any of the parent directories): .git info: using existing install for 'stable-x86_64-unknown-linux-gnu' info: default toolchain set to 'stable-x86_64-unknown-linux-gnu'
stable-x86_64-unknown-linux-gnu unchanged - rustc 1.69.0 (84c898d65 2023-04-16)
fatal: not a git repository (or any of the parent directories): .git
Step 3: Starting PulseChain clients
Process is complete
@gazaz, @Sipheren,
There are actually several problems.
Line 42: `sudo -u $NODE_USER $NODE_USER -c` ...etc.
Should read: `sudo -u $NODE_USER bash -c` ...etc.
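For anyone curious why the broken line produced `sudo: node: command not found`, here's a sudo-free sketch of the difference (the `NODE_USER` value is just illustrative):

```shell
#!/bin/bash
# With the typo, the username itself is executed as a program, which is
# exactly why the log above shows "sudo: node: command not found".
# The fixed form hands the command string to bash -c instead.
NODE_USER=node

# Broken form (sudo dropped for the demo): tries to run a program named "node"
broken=$("$NODE_USER" -c 'echo hi' 2>/dev/null || echo "typo form fails to run")

# Fixed form: bash -c executes the command string
fixed=$(bash -c 'echo "hi from bash -c"')

echo "$broken"
echo "$fixed"
```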
Additionally, there is a flaw in the setup script that causes this script's git pull to fail: the setup doesn't copy the hidden .git repository metadata to the destination /opt/geth or /opt/lighthouse.
I would add the following to the [config] section (line 41) of pulsechain-validator-setup.sh:
`shopt -s dotglob`
One might take for granted that this is a common option for admins to enable, and it might already be set in some people's ~/.bashrc.
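To see why this matters for the setup script's `mv $REPO/*` step, here's a throwaway demo (temp paths only): without `dotglob`, `*` never matches hidden entries like `.git`.

```shell
#!/bin/bash
# Why `mv repo/*` silently drops .git: by default, `*` does not match
# hidden entries. shopt -s dotglob changes that.
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/src/.git"
touch "$demo/src/file.txt"

# Default globbing: .git is invisible to *
without_dotglob=$(cd "$demo/src" && echo *)

shopt -s dotglob   # now * matches dot-files too
with_dotglob=$(cd "$demo/src" && echo *)
shopt -u dotglob

echo "without dotglob: $without_dotglob"
echo "with dotglob:    $with_dotglob"
rm -rf "$demo"
```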
Solution:
Rebuild the server from scratch, or manually remove and rebuild the geth and lighthouse repositories. Follow the steps in the setup script, but add the `shopt -s dotglob`.
Don't forget to re-add the symlink - see pulsechain-validator-setup.sh, line 199.
Thanks all for the details!
And of course apologies, I guess the update script wasn't tested as well as I thought after I made some major updates to the setup script a while back. Working on it!
Great insight on the shopt as well! Will update and test scripts again.
update-client.sh has been updated to fix the primary bug, which was the typo:
https://github.com/rhmaxdotorg/pulsechain-validator/commit/cf3bba6755822ae72542e7d9acb79230821edfab
Also updated the setup script:
https://github.com/rhmaxdotorg/pulsechain-validator/commit/547a6e4f3e3e3e505203e57b646f8664b99414d0
Gimme some time and I'll work on the steps and post them so you don't have to rebuild the clients, ETA today sometime if all goes well.
Awesome, thanks for all the replies.
Brilliant thanks for working on it !
Worked a treat thanks, validator updated and back printing PLS :)
Thanks for the hard work.
@wishbonesr thank you so much for the detailed analysis, super helpful.
It looks like because the script does (line 137)
sudo mv $GETH_REPO_NAME/* $GETH_DIR
(instead of using cp -R most places, which I think it used to be in the past) it's not picking up the hidden git files and then the script removes them from the home directory, so no chance to copy them over post-setup (which could have been a quick solution for people now).So I do agree the most straightforward way going forward is to do a quick rebuild, which thankfully should take around the same time and running the update client would have. It's fast because by default the reset-validator.sh script keeps blockchain data, so no need to resync everything (however it may take a few hours or more to sync back what it needs). The only other thing is you'll need to import your keys to lighthouse again, which should only take a few minutes for most folks (who aren't running 100+ validators, if you are, sorry for the 100+ times you need to type in your password, but thank you very, very much for your service :).
Here's what I just did to update my validator.
1. Do a `git pull` in the pulsechain-validator scripts directory to upgrade to the latest scripts (or delete the scripts directory and `git clone https://github.com/rhmaxdotorg/pulsechain-validator.git` to download the scripts folder again, then `chmod +x *.sh` in the directory to make them executable)
2. Do `pico reset-validator.sh` and change line 6 from `I_KNOW_WHAT_I_AM_DOING=false` to `I_KNOW_WHAT_I_AM_DOING=true`, then `ctrl+x` to exit; it will ask you to save, so say `y` and hit Enter to save the changes.
3. Now run `./reset-validator.sh` and hit Enter to reset the validator (may take up to 30 seconds)
4. Run the setup again (if you want, find your original setup command in history with `grep pulsechain-validator-setup.sh ~/.bash_history`) with `./pulsechain-validator-setup.sh 0xfee-address 11.22.ip.addr`
5. Copy your validator_keys (the first command below assumes they are in your /home/ubuntu directory; if not, modify the command to copy them from wherever they are located) to the node user's home directory and import them.
sudo cp -R ~/validator_keys /home/node
sudo chown -R node:node /home/node/validator_keys
sudo -u node bash
cd ~
/opt/lighthouse/lighthouse/lh account validator import --directory ~/validator_keys --network=pulsechain
After you've entered the wallet password for each validator, the process should complete successfully and you should see:
Successfully imported keystore.
Successfully updated validator_definitions.yml.
Successfully imported X validators (0 skipped).
Then start the lighthouse validator client and after the (hopefully brief) syncing completes, you should be back to validating!
sudo systemctl start lighthouse-validator
You can check the status of the clients using this command (hitting Enter or spacebar to scroll down for more status information):
sudo systemctl status geth lighthouse-beacon lighthouse-validator
So just to be clear: if you want to update your clients, the update script will not work without doing the quick rebuild. That means you get the updated scripts, reset the validator (it keeps most of the blockchain data automatically), and run the setup script again; then you will be back to validating on the latest and greatest network on earth (with the latest client updates too, as the script pulls and uses the new clients).
After a few hours, the validator was back to Active, 99% effectiveness and earning fees again.
Again, apologies for the issue updating clients; of course, I pushed fixes as soon as I could.
Thanks for this, I am looking to go through this myself on the weekend. I have run the reset script once before during testing, works fine and doesn't take all that long to be back up and synced.
Also, major thanks for providing this repo and maintaining it in the first place, was wasting a lot of time trying to get everything setup manually or using dockers, this script was a godsend :)
Cheers
Also, there have been suggestions of a way to avoid the "quick rebuild" by doing something like this.
1. Pull the latest `update-client.sh` script (fixed typo that prevented it from running properly)
2. Clone the geth and lighthouse repos (may require cloning at the specific version you're running, for example v2.2.0)
3. Copy the hidden files to the appropriate /opt client home directories
4. Run the `update-client.sh` script
If anyone decides to tinker with this and verify it works, feel free to let us know the process. However, even with a few hours downtime of the validator, it still seems like the safest/most tested method as of now is the minimal rebuild process described in the prior post.
Question: after the validator-reset script is run, should the Grafana dashboards all just kick back in, or do I need to remove and re-run that script as well?
They are all there and set up how I like, but none seem to be getting any data - guessing the new install of geth and company doesn't link up to the db or something?
EDIT: All good, just had to re-add the metrics flags and such to the .service files, reload the daemon and restart the services. :)
In case some folks find themselves unable to update this repo on their node (because you needed to edit the safety switches in the scripts), git will tell you that you need to commit first. Since this is a one-way operation, it's OK to set the HEAD of the clone back to the last clone/pull. Do this to avoid having to purge and re-clone via the URL. For example, do this in your local "pulsechain-validator" repository folder:
git reset --hard
git pull
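Here's a tiny local dry-run of that tip, using throwaway temp repos (the file name and its contents are made up for the demo): a local edit to a tracked file is discarded by `git reset --hard`, so the subsequent `git pull` proceeds cleanly.

```shell
#!/bin/bash
# Demo: flip a "safety switch" in a tracked file, then reset --hard so the
# work tree matches HEAD again and git pull no longer complains.
set -e
demo=$(mktemp -d)
git init -q "$demo/upstream"
cd "$demo/upstream"
git config user.email demo@example.com
git config user.name demo
echo "I_KNOW_WHAT_I_AM_DOING=false" > reset-validator.sh
git add . && git commit -qm initial

git clone -q "$demo/upstream" "$demo/clone"
cd "$demo/clone"
echo "I_KNOW_WHAT_I_AM_DOING=true" > reset-validator.sh  # local edit

git reset --hard -q   # throw away the local edit
git pull -q           # now pulls without complaints
result=$(grep -c "false" reset-validator.sh)  # file matches upstream again
echo "$result"
cd / && rm -rf "$demo"
```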
@rhmaxdotorg, to help close this out, I spun up a new instance and went through all the steps, plus the update. All appears fine, and this issue is resolved. Though the corrections appear to have addressed this issue, just so you know, the lighthouse dependency on libsecp256k1 is not compiling on this EC2 instance. I spun up a VBox VM (same OS/version) with no issues.
EC2 - test instance error during install and/or update:
make
cargo install --path lighthouse --force --locked \
--features "jemalloc" \
--profile "release" \
Installing lighthouse v2.3.0 (/opt/lighthouse/lighthouse)
Updating crates.io index
warning: package `hermit-abi v0.3.1` in Cargo.lock is yanked in registry `crates-io`, consider running without --locked
Compiling libsecp256k1 v0.7.1
error: could not compile `libsecp256k1` (lib)
Caused by:
process didn't exit successfully: `rustc --crate-name libsecp256k1 --edition=2018 /home/node/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libsecp256k1-0.7.1/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --diagnostic-width=132 --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --cfg 'feature="default"' --cfg 'feature="hmac"' --cfg 'feature="hmac-drbg"' --cfg 'feature="sha2"' --cfg 'feature="static-context"' --cfg 'feature="std"' --cfg 'feature="typenum"' -C metadata=46bf597dfbc4d6db -C extra-filename=-46bf597dfbc4d6db --out-dir /opt/lighthouse/target/release/deps -L dependency=/opt/lighthouse/target/release/deps --extern arrayref=/opt/lighthouse/target/release/deps/libarrayref-160eae64aec5652b.rmeta --extern base64=/opt/lighthouse/target/release/deps/libbase64-9730bbaf724d0579.rmeta --extern digest=/opt/lighthouse/target/release/deps/libdigest-f552d1e4eba7ee5f.rmeta --extern hmac_drbg=/opt/lighthouse/target/release/deps/libhmac_drbg-db97445cb76fa586.rmeta --extern libsecp256k1_core=/opt/lighthouse/target/release/deps/liblibsecp256k1_core-e3baea4ccd3f1c72.rmeta --extern rand=/opt/lighthouse/target/release/deps/librand-01696e725a8c1af3.rmeta --extern serde=/opt/lighthouse/target/release/deps/libserde-c6f169afd9b20a69.rmeta --extern sha2=/opt/lighthouse/target/release/deps/libsha2-4b6da5a5c83321d4.rmeta --extern typenum=/opt/lighthouse/target/release/deps/libtypenum-8b1200c22ebb4b54.rmeta --cap-lints allow` (signal: 9, SIGKILL: kill)
error: failed to compile `lighthouse v2.3.0 (/opt/lighthouse/lighthouse)`, intermediate artifacts can be found at `/opt/lighthouse/target`
make: *** [Makefile:48: install] Error 101
Good to hear, @wishbonesr!
For the EC2 instance, were you using Ubuntu Linux (22.04) or Amazon Linux (probably the first choice/default)?
I've only tested/supported Ubuntu 22.04 for the script, so it may work on other OSes, but not strictly supported.
However, from googling, it looks like `sudo apt-get install libsecp256k1-0` would install it via the system package manager on Ubuntu, but I'm not sure about other Linux distros.
If you did come across this using Ubuntu 22.04, this package could be added to the APT_PACKAGES list in the script. Let me know if you want to test that while you're seeing the error; it doesn't seem like it would hurt to add it anyway, but I'm just trying to keep packages as minimal as necessary to support the validators.
@rhmaxdotorg, it was Ubuntu 22.04. I'll probably give it another shot this weekend, as I already terminated the test instance last night. I have a feeling a tweak of the lighthouse Makefile will also be required to eliminate libsecp256k1 if installing straight from jammy ports... but yeah, I'll get back to you on that.
Note: I could only afford one validator, and it's already up and running (so no urgency on my part) - just wanted to help out. If I can replicate the lighthouse compile failure, I'll start another issue for libsecp256k1, so this issue isn't polluted, and can be closed.
@wishbonesr
I ran into the same issue when using the t2.micro free-tier EC2 instance; the issue went away when I switched to the same instance type I have for my validator, which leads me to believe the compilation error is caused by resource constraints.
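That theory fits the `(signal: 9, SIGKILL: kill)` in the log above, which is typically the kernel OOM killer. A quick pre-flight check before building is to look at available memory; the 4 GB threshold below is only an assumption for lighthouse builds, not a documented requirement:

```shell
#!/bin/bash
# Pre-flight memory check: SIGKILL during cargo builds usually means the
# kernel OOM killer fired. Threshold is an assumed 4 GB, adjust to taste.
need_kb=$((4 * 1024 * 1024))
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
if [ "$avail_kb" -lt "$need_kb" ]; then
  echo "only ${avail_kb} kB available; build may be OOM-killed"
else
  echo "memory looks sufficient for the build"
fi
```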
On the other hand, I was able to test fixing the git repos and rebuilding the clients on a test instance. The steps I took were:
1. Change to the node user: `sudo -u node bash`, then `cd ~`
2. `git clone https://gitlab.com/pulsechaincom/lighthouse-pulse`
3. `git clone https://gitlab.com/pulsechaincom/go-pulse`
4. Copy the `.git` folder: `cp -R {repo}/.git /opt/{geth|lighthouse}`
5. Go to /opt/geth and /opt/lighthouse and run `git reset --hard`
6. Add the `data` folder to `.gitignore` for both repos (since the next step removes untracked files, and we effectively fast-forwarded the current HEAD to main by doing the hard reset)
7. Remove the files and folders that are no longer tracked by running `git clean -f -d` (if we don't remove untracked files, then at least for geth there will be duplicate structs, causing a compilation error)
8. Run the update script from this repo
9. After running the update script, copy the binary: `cp /opt/lighthouse/target/release/lighthouse /opt/lighthouse/lighthouse/lh` (Is this normal after upgrading? Can someone please confirm?)
10. Clean up the cloned repos
Because of that last binary-copy step, I have not run these steps on my actual validator. Can someone please confirm that the new lighthouse binary is not where the service config file expects it to be? Or is there something wrong with this method?
OK, I have run the above process on a live validator service and wrote a helper script. Make sure to run it after switching users: `sudo -u node bash`.
The following method avoids the first alternative and prevents you from having to re-add all the keys, so it is handy if you have a lot of validators. Also, see the closing thoughts for ideas on how to reduce validator downtime during updates by about 99%.
#!/bin/bash
set -eo pipefail
cd /home/node
function cleanup() {
echo "Cleanup started"
git reset --hard # set all files back to match `main`
echo -e "\ndata/" >> .gitignore # add the data folder to .gitignore so it's not cleaned up by the following command
echo -e "\nlighthouse/lh" >> .gitignore # add the symbolic link created by the setup script to .gitignore
git clean -f -d # remove files that are no longer in `main` and not tracked by git
git reset --hard # reset .gitignore
}
echo "Cloning go pulse repo..."
git clone https://gitlab.com/pulsechaincom/go-pulse
cp -R go-pulse/.git /opt/geth
pushd /opt/geth
cleanup
popd
echo "Cloning lighthouse repo..."
git clone https://gitlab.com/pulsechaincom/lighthouse-pulse
cp -R lighthouse-pulse/.git /opt/lighthouse
pushd /opt/lighthouse
cleanup
popd
echo "Removing cloned repos..."
rm -rf go-pulse && rm -rf lighthouse-pulse
echo "Done now run the update script"
After it's done, you can pull the latest from the repo (https://github.com/rhmaxdotorg/pulsechain-validator) and run the `update-client.sh` script.
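As a sanity check, the .git transplant idea above can be dry-run locally in throwaway repos before touching a real validator. All names below (main.go, stray.txt, data/) are made up for the demo; the flow mirrors the helper script: copy `.git` in, `reset --hard`, shield `data/` in `.gitignore`, `clean`, then reset the ignore edit.

```shell
#!/bin/bash
# Local dry-run of the .git transplant: a "deployed" tree that lost its .git
# gets the metadata from a fresh clone, then reset + clean restore it to
# match the repo while sparing the data/ directory.
set -e
demo=$(mktemp -d); cd "$demo"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo
echo v1 > main.go
echo "*.o" > .gitignore
git add . && git commit -qm v1
cd "$demo"
# Simulate the setup script's `mv repo/*`, which dropped the hidden .git
mkdir deployed && cp repo/main.go repo/.gitignore deployed/
echo junk > deployed/stray.txt   # leftover untracked file
mkdir deployed/data              # "blockchain data" we must not delete
git clone -q repo fresh
cp -R fresh/.git deployed/       # transplant the metadata
cd deployed
git reset --hard -q              # tracked files now match HEAD
echo "data/" >> .gitignore       # shield data/ from the clean below
git clean -f -d -q               # drop untracked leftovers (stray.txt)
git reset --hard -q              # undo the temporary .gitignore edit
if [ -d data ] && [ ! -e stray.txt ]; then msg="transplant ok"; else msg="transplant failed"; fi
echo "$msg"
cd / && rm -rf "$demo"
```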
The fact that the lighthouse binary is created elsewhere gives us the flexibility to update it with barely any downtime. Rust builds are dog slow, taking about 45 minutes to 1 hour, whereas Go builds (go-pulse [geth]) are very fast and don't have this problem. In the future, we could leverage `make` to put the builds in a different location, so we could keep the lights on until it's time to replace the binary. This would take the update process from 45 minutes to 1 hour of downtime down to under 30 seconds: enough time to stop the clients, copy/replace the new binaries, and restart them again.
Right now PLS is cheap, but it could be expensive to be down for an hour if you have a lot of validators or if PLS moons; the idea above would solve this.
Awesome! Thanks for the script and interesting details in the closing thoughts @nicogranuja!
Just a question on this part:
WARNING
If you set up your validators using the setup script from this repo, the lighthouse binary is expected to be at /opt/lighthouse/lighthouse/lh. However, the new binary will not be there, and if you cleaned up this binary using the script above, you will need to add this line before restarting clients in update-client.sh (line 53).
Just to clarify, are you suggesting any code changes to the update script?
Or is adding the `sudo -u $NODE_USER bash...` line to update-client.sh only necessary if someone uses the helper script you shared?
Setting up a validator with the setup script and then running the update-client.sh script after a new version is released shouldn't affect the `lh` link, but there are a few pieces to this, and the setup and update scripts have some differences. So maybe add this before "Starting PulseChain clients" @ https://github.com/rhmaxdotorg/pulsechain-validator/blob/main/update-client.sh#L54C1-L54C48
`sudo -u $NODE_USER ln -s /opt/lighthouse/target/release/lighthouse /opt/lighthouse/lighthouse/lh`
since the script and service file want the latest lighthouse binary to point to /opt/lighthouse/lighthouse/lh.
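For anyone unfamiliar with the mechanics here, this is the symlink pattern in a throwaway directory (all paths are temp stand-ins for the real /opt layout): the stable `lh` path resolves to whatever binary the link currently points at, so service files never need to change.

```shell
#!/bin/bash
# ln -s demo mirroring the lh link: reading through the link name gives the
# freshly built binary's content.
set -e
demo=$(mktemp -d)
mkdir -p "$demo/target/release" "$demo/lighthouse"
echo "freshly built lighthouse" > "$demo/target/release/lighthouse"
ln -s "$demo/target/release/lighthouse" "$demo/lighthouse/lh"
linked=$(cat "$demo/lighthouse/lh")   # reads through the symlink
echo "$linked"
rm -rf "$demo"
```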
I wondered if you tested/saw this or otherwise agree (since you seem to have a better testing environment than me right now :)
Misc
Checking your Lighthouse and Geth versions
$ sudo /opt/lighthouse/lighthouse/lh --version
Lighthouse Lighthouse-Pulse/v2.3.0-de8e0a0
$ /opt/geth/build/bin/geth --version
geth version 3.0.0-pulse-stable-7975e02e
@rhmaxdotorg
"Just to clarify, are you suggesting any code changes to the update script?"
Yes, I will add more details at the end.
"Or is adding the `sudo -u $NODE_USER bash...` line to update-client.sh only necessary if someone uses the helper script you shared?"
Actually, yes. It turns out the symbolic link created by the setup script here: https://github.com/rhmaxdotorg/pulsechain-validator/blob/646a642d7414c3fbebafcb02f5ab4dcc4c338afb/pulsechain-validator-setup.sh#L201C8-L201C8 was being deleted since it is an untracked file. I have updated my comment above to temporarily add it to the `.gitignore` file so that it is not cleaned up.
As the steps of setting up a validator with the setup script and then running the update-client.sh script after a new version is released shouldn't affect the lh link, but there's a few pieces to this and the setup and update script have some differences, so maybe add this before Starting PulseChain clients @ https://github.com/rhmaxdotorg/pulsechain-validator/blob/main/update-client.sh#L54C1-L54C48 sudo -u $NODE_USER ln -s /opt/lighthouse/target/release/lighthouse /opt/lighthouse/lighthouse/lh
Thanks for the Misc section. I have confirmed my suspicions: as it turns out, the Makefile for lighthouse (or Rust itself) puts the binary in two places, /opt/lighthouse/target/release/lighthouse and the place used in the setup script, /home/$NODE_USER/.cargo/bin/lighthouse.
node:~$ .cargo/bin/lighthouse --version
Lighthouse Lighthouse-Pulse/v2.3.0-de8e0a0
BLS library: blst
SHA256 hardware acceleration: true
Allocator: jemalloc
Profile: release
Specs: mainnet (true), minimal (false), gnosis (false), pulsechain (true)
node:~$ /opt/lighthouse/target/release/lighthouse --version
Lighthouse Lighthouse-Pulse/v2.3.0-de8e0a0
BLS library: blst
SHA256 hardware acceleration: true
Allocator: jemalloc
Profile: release
Specs: mainnet (true), minimal (false), gnosis (false), pulsechain (true)
TL;DR: everything looks good; there is no issue with the setup script's symbolic-link approach, as long as we don't clear the untracked file.
I will propose, and if time permits raise a merge request for, a way to reduce downtime while upgrading the clients. The idea is simple: leverage the fact that we can "rug" the binaries while they are running, start the build process without stopping the clients, and afterwards restart all 3 clients so they come back up using the updated binaries. I still need to test this, but AFAIK it should work with no problem; please let me know your thoughts.
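The "rug" part of that idea rests on a Linux property worth spelling out: a running process keeps the old file open via its inode, so atomically replacing the file with `mv` doesn't disturb it until restart. A throwaway sketch, with a sleeping script standing in for geth/lighthouse:

```shell
#!/bin/bash
# Demo: swap a "binary" out from under a running process, then show the
# next start picks up the new version while the old one keeps running.
set -e
demo=$(mktemp -d)
printf '#!/bin/bash\nsleep 3\n' > "$demo/client"
chmod +x "$demo/client"
"$demo/client" &       # the "old" client is running
pid=$!
printf '#!/bin/bash\necho "v2 running"\n' > "$demo/client.new"
chmod +x "$demo/client.new"
mv "$demo/client.new" "$demo/client"   # atomic swap on the same filesystem
kill -0 "$pid" && echo "old client unaffected by the swap"
restarted=$("$demo/client")            # a restart picks up the new version
echo "$restarted"
kill "$pid" 2>/dev/null || true
rm -rf "$demo"
```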
@nicogranuja excellent, thank you!
Just one more thing to clarify:
Do you think `sudo -u $NODE_USER ln -s /opt/lighthouse/target/release/lighthouse /opt/lighthouse/lighthouse/lh` needs to be added to update-client.sh, or will it work fine as-is?
Again, not sure if you tested this scenario yet and some of these scenarios are harder for me to test than others.
Do you think sudo -u $NODE_USER ln -s /opt/lighthouse/target/release/lighthouse /opt/lighthouse/lighthouse/lh needs to be added to update-client.sh or it will work fine as-is?
No, the current symbolic link should work just fine. I verified that the /opt/lighthouse/target/release/lighthouse and /home/$NODE_USER/.cargo/bin/lighthouse binaries are in fact the same, so everything should work the same. The only reason I had to do it is that I was running a `git reset --hard`, which removed the symbolic link from the lighthouse repo, so I had to recreate it. By the way, I updated my scrappy fix script in my comments above to also add this link to the `.gitignore` so it can be run safely without having to alter the `update-client.sh` script.
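For anyone wanting to repeat that "binaries are in fact the same" check on their own box, `cmp` is a quick way to confirm two files are byte-identical; demoed here on temp files so it runs anywhere (on a validator you'd point it at the two lighthouse paths above instead):

```shell
#!/bin/bash
# Byte-for-byte comparison of two files with cmp -s (silent, exit status only).
set -e
a=$(mktemp); b=$(mktemp)
printf 'same bytes' > "$a"
printf 'same bytes' > "$b"
if cmp -s "$a" "$b"; then verdict="identical"; else verdict="differ"; fi
echo "$verdict"
rm -f "$a" "$b"
```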
Gotcha, that makes sense!
Appreciate the details and the script, which give people options to get the clients up to date (if using the old setup script): 1) "quick rebuild" the validator, or 2) the "git reset" script you wrote.
I'll close this thread since it seems like we've captured a lot of the important notes and feedback, but feel free to ping it or open a new thread if there's more stuff or ideas.
Hi,
I tried to use the update script and ended up with a few issues; I would appreciate some help if possible.
Firstly, the script wouldn't run, as line 17 has an issue:
So I just commented out this line:
So I don't think it ended up doing anything.
Thanks