sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

VC blocks on `SlashingDatabase::open` when running with NSSM on Windows #2394

Open remyroy opened 3 years ago

remyroy commented 3 years ago

Description

The validator client (VC) blocks in SlashingDatabase::open when run as a service under NSSM on Windows. It never returns from SlashingDatabase::open, so it does not run properly and cannot attest.

Version

Lighthouse v1.4.0-rc.0-f6280aa
BLS Library: blst
Specs: mainnet (true), minimal (false), v0.12.3 (false)

Unstable
Windows 10 (10.0.19043 Build 19043)
rustc 1.52.1 (9bc8c42bb 2021-05-09)
commit f6280aa66308bbec590f3c1f2857f46d79b4af94
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30037 for x64
NSSM 2.24-101-g897c7ad 64-bit 2017-04-26

Present Behaviour

When running with NSSM as a service on Windows, the VC starts, displays the following logs:

Jun 03 09:27:58.374 INFO Lighthouse started                      version: Lighthouse/v1.4.0-rc.0-f6280aa
Jun 03 09:27:58.375 INFO Configured for network                  name: prater
Jun 03 09:27:58.375 INFO Starting validator client               validator_dir: "C:\\ethereum\\var\\lib\\lighthouse\\validator\\validators", beacon_nodes: ["http://localhost:5051/"]
Jun 03 09:27:58.375 INFO HTTP metrics server is disabled
Jun 03 09:27:58.378 INFO Completed validator discovery           new_validators: 0
Jun 03 09:27:59.361 INFO Enabled validator                       voting_pubkey: 0x8fbb8e380977350eac38a66903c09b67f12f7f7794276d3e997b427f0bfb24180ca2deacb6da907a856d770babf268ff
Jun 03 09:28:00.283 INFO Modified key_cache saved successfully
Jun 03 09:28:00.283 INFO Initialized validators                  enabled: 1, disabled: 0

and then blocks. Debugging a little further shows that it blocks on entering SlashingDatabase::open.

Expected Behaviour

VC should run fine even under NSSM as a service on Windows just like it does when it does not run under NSSM.

Steps to reproduce

  1. Download and execute the Microsoft C++ Build Tools installer.
  2. Check the following checkboxes:
    • C++/CLI support for v142 build tools (Latest)
    • MSVC v142 - VS 2019 C++ x64/x86 build tools (Latest)
    • Windows 10 SDK (10.0.19041.0)
  3. Download and execute the Rust installer for Windows.
  4. Download and execute Git for Windows.
  5. Open a PowerShell Prompt as Administrator (Press ⊞ Win+R, type powershell, press Ctrl+⇧ Shift+↵ Enter and click Yes at the User Account Control window)
  6. Copy and paste the following command in your PowerShell Prompt and press ↵ Enter:
    Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
  7. Once Chocolatey is installed, close your PowerShell Prompt window.
  8. Open a Command Prompt as Administrator (Press ⊞ Win+R, type cmd, press Ctrl+⇧ Shift+↵ Enter and click Yes at the User Account Control window)
  9. Type each of these commands in your Command Prompt window (each line is a different command, you must press ↵ Enter at the end of the line):
    choco install nssm
    choco install make
    choco install cmake --installargs 'ADD_CMAKE_TO_PATH=System'
  10. During the execution of that last command, you will be prompted to run a script. Press Y and press ↵ Enter to run it.
  11. Close your Command Prompt window.
  12. Open a normal Command Prompt (Press ⊞ Win+R, type cmd, press ↵ Enter).
  13. Type each of these commands in your Command Prompt window (each line is a different command, you must press ↵ Enter at the end of the line):
    git clone https://github.com/sigp/lighthouse.git
    cd lighthouse
    git checkout unstable
    make
    mkdir c:\ethereum\bin
    copy %UserProfile%\.cargo\bin\lighthouse.exe c:\ethereum\bin
    mkdir C:\ethereum\var\log
    mkdir C:\ethereum\var\lib\lighthouse\validator
  14. Generate a valid keystore for the Prater network and import it with something like:
    c:\ethereum\bin\lighthouse.exe account_manager validator import --network prater --datadir C:\ethereum\var\lib\lighthouse\validator --directory validator_keys
  15. Run a beacon node for Prater and make it available on http://localhost:5051
  16. Type each of these commands in your Command Prompt window (each line is a different command, you must press ↵ Enter at the end of the line):
    nssm install lighthousevalidator C:\ethereum\bin\lighthouse.exe vc --network prater --datadir C:\ethereum\var\lib\lighthouse\validator --beacon-nodes http://localhost:5051
    nssm set lighthousevalidator DisplayName "Lighthouse Validator Client (Prater)"
    nssm set lighthousevalidator AppRotateFiles 1
    nssm set lighthousevalidator AppRotateSeconds 86400
    nssm set lighthousevalidator AppRotateBytes 10485760
    nssm set lighthousevalidator AppStdout C:\ethereum\var\log\lighthousevalidator-service-stdout.log
    nssm set lighthousevalidator AppStderr C:\ethereum\var\log\lighthousevalidator-service-stderr.log
    nssm start lighthousevalidator
  17. Notice that the VC is blocked and does not work as intended. Inspect C:\ethereum\var\log\lighthousevalidator-service-stderr.log to find that the last log message is something like:
    Jun 03 09:28:00.283 INFO Initialized validators                  enabled: 1, disabled: 0

remyroy commented 3 years ago

NSSM source code can be found on https://git.nssm.cc/nssm/nssm if that can help.

michaelsproul commented 3 years ago

I've experimented with NSSM and I think it's a file permissions issue. I couldn't recreate the exact hang that you got, but I did get this error when I tried starting the service after importing the key as a non-administrator user:

$ nssm start lighthousevalidator
lighthousevalidator: Unexpected status SERVICE_STOPPED in response to START control.

I think I also got a similar error importing the key as admin before starting the service (will have to recheck this on Monday). The flow that definitely worked was:

Let me know if that works for you.

The only suspect thing I found in our code was that we call Path::exists, which masks permissions errors. I'll switch it to using Path::metadata so that the permissions error surfaces in open_or_create.

For reference: https://doc.rust-lang.org/std/path/struct.Path.html#method.metadata
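To illustrate the difference, here is a minimal sketch, not the actual Lighthouse code. `Path::exists` returns `false` for any error, including a permissions error, so the caller cannot tell "missing" from "inaccessible". Matching on the result of `std::fs::metadata` keeps the two cases apart. The `probe_db_path` helper and the `DbPathState` enum are hypothetical names used only for this example.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical result type for this sketch: whether the path is present.
#[derive(Debug, PartialEq)]
enum DbPathState {
    Exists,
    Missing,
}

/// Hypothetical helper (not Lighthouse's API): report whether `path` exists,
/// but propagate every error other than "not found" (for example a
/// permissions error) instead of silently treating it as "missing".
fn probe_db_path(path: &Path) -> io::Result<DbPathState> {
    match fs::metadata(path) {
        Ok(_) => Ok(DbPathState::Exists),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(DbPathState::Missing),
        Err(e) => Err(e),
    }
}

fn main() {
    // The file name is only an example; adjust it to the real database path.
    let path = Path::new("slashing_protection.sqlite");
    match probe_db_path(path) {
        Ok(DbPathState::Exists) => println!("{} exists, would open it", path.display()),
        Ok(DbPathState::Missing) => println!("{} is missing, would create it", path.display()),
        Err(e) => eprintln!("cannot access {}: {}", path.display(), e),
    }
}
```

With this shape, an open-or-create routine only takes the "create" branch when the file is really absent, and any other failure reaches the caller as an error.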

remyroy commented 3 years ago

It really seems like a file permission issue. For some reason the permissions on C:\ethereum\var\lib\lighthouse\validator\validators\slashing_protection.sqlite were not what I expected:

icacls.exe C:\ethereum\var\lib\lighthouse\validator\validators\slashing_protection.sqlite
C:\ethereum\var\lib\lighthouse\validator\validators\slashing_protection.sqlite OWNER RIGHTS:(R,W,D,WDAC,WO)

By adding the SYSTEM account (the account under which services normally run) with full control on the slashing_protection.sqlite file, I got the NSSM service to start correctly.

I just tested importing my validator keystore again as a normal user with lighthouse.exe account_manager validator import --network prater --datadir C:\ethereum\var\lib\lighthouse\validator --directory validator_keys. That process creates the slashing_protection.sqlite file with these unexpected permissions, which makes the VC block if you run it under a different account than the one that ran the account_manager validator import command.

It would be nice if the VC would error out with a message instead of blocking if it does not have permission to access the slashing_protection.sqlite file on Windows. I'm not sure the default OWNER RIGHTS permission on the slashing_protection.sqlite file is needed. I think those permissions could be relaxed.
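As a rough sketch of the "error out with a message" behaviour requested above, a process could try to open the file for reading and writing before handing it to the database layer, and turn a failure into a descriptive message. This is an assumption about one possible approach, not how Lighthouse implements it. `check_rw_access` is a hypothetical helper, and the sketch does not try to explain why the VC blocks rather than fails.

```rust
use std::fs::OpenOptions;
use std::io::ErrorKind;
use std::path::Path;

/// Hypothetical pre-flight check (not Lighthouse's actual code): open the
/// file for reading and writing, and map a failure to a readable message.
fn check_rw_access(path: &Path) -> Result<(), String> {
    match OpenOptions::new().read(true).write(true).open(path) {
        // The handle is dropped right away; only the access check matters here.
        Ok(_file) => Ok(()),
        Err(e) if e.kind() == ErrorKind::PermissionDenied => Err(format!(
            "permission denied opening {}: check that the account running the service can read and write this file",
            path.display()
        )),
        Err(e) => Err(format!("unable to open {}: {}", path.display(), e)),
    }
}

fn main() {
    // The file name is only an example; adjust it to the real database path.
    let path = Path::new("slashing_protection.sqlite");
    if let Err(msg) = check_rw_access(path) {
        eprintln!("{}", msg);
        std::process::exit(1);
    }
    println!("{} is readable and writable", path.display());
}
```

Run under the service account, a check like this would end up in the stderr log that NSSM captures, so the cause is visible there instead of the log simply stopping.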

paulhauner commented 3 years ago

> It would be nice if the VC would error out with a message instead of blocking if it does not have permission to access the slashing_protection.sqlite file on Windows.

It looks like we have a solution to this over in https://github.com/sigp/lighthouse/pull/2436 :tada:

remyroy commented 3 years ago

Seems great! Don't forget to relax the permissions on the slashing_protection.sqlite file. There is no need for them to be as restricted as they are now when created on Windows.

michaelsproul commented 3 years ago

Yeah, let's leave this issue open as a way to track the permissions changes for Windows

xanatos commented 2 years ago

I'll add that even the logger sets the log files as owner-only:

https://github.com/sigp/lighthouse/blob/28aceaa213ad77a6a226242f12ea92b9fcac7342/lighthouse/environment/src/lib.rs#L223

This is quite annoying if you want to run lighthouse as a service and then use another user to look at the logs.