opentensor / subtensor

Bittensor Blockchain Layer

Testnet Node Failures #444

Closed · distributedstatemachine closed this 4 months ago

distributedstatemachine commented 4 months ago

Description

We are experiencing an issue where our testnet nodes fail with the following error:

2024-05-21 00:38:27.132  INFO main sc_cli::runner: 💾 Database: RocksDb at /var/lib/subtensor/chains/bittensor/db/full    
2024-05-21 00:38:27.132  INFO main sc_cli::runner: ⛓  Native runtime: node-subtensor-143 (node-subtensor-1.tx1.au1)    
Error: Service(Client(Backend("IO error: While lock file: /var/lib/subtensor/chains/bittensor/db/full/LOCK: Resource temporarily unavailable")))

This error suggests a potential race condition in the database locking mechanism. A similar issue has been reported in the Substrate community, as seen in this GitHub issue.

We need to investigate the root cause of this issue and determine whether it is related to the version of Substrate we are using. Additionally, we should verify whether upgrading to v1.1.0 resolves the problem.
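For illustration, here is a minimal sketch (not subtensor code; it assumes the `fs2` crate and a made-up `/tmp` path) of the advisory file locking that RocksDB-style LOCK files rely on. The second handle to try the lock fails with exactly this "Resource temporarily unavailable" (EWOULDBLOCK) error:

```rust
// Minimal sketch (not subtensor code) of advisory file locking, using the
// `fs2` crate; the path is made up. flock-style locks belong to the open
// file description, so even two handles in one process will contend.
use fs2::FileExt;
use std::fs::OpenOptions;

fn main() -> std::io::Result<()> {
    std::fs::create_dir_all("/tmp/lock-demo")?;
    let path = "/tmp/lock-demo/LOCK";

    // First "node": takes the exclusive lock and holds it.
    let first = OpenOptions::new().create(true).write(true).open(path)?;
    first.try_lock_exclusive()?;

    // Second "node" (or a pm2 respawn racing the dying process): tries the
    // same lock while it is still held...
    let second = OpenOptions::new().write(true).open(path)?;
    match second.try_lock_exclusive() {
        // ...and gets EWOULDBLOCK: "Resource temporarily unavailable".
        Err(e) => eprintln!("second lock attempt failed: {e}"),
        Ok(()) => println!("unexpectedly acquired the lock"),
    }
    Ok(())
}
```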

Acceptance Criteria

Tasks

Additional Considerations

Related Links

open-junius commented 4 months ago

I can reproduce the same error locally; the message is "Error: Service(Client(Backend("IO error: While lock file: /tmp/alice/chains/bittensor/db/full/LOCK: Resource temporarily unavailable")))".

How to reproduce (a scripted version follows below):

  1. Start the Alice node with base path /tmp/alice.
  2. Start the Bob node with the same folder.
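A scripted version of the reproduction, as a rough sketch only: the binary path is an assumption, while --alice, --bob, and --base-path are the standard Substrate (sc-cli) dev flags.

```rust
// Hypothetical reproduction harness; the binary path is an assumption,
// the flags are standard Substrate dev flags.
use std::process::Command;
use std::{thread, time::Duration};

fn main() -> std::io::Result<()> {
    // First node opens the RocksDB at /tmp/alice and takes the LOCK file.
    let mut alice = Command::new("./target/release/node-subtensor")
        .args(["--alice", "--base-path", "/tmp/alice"])
        .spawn()?;
    thread::sleep(Duration::from_secs(5)); // let it open the database

    // Second node pointed at the SAME base path: it should exit with
    // "IO error: While lock file: ...LOCK: Resource temporarily unavailable".
    let bob = Command::new("./target/release/node-subtensor")
        .args(["--bob", "--base-path", "/tmp/alice"])
        .status()?;
    println!("bob exited with: {bob}");

    alice.kill()
}
```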
open-junius commented 4 months ago

Next step: check whether the DB is still held open by a zombie process on the validator node.
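One way to run that check, as a rough Linux-only sketch: match the LOCK file's inode against /proc/locks to find the PID still holding it (the path is the one from the log above).

```rust
// Rough sketch: find which PID holds a lock on the DB's LOCK file by
// matching its inode against /proc/locks. Linux-only.
use std::fs;
use std::os::unix::fs::MetadataExt;

fn main() -> std::io::Result<()> {
    let lock_path = "/var/lib/subtensor/chains/bittensor/db/full/LOCK";
    let inode = fs::metadata(lock_path)?.ino().to_string();

    for line in fs::read_to_string("/proc/locks")?.lines() {
        // /proc/locks fields: id, class, mode, type, pid, maj:min:inode, start, end
        let fields: Vec<&str> = line.split_whitespace().collect();
        let holds = fields
            .get(5)
            .and_then(|dev| dev.split(':').nth(2))
            == Some(inode.as_str());
        if holds {
            println!("held by pid {}: {}", fields[4], line);
        }
    }
    Ok(())
}
```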

distributedstatemachine commented 4 months ago

@open-junius thanks a lot for this. I would suggest we close this for now, as I am fairly certain this error is due to pm2 trying to restart a failed node too early: the dying process still holds the lock on the db folder, hence the error. If so, pm2's restart_delay and/or kill_timeout settings should give the old process enough time to release the lock before the replacement starts.