opentensor / subtensor

Bittensor Blockchain Layer

Testnet Node Failures #444

Closed · distributedstatemachine closed this 4 months ago

distributedstatemachine commented 4 months ago

Description

We are experiencing an issue where our testnet nodes fail with the following error:

2024-05-21 00:38:27.132  INFO main sc_cli::runner: 💾 Database: RocksDb at /var/lib/subtensor/chains/bittensor/db/full    
2024-05-21 00:38:27.132  INFO main sc_cli::runner: ⛓  Native runtime: node-subtensor-143 (node-subtensor-1.tx1.au1)    
Error: Service(Client(Backend("IO error: While lock file: /var/lib/subtensor/chains/bittensor/db/full/LOCK: Resource temporarily unavailable")))

This error suggests a potential race condition in the database locking mechanism. A similar issue has been reported in the Substrate community, as seen in this GitHub issue.

We need to investigate the root cause of this issue and determine whether it is related to the version of Substrate we are using. Additionally, we should verify whether upgrading to v1.1.0 resolves the problem.
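For illustration, here is a minimal sketch (not subtensor code; it assumes the `fs2` crate and a made-up `/tmp` path) of the advisory file locking that RocksDB-style LOCK files rely on. The second handle to try the lock fails with exactly this "Resource temporarily unavailable" (EWOULDBLOCK) error:

```rust
// Minimal sketch (not subtensor code) of advisory file locking, using the
// `fs2` crate; the path is made up. flock-style locks belong to the open
// file description, so even two handles in one process will contend.
use fs2::FileExt;
use std::fs::OpenOptions;

fn main() -> std::io::Result<()> {
    std::fs::create_dir_all("/tmp/lock-demo")?;
    let path = "/tmp/lock-demo/LOCK";

    // First "node": takes the exclusive lock and holds it.
    let first = OpenOptions::new().create(true).write(true).open(path)?;
    first.try_lock_exclusive()?;

    // Second "node" (or a pm2 respawn racing the dying process): tries the
    // same lock while it is still held...
    let second = OpenOptions::new().write(true).open(path)?;
    match second.try_lock_exclusive() {
        // ...and gets EWOULDBLOCK: "Resource temporarily unavailable".
        Err(e) => eprintln!("second lock attempt failed: {e}"),
        Ok(()) => println!("unexpectedly acquired the lock"),
    }
    Ok(())
}
```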

Acceptance Criteria

Tasks

Additional Considerations

Related Links

open-junius commented 4 months ago

I can reproduce the same error locally; the message is "Error: Service(Client(Backend("IO error: While lock file: /tmp/alice/chains/bittensor/db/full/LOCK: Resource temporarily unavailable")))".

How to reproduce (a scripted version follows below):

  1. Start the Alice node with base path /tmp/alice.
  2. Start the Bob node with the same folder.
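A scripted version of the reproduction, as a rough sketch only: the binary path is an assumption, while --alice, --bob, and --base-path are the standard Substrate (sc-cli) dev flags.

```rust
// Hypothetical reproduction harness; the binary path is an assumption,
// the flags are standard Substrate dev flags.
use std::process::Command;
use std::{thread, time::Duration};

fn main() -> std::io::Result<()> {
    // First node opens the RocksDB at /tmp/alice and takes the LOCK file.
    let mut alice = Command::new("./target/release/node-subtensor")
        .args(["--alice", "--base-path", "/tmp/alice"])
        .spawn()?;
    thread::sleep(Duration::from_secs(5)); // let it open the database

    // Second node pointed at the SAME base path: it should exit with
    // "IO error: While lock file: ...LOCK: Resource temporarily unavailable".
    let bob = Command::new("./target/release/node-subtensor")
        .args(["--bob", "--base-path", "/tmp/alice"])
        .status()?;
    println!("bob exited with: {bob}");

    alice.kill()
}
```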
open-junius commented 4 months ago

Next step: check whether the DB is still held open by a zombie process on the validator node.
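One way to run that check, as a rough Linux-only sketch: match the LOCK file's inode against /proc/locks to find the PID still holding it (the path is the one from the log above).

```rust
// Rough sketch: find which PID holds a lock on the DB's LOCK file by
// matching its inode against /proc/locks. Linux-only.
use std::fs;
use std::os::unix::fs::MetadataExt;

fn main() -> std::io::Result<()> {
    let lock_path = "/var/lib/subtensor/chains/bittensor/db/full/LOCK";
    let inode = fs::metadata(lock_path)?.ino().to_string();

    for line in fs::read_to_string("/proc/locks")?.lines() {
        // /proc/locks fields: id, class, mode, type, pid, maj:min:inode, start, end
        let fields: Vec<&str> = line.split_whitespace().collect();
        let holds = fields
            .get(5)
            .and_then(|dev| dev.split(':').nth(2))
            == Some(inode.as_str());
        if holds {
            println!("held by pid {}: {}", fields[4], line);
        }
    }
    Ok(())
}
```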

distributedstatemachine commented 4 months ago

@open-junius thanks a lot for this. I would suggest we close this for now, as I am fairly certain this error is due to pm2 trying to restart a failed node too early: the dying process still holds the lock on the db folder, hence the error. If so, pm2's restart_delay and/or kill_timeout settings should give the old process enough time to release the lock before the replacement starts.