neo-project / neo

Leveldb exception handle #3356

Closed (vncoelho closed 2 days ago)

vncoelho commented 2 months ago

Describe the bug: Run a setup with 4 nodes running a private net.

To Reproduce: Start the nodes and they will crash almost instantaneously.

Error

dotnet: ./db/dbformat.cc:16: uint64_t leveldb::PackSequenceAndType(uint64_t, leveldb::ValueType): Assertion `seq <= kMaxSequenceNumber' failed.
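
For context on what the failed assertion guards: LevelDB packs a sequence number and a value type into a single 64-bit integer, with the sequence number limited to 56 bits. The following is a minimal Python sketch of that check, not the LevelDB code itself; the Python function and constant names are stand-ins chosen for illustration.

```python
# Sketch of the check behind `seq <= kMaxSequenceNumber`.
# Names are Python stand-ins for the LevelDB constants, used for illustration.
K_MAX_SEQUENCE_NUMBER = (1 << 56) - 1  # largest sequence number that fits in 56 bits
K_TYPE_DELETION = 0x0
K_TYPE_VALUE = 0x1


def pack_sequence_and_type(seq: int, value_type: int) -> int:
    """Pack a 56-bit sequence number and an 8-bit type into one 64-bit value."""
    if not 0 <= seq <= K_MAX_SEQUENCE_NUMBER:
        # In LevelDB this condition is an assert, which aborts the process.
        raise ValueError(f"seq {seq} exceeds kMaxSequenceNumber")
    if value_type not in (K_TYPE_DELETION, K_TYPE_VALUE):
        raise ValueError(f"unknown value type {value_type}")
    return (seq << 8) | value_type


def unpack_sequence_and_type(packed: int) -> tuple[int, int]:
    """Split a packed value back into (sequence number, value type)."""
    return packed >> 8, packed & 0xFF
```

An out-of-range sequence number trips the check; where such a value came from in this setup is not established by the log alone.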
cschuchardt88 commented 2 months ago

Need more information.

vncoelho commented 2 months ago

> Need more information.

description updated

shargon commented 2 months ago

It seems that the data is corrupted. Is it a fresh installation?

vncoelho commented 2 months ago

fresh with master

vncoelho commented 2 months ago

> It seems that the data is corrupted. Is it a fresh installation?

Probably due to the unhandled exception management feature, but I have not investigated further yet. It is easy to reproduce. Just run a node.

Hecate2 commented 2 months ago

Is it because the 4 nodes are using the same directory for leveldb?
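
One way to rule this out is to resolve each node's storage directory and look for duplicates. A minimal Python sketch, assuming you can list each node's data directory by hand; `find_shared_data_dirs`, the node names, and the paths are hypothetical and not part of neo-cli.

```python
import os
from collections import defaultdict


def find_shared_data_dirs(node_dirs: dict[str, str]) -> dict[str, list[str]]:
    """Return resolved directories that are used by more than one node.

    node_dirs maps a node name to its storage directory (hypothetical input).
    """
    by_path = defaultdict(list)
    for node, path in node_dirs.items():
        # realpath collapses "..", "." and symlinks so equal dirs compare equal.
        by_path[os.path.realpath(path)].append(node)
    return {path: sorted(nodes) for path, nodes in by_path.items() if len(nodes) > 1}


# Example with made-up paths: node1 and node2 point at the same directory.
shared = find_shared_data_dirs({
    "node1": "/tmp/neo/a/Data",
    "node2": "/tmp/neo/a/../a/Data",
    "node3": "/tmp/neo/b/Data",
})
for path, nodes in shared.items():
    print(f"{path} is shared by {', '.join(nodes)}")
```

An empty result means every node resolved to its own directory, which would point away from this explanation.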

cschuchardt88 commented 2 months ago

Based on the source code referenced in your error, it looks like your database is corrupt. Try deleting it to see if the problem goes away.

It has to do with seeking with the KeyComparator; the source code says:

// User key has become shorter physically, but larger logically.
// Tack on the earliest possible number to the shortened user key.

vncoelho commented 2 months ago

> Based on the source code referenced in your error, it looks like your database is corrupt. Try deleting it to see if the problem goes away.
>
> It has to do with seeking with the KeyComparator; the source code says:
>
> // User key has become shorter physically, but larger logically.
> // Tack on the earliest possible number to the shortened user key.

No, @cschuchardt88, it is a recently introduced problem.

Jim8y commented 2 months ago

It's because you run too many nodes on the same machine that all use leveldb. It is not a core problem. This happens every time you run multiple nodes on the same machine.

vncoelho commented 2 months ago

> It's because you run too many nodes on the same machine that all use leveldb. It is not a core problem. This happens every time you run multiple nodes on the same machine.

No. This is not true in my setup.

vncoelho commented 2 months ago

Too many complaints and no real investigation of a simple scenario. The cause is that we now crash the clients on an unhandled exception.

Without minimal tests, neo-cli will go unused until we implement the exception handling and find the BASIC problems.

vncoelho commented 2 months ago

https://github.com/neo-project/neo/pull/3366#issuecomment-2197371062

Jim8y commented 2 months ago

> Too many complaints and no real investigation of a simple scenario.

You can say this once you have located the real problem.

We have been working like this for many years, and all of a sudden it's all wrong and we have all become complainers? And our work is a product of too little investigation? But we definitely tested it and checked it everywhere, and for this one, I ran the node. And I asked NGD for help testing it as well.

But the code was there and the PR was there; you were able to test, review, and comment. We followed your suggestion to leave it open for a while for review. Actually, that PR was open for a week before I collected sufficient review approvals.

Before we release any new version, we can still correct any problem, so chill. Being a team means that even if someone causes a problem, someone else can correct it, doesn't it?

> The cause is that we now crash the clients on an unhandled exception.

The funny part is that we should have crashed on an unhandled exception all along, unless we have set plugins to ignore unhandled exceptions. I would say that the PR found an issue, if anything, rather than introduced one.
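
The distinction being argued here, whether a fault is swallowed or allowed to stop the process, can be illustrated with a small Python sketch. This is not neo's implementation; `Policy`, `run_handler`, and the policy names are hypothetical and only model the idea.

```python
from enum import Enum
from typing import Callable


class Policy(Enum):
    """Hypothetical policies for a fault raised inside a handler."""
    IGNORE = "ignore"        # swallow the fault and keep running
    STOP_NODE = "stop_node"  # let the fault propagate and stop the process


def run_handler(handler: Callable[[], None], policy: Policy) -> str:
    """Run a handler and apply the chosen policy if it raises."""
    try:
        handler()
        return "ok"
    except Exception as exc:
        if policy is Policy.IGNORE:
            # The underlying fault still happened; it is only hidden.
            return f"ignored: {exc}"
        raise


def failing_handler() -> None:
    # Stand-in for a storage error surfacing during normal operation.
    raise RuntimeError("storage error")
```

Under `Policy.IGNORE` the same `failing_handler` call returns normally, while under `Policy.STOP_NODE` it raises, so the same underlying fault looks harmless in one configuration and fatal in the other.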

BTW, I admit that when I run the test on my machine, I run a single node at most. I don't have a 4-node private net test environment. I will create one.

AnnaShaleva commented 2 months ago

> It's because you run too many nodes on the same machine that all use leveldb

It was not a problem for me either. I used NeoBench to run 4-node and 7-node privnets with Dockerized C# nodes on my single machine, and it was OK.

> I don't have a 4-node private net test environment.

I'd suggest using NeoBench, but it's not yet updated to use the fresh monorepo; we have https://github.com/nspcc-dev/neo-bench/issues/175 for that.

vncoelho commented 2 months ago

> > It's because you run too many nodes on the same machine that all use leveldb
>
> It was not a problem for me either. I used NeoBench to run 4-node and 7-node privnets with Dockerized C# nodes on my single machine, and it was OK.
>
> > I don't have a 4-node private net test environment.
>
> I'd suggest using NeoBench, but it's not yet updated to use the fresh monorepo; we have https://github.com/nspcc-dev/neo-bench/issues/175 for that.

Are you using leveldb? Maybe it was rocksdb instead.

Were your experiments with the master branch? Mine only runs now after reverting the exception-handling crash.

cschuchardt88 commented 2 months ago

@vncoelho Are you sure you didn't run out of storage (disk space)? Why don't you give #3355 a try?
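
A quick way to check the disk-space question from inside the container is Python's `shutil.disk_usage`. A minimal sketch; `has_free_space`, the path, and the 1 GiB threshold are illustrative assumptions, not values taken from neo.

```python
import shutil

ONE_GIB = 1024 ** 3  # illustrative threshold, not a neo requirement


def has_free_space(path: str, min_free_bytes: int = ONE_GIB) -> bool:
    """Return True if the filesystem holding `path` has enough free bytes."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= min_free_bytes


# Check the filesystem that holds the current working directory.
print("enough free space:", has_free_space("."))
```

Pointing `path` at the node's data directory checks the volume the database actually lives on, which matters when the container mounts storage separately.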

cschuchardt88 commented 2 months ago

Try doing ./neo-cli /repair or neo-cli.exe /repair

vncoelho commented 2 months ago

> Try doing ./neo-cli /repair or neo-cli.exe /repair

This is not the case, @cschuchardt88.

The testing environment is the same with and without the PR reverted. The problem is probably that leveldb recovers from the crash, but the PR that handles exceptions detects it and then crashes the client.

The behavior may not be wrong. But this should have been tested before merging that PR, because the problem is easy to see. Can you verify that, @superboyiii?

cschuchardt88 commented 2 months ago

Try with this version of LevelDbStore #3274

Jim8y commented 2 months ago

> > It's because you run too many nodes on the same machine that all use leveldb
>
> It was not a problem for me either. I used NeoBench to run 4-node and 7-node privnets with Dockerized C# nodes on my single machine, and it was OK.
>
> > I don't have a 4-node private net test environment.

I would love to argue, but I am not an expert on leveldb. All I can say is that it happened, and it is apparently a leveldb exception, not related to the core.

Possible reasons could be: platform, OS, version, dependencies. I would suggest trying rocksdb and memorydb as well.

vncoelho commented 1 month ago

> > > It's because you run too many nodes on the same machine that all use leveldb
> >
> > It was not a problem for me either. I used NeoBench to run 4-node and 7-node privnets with Dockerized C# nodes on my single machine, and it was OK.
> >
> > > I don't have a 4-node private net test environment.
>
> I would love to argue, but I am not an expert on leveldb. All I can say is that it happened, and it is apparently a leveldb exception, not related to the core.
>
> Possible reasons could be: platform, OS, version, dependencies. I would suggest trying rocksdb and memorydb as well.

So this error, without the exception handling, was fine and it was safe to run a node? And now, after the PR, the node is broken, right? Is that not a core problem?

cschuchardt88 commented 1 month ago

It's a corruption problem.

We need more information on your setup:

  1. Are you using a container?
  2. What version of leveldb do you have?
  3. What CI build are you using?
  4. What filesystem?
  5. What operating system?
  6. What CPU arch?
  7. Have you tried leveldb `repair`?
  8. What thread limit does your OS impose?
  9. Have you run a filesystem repair tool?
  10. Does this happen on other setups?
  11. What's your node setup?
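
Several of these answers (operating system, CPU architecture, container or not) can be gathered with a short script. A minimal Python sketch using only the standard library; `collect_setup_report` is a hypothetical helper, and the `/.dockerenv` check is a common heuristic for Docker, not a guaranteed test.

```python
import os
import platform


def collect_setup_report() -> dict[str, object]:
    """Gather basic environment facts that are useful in a bug report."""
    return {
        "os": platform.system(),           # e.g. "Linux"
        "os_release": platform.release(),  # kernel or OS release string
        "cpu_arch": platform.machine(),    # e.g. "x86_64" or "aarch64"
        "cpu_count": os.cpu_count(),       # logical CPUs visible to the process
        # Heuristic only: Docker usually creates this file in the container.
        "docker_hint": os.path.exists("/.dockerenv"),
    }


for key, value in collect_setup_report().items():
    print(f"{key}: {value}")
```

The leveldb version, filesystem type, and node configuration still need to be reported by hand, since they depend on how the image and the node were built.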
vncoelho commented 1 month ago

> 1. Are you using a `container`?

Yes

> 2. What `version` of `leveldb` do you have?

The plugin is compiled from master, with libleveldb-dev from apt-get on mcr.microsoft.com/dotnet/aspnet:8.0.3-jammy.

It is all in a Dockerfile, in a container with the number of threads it needs to run safely. It usually runs a mainnet node with the resources it has available. It runs perfectly without the commit that I said should be reverted until fixed.

The problem could, of course, be due to some limitation on the leveldb side. But that should have been handled before the PR was merged. Furthermore, in my last tests rocksdb was also broken.

The only way to run a node nowadays is MemoryStore.

vncoelho commented 1 month ago

Still crashing. I thought it was solved, but my config was using "MemoryStore" instead.

The problem persists even after updating all dotnet libraries at build time and at run time.

RocksDb is also corrupted, but perhaps for a different reason.

Jim8y commented 1 month ago

I will set up a multi-node network on my machine and check it.

gsmachado commented 1 month ago

not entirely related, but see https://github.com/neo-project/neo-express/issues/455

cschuchardt88 commented 2 days ago

fixed