neo4j / neo4j

Graphs for Everyone
http://neo4j.com
GNU General Public License v3.0

Critical Error - Illegal addition ID state for range on GBPTree Merge #13478

Open olivier-lam opened 4 months ago

olivier-lam commented 4 months ago

Hi, we have a critical issue when we ingest data into our Neo4j instance. We start ingestion from a fresh Neo4j instance and manage to ingest around 300 million nodes in 20-24 hours, but at some point towards the end of our ingestion an error occurs (see the stack trace attached below).

- Neo4j version: 5.20.0
- Operating system: Linux
- API/Driver: Neo4j Java Driver
- Additional information: we parallelize our ingestion across 300 threads, and our Cypher query looks roughly like `CALL { MERGE ... SET ... } IN TRANSACTIONS OF 1 ROW ON ERROR FAIL RETURN *` (see the sketch after this list)
- We also have around 700 different indexes
- Disk: still about 700 GB free
- Memory: there does not seem to be a memory-related issue
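
For context, here is a minimal sketch of what such a parallelized ingestion with the Java driver could look like. The connection details, the `Entity` label, the property names, the `loadBatches()` helper and the MERGE/SET body are all hypothetical placeholders; only the `CALL { ... } IN TRANSACTIONS OF 1 ROW ON ERROR FAIL RETURN *` shape is taken from the description above.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class ParallelIngest {

    // The MERGE/SET body, label and property names are placeholders; only the
    // IN TRANSACTIONS clause mirrors the query shape described in this report.
    private static final String INGEST_QUERY =
            "UNWIND $rows AS row "
          + "CALL { WITH row MERGE (n:Entity {id: row.id}) SET n += row.props } "
          + "IN TRANSACTIONS OF 1 ROW ON ERROR FAIL "
          + "RETURN *";

    public static void main(String[] args) throws InterruptedException {
        Driver driver = GraphDatabase.driver("neo4j://localhost:7687",
                AuthTokens.basic("neo4j", "password"));

        // 300 concurrent ingestion threads, as described above.
        ExecutorService pool = Executors.newFixedThreadPool(300);

        for (List<Map<String, Object>> batch : loadBatches()) { // loadBatches() is hypothetical
            pool.submit(() -> {
                // CALL { ... } IN TRANSACTIONS needs an implicit (auto-commit)
                // transaction, which Session.run provides.
                try (Session session = driver.session()) {
                    session.run(INGEST_QUERY, Values.parameters("rows", batch)).consume();
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        driver.close();
    }

    private static Iterable<List<Map<String, Object>>> loadBatches() {
        return List.of(); // placeholder: plug in the real data source here
    }
}
```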

Steps to reproduce: we do not really have reliable steps to reproduce; we just import a lot of data.

Stacktrace: stracktrace_neo4j.txt

Do you think it could be a race condition? If so, we would have expected to hit it earlier during the previous 20 hours... The issue seems to occur in the merge process on the indexes.

Thank you so much for the help! We are completely blocked by this issue.

tinwelint commented 4 months ago

Hi @olivier-lam, what this means is that Neo4j tries to mark a certain (large) string property value record as deleted, one which is already marked as deleted. I'm not sure why this happens; it should of course never happen. Would it be possible to take a look at the transaction log files for this db (i.e. the files in data/transactions/neo4j/*) and also at data/databases/neo4j/neostore.propertystore.db.strings.id? If so, it may be possible to figure out the root cause of it.

olivier-lam commented 4 months ago

Hi @tinwelint,

Thanks for taking up my issue; I hope we will find out what is happening. I should be able to look at those files around 4 PM (Paris time), because I relaunched the full data ingestion yesterday and it is still running.

Could you tell me how I can read these files? Last time I tried to read "data/databases/neo4j/*", the files appeared to be encoded or stored in a proprietary format. Also, what should I look for in data/databases/neo4j/neostore.propertystore.db.strings.id?

olivier-lam commented 4 months ago

Hi @tinwelint

As expected, the issue reproduced a few hours ago. Could you help me read the files you mentioned earlier? How can I read this file: data/databases/neo4j/neostore.propertystore.db.strings.id?

Below is a screenshot of the transactions folder:

[screenshot: transactions folder]
nizarsalhaji94 commented 3 months ago

Hello guys,

Any suggestions?

Thanks

tinwelint commented 3 months ago

@olivier-lam the files are in some binary format, so tools are required to read them. Do you want to be able to read those files yourself? There's a chance I can figure out what's causing this if you could somehow send me those files, or make them available somewhere. And yes it's those neostore.transaction.db.xyz files and the neostore.propertystore.db.strings.id file that would be of interest for this.

olivier-lam commented 3 months ago

@tinwelint Currently I do not think I can send you these files, because of our privacy rules. Could you point me to the tools I can use to open this kind of file, and tell me what I should look for to figure out what the issue could be? Today I found tools for version 3.5 but none for my current version 5.x, and I have not yet tried those tools on our files.

Thanks for your help.

olivier-lam commented 3 months ago

@tinwelint Could you suggest a way to read those files? I can then check whether they contain business data and decide if I can send them to you.

tinwelint commented 3 months ago

@olivier-lam the transaction logs contain all changes that are stored in the database, and as such they contain business data. What I'd look for is which transaction tries to delete that particular string value record after another transaction has already deleted it. Code for this exists in Community Edition, but there are higher-level tools (that aren't publicly available, though) to instantiate the right components involved in doing this reading. Essentially it reads the transaction commands and prints them in a somewhat human-readable form.

Now that I look at the stacktrace and the specific ID I see that it's very very close to a "reserved" ID that has some legacy meaning and that gets skipped, due to internally meaning "null". It could be logic around this reserved ID that somehow is faulty and causes this. Do you have the stacktrace for the other failures that you ran into? Perhaps it's the same ID and if so I'd say I can try to reproduce your problem. At least I can give this a go first!
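
To make the "reserved ID" remark above concrete, here is a purely illustrative sketch, not Neo4j's actual code: the assumption is that one legacy value near the top of the old 32-bit ID range is reserved to mean "null" and is skipped during allocation, so faulty logic around that boundary could explain a record being freed twice.

```java
public class ReservedIdSketch {

    // Assumption: the reserved value is the unsigned 32-bit "-1" (0xFFFFFFFF),
    // which legacy record formats used internally to mean "null".
    static final long RESERVED_ID = 0xFFFFFFFFL;

    static boolean isReservedId(long id) {
        return id == RESERVED_ID;
    }

    // A sequential allocator has to skip the reserved value; an off-by-one or a
    // missed skip in logic like this is the kind of bug being hypothesised above.
    static long nextId(long candidate) {
        return isReservedId(candidate) ? candidate + 1 : candidate;
    }

    public static void main(String[] args) {
        System.out.println(isReservedId(0xFFFFFFFFL)); // true
        System.out.println(nextId(0xFFFFFFFFL));       // 4294967296
    }
}
```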

A solution in the meantime is to try the new storage format (block format) and see if it works there, although it's only available in Enterprise Edition. See https://neo4j.com/docs/operations-manual/current/database-internals/store-formats/
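
As a rough illustration of how one might try that suggestion from the Java driver, the sketch below creates a new database in the block format by running an administration command against the `system` database. The database name is made up, and the `storeFormat` option (as well as the alternative `db.format=block` setting in neo4j.conf) should be checked against the store-formats page linked above for your exact 5.x version; Enterprise Edition is assumed.

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.SessionConfig;

public class CreateBlockFormatDatabase {
    public static void main(String[] args) {
        Driver driver = GraphDatabase.driver("neo4j://localhost:7687",
                AuthTokens.basic("neo4j", "password"));

        // Administration commands run against the system database.
        try (Session session = driver.session(SessionConfig.forDatabase("system"))) {
            // "blockdb" is a made-up name; the storeFormat option comes from the
            // store-formats documentation linked above (Enterprise Edition only).
            session.run("CREATE DATABASE blockdb OPTIONS {storeFormat: 'block'}").consume();
        }
        driver.close();
    }
}
```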

olivier-lam commented 3 months ago

@tinwelint By the way, we also ran into a "ReservedID" issue a few weeks ago. We thought it was related to stopping our instance with a timeout exception, which is why we decided to reimport all our data from scratch. That seemed to work pretty well up to around 300M nodes, but over time (maybe with a lot more nodes) we hit the issue with the stack trace from my first message.

Concerning "Do you have the stacktrace for the other failures that you ran into" what do you mean, I do not have an another error on same time? Maybe I miss something.

For the new storage format, we do not have a licence; I will get a temporary one (30-day trial). By the way, I am trying to ingest all the data with version 5.14 and should have results around 6 PM.

In the meantime, is it possible to get these higher-level tools to troubleshoot, please?

Have a nice day.

tinwelint commented 3 months ago

> @tinwelint By the way, we also ran into a "ReservedID" issue a few weeks ago. We thought it was related to stopping our instance with a timeout exception, which is why we decided to reimport all our data from scratch. That seemed to work pretty well up to around 300M nodes, but over time (maybe with a lot more nodes) we hit the issue with the stack trace from my first message.

> Concerning "Do you have the stacktrace for the other failures that you ran into": what do you mean? I do not have another error at the same time; maybe I am missing something.

As I understood it, you got this error more than once, which is why I was asking.

> For the new storage format, we do not have a licence; I will get a temporary one (30-day trial). By the way, I am trying to ingest all the data with version 5.14 and should have results around 6 PM.

> In the meantime, is it possible to get these higher-level tools to troubleshoot, please?

> Have a nice day.

Actually, I think we can sidestep the problem of tooling and analyzing the transaction logs, since I just now managed to reproduce this exact issue in these high ID ranges for that store. I'll get to work on a fix and let you know how it goes, OK?

olivier-lam commented 3 months ago

Very good news! If you have some details, I would very much appreciate learning them once the bug is fixed. I have looked at Neo4j's code, but it is quite hard to understand with my current level of Neo4j knowledge.

If it is still needed, I can send you the other stack traces, but it seems to me they are the same each time.

nizarsalhaji94 commented 3 months ago

Hello @tinwelint,

Does this issue also affect Neo4j Enterprise Edition?

Thank you for your help, we appreciate it. And if you want, we can beta-test your fix :-)

tinwelint commented 3 months ago

@nizarsalhaji94 yes it affects Enterprise Edition too for versions 5.19, 5.20 and 5.21, but not on 5.21 for databases created in the block format.

nizarsalhaji94 commented 3 months ago

Thanks @tinwelint,

Good to know. We have started ingesting our data with Neo4j 5.14 Community Edition and will tell you if we hit the same errors.

olivier-lam commented 3 months ago

Hey @tinwelint, by the way, I confirm that on 5.14 and 5.18 I have not reproduced my issue. To help us plan for our production environment: do you think the fix will be delivered this week, or will it need a few weeks to stabilize?

Again thanks a lot for your help.

tinwelint commented 3 months ago

Hello again, great to have that confirmed. The fix is merged and will be included in the upcoming 5.22!

olivier-lam commented 3 months ago

Hi @tinwelint, very good news, thank you very much for your help. May I know whether there will be any kind of procedure to fix the issue (fix the data, then install 5.22), or can we just migrate to 5.22 when it becomes available?

olivier

tinwelint commented 3 months ago

There's no included fix for databases that have already hit the issue, other than this manual procedure:

If there have been write transactions after the recovery without doing the other steps first, then the db may have produced internally inconsistent records. Hope that helps! And sorry for the inconvenience regarding this issue.

olivier-lam commented 3 months ago

Dear @tinwelint

It's been a long time :p. We just saw that Neo4j released the new version 5.22 today. Can you confirm that our issue is fixed and that the current ticket can also be closed? If it is fixed, could you share the commit hash with us so we can look at where you fixed it?

Thanks a lot for your feedback.

olivier