Open olivier-lam opened 4 months ago
Hi @olivier-lam what this means is that Neo4j tries to mark a certain (large) string property value record as deleted, one which is already marked as deleted. Not sure why this happens; it should of course never happen. Would it be possible to take a look at the transaction log files for this db (i.e. files in data/transactions/neo4j/*), and also data/databases/neo4j/neostore.propertystore.db.strings.id? If so then it may be possible to figure out the root cause.
Hi @tinwelint,
Thanks for taking my issue; I hope we will find out what is happening. I should be able to consult those files around 4 PM (Paris time), because yesterday I relaunched the full ingestion of data and it is still running.
Could you tell me how I can consult these files? Last time I tried to read data/databases/neo4j/* it seemed to me that these files were encoded or stored in a proprietary way. Also, what should I look for in data/databases/neo4j/neostore.propertystore.db.strings.id?
Hi @tinwelint
As expected, the issue was reproduced a few hours ago. Could you help me read the files you mentioned earlier? How can I read this file: data/databases/neo4j/neostore.propertystore.db.strings.id?
Below is a screenshot of the transactions folder.
Hello guys,
Any suggestions?
Thanks
@olivier-lam the files are in a binary format, so tools are required to read them. Do you want to be able to read those files yourself? There's a chance I can figure out what's causing this if you could somehow send me those files, or make them available somewhere. And yes, it's those neostore.transaction.db.xyz files and the neostore.propertystore.db.strings.id file that would be of interest for this.
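If you just want to confirm what those store files contain before deciding whether you can share them, a plain hex dump is enough. A minimal sketch in Python (no Neo4j tooling, and no assumptions about the internal record layout — it only shows raw bytes):

```python
# Illustrative only: dump the first bytes of a binary store file in hex.
# The file layout is internal to Neo4j and not decoded here; this just
# lets you confirm the file is binary and eyeball it for readable text.
import sys

def hexdump(path: str, length: int = 256) -> str:
    with open(path, "rb") as f:
        data = f.read(length)
    lines = []
    for off in range(0, len(data), 16):
        chunk = data[off:off + 16]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{off:08x}  {hex_part:<47}  {ascii_part}")
    return "\n".join(lines)

if __name__ == "__main__" and len(sys.argv) > 1:
    print(hexdump(sys.argv[1]))
```

Run it as `python hexdump.py data/databases/neo4j/neostore.propertystore.db.strings.id` to see whether any business data (readable strings) is visible in the bytes.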
@tinwelint Currently, I do not think I can send you these files due to our privacy rules. Could you tell me which tools I can use to open this kind of file, and what I should look for to figure out the issue? Today I found tools for version 3.5 but none for my current version 5.x, and I have not yet tried using those tools on our files.
Thanks for your help.
@tinwelint Could you indicate a way to read those files? I can then check whether there is business data inside them before deciding if I can send them to you.
@olivier-lam the transaction logs contain all changes that are stored in the database, and as such contain business data. What I'd look for is which transaction tries to delete that certain string value record after another transaction has already deleted it. Code for this exists in Community Edition, but there are higher-level tools (that aren't publicly available, though) to instantiate the right components involved in doing this reading. Essentially it reads the transaction commands and prints them in a somewhat human-readable form.
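Short of those internal tools, a very crude first pass is to scan the binary transaction log files for the byte pattern of the record ID from the stacktrace. A heuristic sketch (purely illustrative — IDs may be encoded differently inside the actual commands, so hits only tell you where to look, not what happened):

```python
# Crude heuristic, not a real transaction-log parser: scan the binary
# neostore.transaction.db.* files for a record ID, encoded as both a
# 4-byte and an 8-byte little-endian integer. A hit only means the byte
# pattern occurs there; real analysis needs Neo4j's own log reader.
import glob
import struct

def find_id_occurrences(pattern: str, record_id: int) -> dict[str, list[int]]:
    needles = [struct.pack("<I", record_id & 0xFFFFFFFF),
               struct.pack("<Q", record_id)]
    hits: dict[str, list[int]] = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as f:
            data = f.read()
        offsets = []
        for needle in needles:
            start = 0
            while (pos := data.find(needle, start)) != -1:
                offsets.append(pos)
                start = pos + 1
        if offsets:
            hits[path] = sorted(set(offsets))
    return hits
```

For example, `find_id_occurrences("data/transactions/neo4j/neostore.transaction.db.*", the_id)` would show which log files mention the ID at all, which narrows down where the two conflicting deletes were written.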
Now that I look at the stacktrace and the specific ID, I see that it's very, very close to a "reserved" ID that has some legacy meaning and gets skipped, because internally it means "null". It could be that logic around this reserved ID is somehow faulty and causes this. Do you have the stacktraces for the other failures that you ran into? Perhaps it's the same ID, and if so I'd say I can try to reproduce your problem. At least I can give this a go first!
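For intuition only (this is a toy model, not Neo4j's actual code): record-ID allocators that reuse freed IDs typically skip a reserved sentinel value that internally means "null". If the logic near that sentinel ever hands the same ID out twice, the backing record eventually gets deleted twice, which surfaces exactly as a "record already marked as deleted" error:

```python
# Conceptual sketch only, not Neo4j's implementation: a free-list ID
# allocator with a reserved sentinel ID (a hypothetical value) that means
# "null" and must never be handed out. Correct code skips the sentinel;
# a bug near it that hands an ID out twice later shows up as a double
# delete of the same record.
RESERVED_ID = 0xFFFFFFFF  # hypothetical sentinel meaning "null"

class IdAllocator:
    def __init__(self) -> None:
        self._next = 0
        self._free: list[int] = []
        self._live: set[int] = set()

    def allocate(self) -> int:
        while True:
            rid = self._free.pop() if self._free else self._advance()
            if rid == RESERVED_ID:  # correct behaviour: skip the sentinel
                continue
            self._live.add(rid)
            return rid

    def _advance(self) -> int:
        rid = self._next
        self._next += 1
        return rid

    def free(self, rid: int) -> None:
        if rid not in self._live:
            # This is the class of failure in the stacktrace: the record
            # is already marked as deleted.
            raise RuntimeError(f"double free of record {rid}")
        self._live.remove(rid)
        self._free.append(rid)
```

The interesting part is that the skip only triggers in the "high ranges" of the ID space, which would explain why the problem appears only after hundreds of millions of records.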
A workaround in the meantime is to try the new storage format (block format) and see if the problem goes away there, although it's only available in Enterprise Edition. See https://neo4j.com/docs/operations-manual/current/database-internals/store-formats/
@tinwelint By the way, we also hit the "ReservedID" issue a few times some weeks ago. We thought it was related to stopping our instance with a timeout exception, which is why we decided to reimport all our data from scratch. That seemed to work pretty well up to around 300M nodes, but over time (maybe with many more nodes) we hit the issue with the stack trace in my first message.
Concerning "Do you have the stacktrace for the other failures that you ran into": what do you mean? I did not get another error at the same time. Maybe I am missing something.
For the new storage format, we do not have a licence. I will take a temporary one (30-day trial). By the way, I am trying to ingest all the data with version 5.14; I should have results around 6 PM.
In the meantime, is it possible to get these higher-level tools to troubleshoot, please?
Have a nice day.
> Concerning "Do you have the stacktrace for the other failures that you ran into": what do you mean? I did not get another error at the same time. Maybe I am missing something.

As I understood it you got this error more than once, which is why I was asking.
I think we can sidestep the problem of the tooling and analyzing the transaction logs actually, since I just now managed to reproduce this exact issue in these high ranges for that store. I'll get to work on a fix and let you know how it goes, OK?
Very good news! I would very much appreciate knowing the details once the bug is fixed. I have looked at Neo4j's code but it is quite hard to understand with my current knowledge of Neo4j.
If still needed I can send you the other stacktraces, but it seems to me that they are the same each time.
Hello @tinwelint,
Is this issue related to Neo4j Enterprise Edition too?
Thank you for your help, we appreciate it. And if you want, we can beta-test your fix :-)
@nizarsalhaji94 yes it affects Enterprise Edition too for versions 5.19, 5.20 and 5.21, but not on 5.21 for databases created in the block format.
Thanks @tinwelint,
Good to know. We have started ingesting our data with Neo4j 5.14 Community Edition and will tell you if we hit the same errors.
Hey @tinwelint, by the way, I confirm that on 5.14 and 5.18 I have not reproduced the issue. In order to plan how we handle our production environment: do you think the fix will be delivered this week, or will it need a few weeks to stabilize?
Again thanks a lot for your help.
Hello again, great to have that confirmed. The fix is merged and will be included in the soon upcoming 5.22!
Hi @tinwelint Very good news, thank you very much for your help. May I know whether there will be any kind of procedure to fix the issue (fix the data, then install 5.22), or whether just migrating to 5.22 when it is available will be enough?
olivier
There's no included fix for databases that have already hit the issue, except this manual procedure:
If there have been write transactions after the recovery without doing the other steps first, then the db may have produced internally inconsistent records. Hope that helps! And sorry for the inconvenience regarding this issue.
Dear @tinwelint
It's been a long time :p. We just saw that Neo4j released version 5.22 today. Can you confirm that our issue is fixed and that the current ticket should also be closed? If it is fixed, could you share the commit hash so we can look at where you fixed it?
Thanks a lot for your feedback.
olivier
Hi, we have a critical issue when we ingest data into our Neo4j instance. We start ingestion on a fresh Neo4j instance and succeed in ingesting around 300 million nodes in 20-24 hours, but at one point an error occurs near the end of our ingestion (see stacktrace attached).
- Neo4j version: 5.20.0
- Operating system: Linux
- API/Driver: Neo4j Java Driver
- Indexes: around 700 different indexes
- Disk: still 700 GB free
- Memory: there does not seem to be any memory-related issue

We parallelize our ingestion with 300 threads, and our Cypher query looks something like:

```
CALL {
  MERGE ...
  SET ...
} IN TRANSACTIONS OF 1 ROW ON ERROR FAIL
RETURN *
```
Steps to reproduce: we do not really have reliable steps to reproduce; we just import a lot of data.
Stacktrace stracktrace_neo4j.txt
Do you think it could be a race condition? But then we should have hit it earlier, during the previous 20 hours... It seems the issue occurs in the merge process on indexes.
Thank you so much for the help! We are completely blocked by this issue.