Kuzu db can become inconsistent

lbuczkow commented 7 months ago

It seems kuzu database can become inconsistent and it is neither reported nor fixed by kuzu_shell in particular.

There is a kuzu database with the following structure:

CREATE NODE TABLE User(uid UUID, id INT64, PRIMARY KEY(uid)); 
CREATE NODE TABLE Note(uid UUID, id INT64, PRIMARY KEY(uid)); 
CREATE REL TABLE Links(FROM Note TO Note); 
CREATE REL TABLE NoteOf(FROM Note TO User);

The binary database files of the database are available here:

https://file.io/otuAQVrb4UHz

The database has been originally populated by importing User, Note and Links from .csv files while NoteOf has been created using a node.js script. The script has been interrupted several times. The goal of the script was to create a link between notes and users so as each note belonged to some user. In the end, the database contains: 1000 users, 1000000 notes, about 4000000 links, and is supposed to contain 1000000 NoteOf relationships.

The following query shows that no note remains unassigned:

kuzu> MATCH (n:Note) WHERE NOT EXISTS { MATCH (n)-[:NoteOf]->(:User) } RETURN count(n);
----------------
| COUNT(n._ID) |
----------------
| 0            |
----------------
(1 tuple)
(1 column)
Time: 3.65ms (compiling), 803.85ms (executing)

However, after exporting NoteOf using:

COPY (MATCH (n:Note)-[:NoteOf]->(u:User) RETURN n.uid, u.uid) TO 'note_of.csv';

it appeared that only 999990 relationships out of 1000000 has been exported.

At first I thought it was a problem with export but I imported the exported .csv files into a new database and found the missing IDs:

[743642,853676,684672,853633,992666,992667,602411,755386,831275,643211]

I applied them (plus 1, 2, 3 as some "good" IDs for testing) to the following queries against the original database:

kuzu> MATCH (n:Note) WHERE n.id in [1, 2, 3, 743642,853676,684672,853633,992666,992667,602411,755386,831275,643211] RETURN n.uid, n.id;
-------------------------------------------------
| n.uid                                | n.id   |
-------------------------------------------------
| a017c165-98ce-4f17-9761-cb822b8b2dfe | 684672 |
-------------------------------------------------
| 85f0807e-a588-4db6-866c-d683befd34cc | 755386 |
-------------------------------------------------
| e2bb50df-ccc6-471f-a7e0-a50a17341164 | 743642 |
-------------------------------------------------
| 89dd8aa3-117a-4328-b316-25866d5f4b2b | 831275 |
-------------------------------------------------
| bdf820b4-b188-41b3-b336-f9f1eb94d662 | 853633 |
-------------------------------------------------
| c6c4c6bc-22a7-446f-a838-ab65497866aa | 853676 |
-------------------------------------------------
| 5e8c1059-533c-467b-9bfd-ab7bdcf76896 | 1      |
-------------------------------------------------
| ba05b55b-3d3b-4873-a37b-397581d45e8d | 2      |
-------------------------------------------------
| 2c24f177-ebe0-4695-91f9-5a6d84f379f3 | 3      |
-------------------------------------------------
| 48c27540-c017-4bba-a86b-ac6bc8ff1ae5 | 992666 |
-------------------------------------------------
| bd199b09-53f4-402b-bd56-b9478b35c849 | 992667 |
-------------------------------------------------
| 2b25ecf7-b705-4335-b3bd-4ba867b85be6 | 602411 |
-------------------------------------------------
| 6f0e78d7-c1af-434b-b4c9-cc1da87c7ad8 | 643211 |
-------------------------------------------------
(13 tuples)
(2 columns)
Time: 0.47ms (compiling), 49.18ms (executing)

The query returned all the specified notes as expected. But the next query:

kuzu> MATCH (n:Note)-[:NoteOf]->(u:User) WHERE n.id in [1, 2, 3, 743642,853676,684672,853633,992666,992667,602411,755386,831275,643211] RETURN n.uid, n.id, u.id;
------------------------------------------------------
| n.uid                                | n.id | u.id |
------------------------------------------------------
| 5e8c1059-533c-467b-9bfd-ab7bdcf76896 | 1    | 176  |
------------------------------------------------------
| ba05b55b-3d3b-4873-a37b-397581d45e8d | 2    | 954  |
------------------------------------------------------
| 2c24f177-ebe0-4695-91f9-5a6d84f379f3 | 3    | 176  |
------------------------------------------------------
(3 tuples)
(3 columns)
Time: 1.87ms (compiling), 18.75ms (executing)

returned only "good" IDs.

Summarizing:

MATCH (n:Note) WHERE NOT EXISTS { MATCH (n)-[:NoteOf]->(:User) } RETURN count(n);

reports there are no unconnected notes, but:

MATCH (n:Note)-[:NoteOf]->(u:User) WHERE n.id in [1, 2, 3, 743642,853676,684672,853633,992666,992667,602411,755386,831275,643211] RETURN n.uid, n.id, u.id;

reveals there exist some unconnected notes.

I used kuzu_shell and kuzujs.node built from source on 23 March 21:13 (source code was pulled right before the build).

(It is not the first time I noticed inconsistencies. I also had a case when the same query returned different results, which looked like a random subset of a bigger set. Unfortunately, I have not kept the binary files.)

lbuczkow commented 7 months ago

Additional information: the node.js script created NoteOf relationships in the following way:

    const s_ids = ids.join(',');
    const query = `
      MATCH (n:Note), (u:User {id: $user_id}) 
      WHERE n.id IN [${s_ids}]
      MERGE (n)-[:NoteOf]->(u);
    `;
    const stmt = await con.prepare(query);
    const result = await con.execute(stmt, { user_id: user_id });

andyfengHKU commented 7 months ago

Hi @lbuczkow thanks for reporting the bug.

Do u mind sharing the original files (csv, parquet or create statements) instead of database folder? After investigation, the bug is due to an inconsistent adjacency list index on the storage side. Consider your example

MATCH (n:Note)-[:NoteOf]->(u:User) WHERE n.id in ...

Result is correct if we extend in direction n -> u and incorrect if extend is in direction u -> n. Meaning our forward and backward adjacency lists are not consistent with each other.

As far as I can recall, this bug should be fixed by @ray6080 already (I couldn't find the specific PR for the fix so @ray6080 should comment in the thread). So I wonder if you have reloaded the data recently which hopefully can fix this bug. If it doesn't work, happy to investigate deeper with the raw files.

Best, Xiyang

ray6080 commented 7 months ago

This should be something different from the previous fix, which was done two weeks ago, before the "23 March".

It looks like the link to binary db file is no longer valid. Can you provide raw csv files or the full script of generating these csv files? I will take a look into it.

lbuczkow commented 7 months ago

@andyfengHKU: the relationship is only n->u. I have not intended to create u->n .

@ray6080: I do not have the original .csv files anymore but they can be recreated using:

COPY (MATCH (u:User) RETURN u.uid, u.id) TO 'user.csv';
COPY (MATCH (n:Note) RETURN n.uid, n.id) TO 'note.csv';
COPY (MATCH (n1:Note)-[:Links]->(n2:Note) RETURN n1.uid, n2.uid) TO 'links.csv';

I had compared the exported files with the original ones before deleting and they differed only in order of records.

I also uploaded them for convenience: https://file.io/Ek2sqzELM9ds

NoteOf has been created directly in the database using a .js script. I experimented a lot during the creation of this relationship and the script went through several modifications. Its execution was interrupted with Ctrl-C so it might contribute to the issue.

The most recent version of the script is:

// ../js/kuzu contains kuzujs.node built from source
// and the corresponding .js api files
const kuzu = require("../js/kuzu"); 
const db = new kuzu.Database("./notes");

async function task() {
  const con = new kuzu.Connection(db);
  let total = 0;
  const q0 = `
    MATCH (n:Note)
    WHERE NOT EXISTS { MATCH (n)-[:NoteOf]->(u:User)}
    RETURN n.id AS id;`;
  const st0 = await con.prepare(q0);
  const res0 = await con.execute(st0);
  const rows0 = await res0.getAll();
  const num_remaining = rows0.length;
  console.log(`remaining ${num_remaining} notes`);
  let remaining = [...rows0.map(r => r.id)];
  for (let user_id = 0; user_id < 1000; user_id++) {
    let ids = [];
    if (remaining.length == 0) {
      console.log('no more free notes');
      break;
    }
    const note_id = remaining[0];
    remaining = remaining.filter(it => it !== note_id);
    ids.push(note_id);
    const q2 = `
      MATCH path = (n0 {id: $note_id})-[* 1..6]->(n1:Note)
      WHERE NOT EXISTS { MATCH (n1)-[:NoteOf]->(:User) } AND n1.id <> $note_id
      RETURN DISTINCT n1.id AS id;
    `;
    const st2 = await con.prepare(q2);
    const res2 = await con.execute(st2, { note_id: note_id });
    const rows2 = await res2.getAll();
    const ids2 = rows2.map(r => r.id).filter((it, idx, arr) => arr.indexOf(it) === idx);
    ids.push(...ids2);
    const cnt = ids.length;
    total += cnt;
    const sids = ids.join(',');
    const q3 = `
      MATCH (n:Note), (u:User {id: $user_id}) 
      WHERE n.id IN [${sids}]
      MERGE (n)-[:NoteOf]->(u);
    `;
    const st3 = await con.prepare(q3);
    const res3 = await con.execute(st3, { user_id: user_id });
    remaining = remaining.filter(it => !ids.includes(it));
    console.log(`user=${user_id} note0=${note_id} cnt=${cnt} total=${total}`);
  }
}

(async () => {
  task();
})();

andyfengHKU commented 7 months ago

Hi @lbuczkow

I basically follow your approach to reproduce the csv files through

COPY (MATCH (u:User) RETURN u.uid, u.id) TO 'user.csv';
COPY (MATCH (n:Note) RETURN n.uid, n.id) TO 'note.csv';
// the following `noteOf.csv` has 999990 lines indeed.
COPY (MATCH (n1:Note)-[:NoteOf]->(n2:User) RETURN n1.uid, n2.uid) TO 'noteOf.csv';

And then recreate a new database directory with

CREATE NODE TABLE User(uid UUID, id INT64, PRIMARY KEY(uid)); 
CREATE NODE TABLE Note(uid UUID, id INT64, PRIMARY KEY(uid)); 
CREATE REL TABLE NoteOf(FROM Note TO User);
COPY User FROM 'user.csv';
COPY Note FROM 'note.csv';
COPY NoteOf FROM 'noteOf.csv';

Then

MATCH (n:Note) WHERE NOT EXISTS { MATCH (n)-[:NoteOf]->(:User) } RETURN count(n);

reports 10 which is expected now.

So I think you could fix be recreating database from scratch (which will apply our fix to the storage since you are using the latest master).

I'll also ask my colleague to execute js code to see if the bug still exist there or not. Will keep u updated.

lbuczkow commented 7 months ago

Yes, this is exactly what I did in order to continue with my tests and for a test scenario it is perfectly ok. But frequent exporting and importing entire database to fix inconsistencies is not an acceptable solution for a database used in production. It is no go for a database engine if a database becomes and remains internally inconsistent. This is why I reported the issue.

lbuczkow commented 7 months ago

I played with the .js script for 3 days before I realized there were inconsistencies, and there were only 10 of them among 1 million records. So reproducing the issue might not be easy.

andyfengHKU commented 7 months ago

It is no go for a database engine if a database becomes and remains internally inconsistent.

Totally agree. Data integrity is one of the most important aspect of database. So fixing these is always our top priority.

So reproducing the issue might not be easy.

We plan to rerun the script on a new database and then validate the storage data. This issue is almost identical with another user reported issue two weeks ago except that his was using python api. So I'm fairly confident that the fix will also apply to your case. Will let u know the result by tmr.

Thanks for reporting again!

ray6080 commented 7 months ago

Hi @lbuczkow , we've got what's wrong from the binary db files. trying to follow your steps to reproduce the problem from scratch. it seems the script takes quite a while to run, and we haven't finished running it yet. Is there a smaller example or simplified script available to reproduce this?

lbuczkow commented 7 months ago

The purpose of the script was to randomly assign groups of notes (neighbouring to each other) to users. A single run of the script takes about an hour and covers only a fraction of all 1000000 notes so it had to be run several times. I was executing it again and again for three days to complete the job. In the meantime I was modifying the logic of the script, but not the part which really makes changes to the database. I also interrupted the script several times (by pressing Ctrl-C). I can only suggest interrupting the script more frequently because I suspect the issue was caused by it, but it is just a pure guess. I do not have any other script to reproduce the issue.

ray6080 commented 7 months ago

okay. thanks for the info. A potential fix is here (https://github.com/kuzudb/kuzu/pull/3165). I will run the scripts a bit more rounds to see if I can verify the problem and the fix. will also see if I can reproduce it locally with some other random generations.

ray6080 commented 7 months ago

This should be fixed now in #3165.

andyfengHKU commented 6 months ago

Closed since the bug is fixed.

kuzudb / kuzu

Kuzu db can become inconsistent #3151