kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License
1.09k stars 80 forks source link

Fix bugs in Connection::interrupt #1828

Closed benjaminwinger closed 10 months ago

benjaminwinger commented 10 months ago

This is related to one of the failing tests in #1825, CApiConnectionTest.Interrupt. But this still doesn't seem to fix it as I'm still getting a hung process when running it with asan locally. (Fixed) However it was originally also sometimes segfaulting when I tested it locally, and that doesn't happen any more.

Previously, interrupting before a query has executed would mean the activeQuery would be null when interrupt tries to set its interrupted flag. Additionally, the detached thread in the test might not call the interrupt function until after the connection has been destroyed.

I removed the activeQuery entirely since assigning to a unique_ptr is not atomic, so attempting to interrupt when a query starts might cause it to write to the old memory location after it has been deleted and replaced with a new activeQuery object (unlikely, but I think is possible if a context switch occurs at the wrong time). And it only seemed to be used to initialize the interrupted flag and the timer, which could be done from a simple function instead without the possibility of memory errors. (re-added following feedback on slack)

Note that it's still possible for there to be a data race if you interrupt at around the same time the interrupted field gets reset when the query starts. That could probably be handled by providing a return value indicating if the interruption occurred when the query is actually running.

codecov[bot] commented 10 months ago

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (1e9d4fd) 91.19% compared to head (883134a) 91.20%.

:exclamation: Current head 883134a differs from pull request most recent head 50cd096. Consider uploading reports for the commit 50cd096 to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1828 +/- ## ======================================= Coverage 91.19% 91.20% ======================================= Files 779 779 Lines 28606 28605 -1 ======================================= Hits 26088 26088 + Misses 2518 2517 -1 ``` | [Files Changed](https://app.codecov.io/gh/kuzudb/kuzu/pull/1828?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=kuzudb) | Coverage Δ | | |---|---|---| | [src/main/connection.cpp](https://app.codecov.io/gh/kuzudb/kuzu/pull/1828?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=kuzudb#diff-c3JjL21haW4vY29ubmVjdGlvbi5jcHA=) | `93.20% <ø> (-0.03%)` | :arrow_down: | | [src/include/main/client\_context.h](https://app.codecov.io/gh/kuzudb/kuzu/pull/1828?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=kuzudb#diff-c3JjL2luY2x1ZGUvbWFpbi9jbGllbnRfY29udGV4dC5o) | `100.00% <100.00%> (ø)` | | | [src/main/client\_context.cpp](https://app.codecov.io/gh/kuzudb/kuzu/pull/1828?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=kuzudb#diff-c3JjL21haW4vY2xpZW50X2NvbnRleHQuY3Bw) | `92.30% <100.00%> (ø)` | | ... and [1 file with indirect coverage changes](https://app.codecov.io/gh/kuzudb/kuzu/pull/1828/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=kuzudb)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

benjaminwinger commented 10 months ago

I think this functionally makes CI worse since it now always times out instead of crashing and ending quickly.

benjaminwinger commented 10 months ago

With asan I think the 100ms delay before interruption is too little. But any delay could be too little with a high enough load on the machine, so I set it to interrupt repeatedly until the query exits.

acquamarin commented 10 months ago

With asan I think the 100ms delay before interruption is too little. But any delay could be too little with a high enough load on the machine, so I set it to interrupt repeatedly until the query exits.

I am a little confused. If the user issues an interrupt to a connection, the interruptted flag will be set to true. When the query is running, the Sink::getNextTuple() will constantly check whether the interrupt flag is true or not. So i don't think the interrupt will fail if we interrupt the query too early.

benjaminwinger commented 10 months ago

The interrupt function will fail to have any effect if it interrupts before the interrupt flag gets reset to false when the query starts. That's as far as I can tell why the interrupt test kept hanging with asan, as between debug mode and asan it takes longer to do everything (before this patch it was crashing, since in the same situation it was dereferencing a null pointer).

Maybe it would make more sense to reset the interrupt flag at the end of the query instead of the beginning. Then the only possible data race would be if you interrupt between when the query has finished and when it resets the interrupt flag, in which case it shouldn't really matter since there's nothing to interrupt (at least, unless you're trying to interrupt the next query). But I assume that in real world use-cases you would only try to interrupt once the query has started and is obviously hung or taking a long time, so it's probably not something that will come up outside of tests.

acquamarin commented 10 months ago

I think for the c interrupt test, we should swap the code for interrupt with the code that executes the query. I mean should we place the kuzu_connection_query before the interrupt code?

/ This may happen too early, so try again until the query function finishes.
    std::thread t([&connection, &finished]() {
        do {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            kuzu_connection_interrupt(connection);
        } while (!finished);
    });
    t.detach();
    auto result = kuzu_connection_query(
        connection, "MATCH (a:person)-[:knows*1..28]->(b:person) RETURN COUNT(*);");
    finished = true;
benjaminwinger commented 10 months ago

I think for the c interrupt test, we should swap the code for interrupt with the code that executes the query. I mean should we place the kuzu_connection_query before the interrupt code?

kuzu_connection_query blocks, and the query seems to be designed to run for an extremely long time (hence the need for a successful interrupt). It needs to run in a separate thread, and be started beforehand, so that it can run while the query is executing.