oxigraph / oxigraph

SPARQL graph database
Apache License 2.0
998 stars 67 forks source link

Another strangely slow query #926

Open joepie91 opened 1 month ago

joepie91 commented 1 month ago

Describe the bug The following query:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX task: <http://srap.cryto.net/rdf/task/1.0/>
PREFIX error: <http://srap.cryto.net/rdf/error/1.0/>
PREFIX anon: <http://srap.cryto.net/rdf/object/1.0/>
PREFIX item: <http://srap.cryto.net/rdf/item/1.0/>
PREFIX test: <http://seekseek.org/rdf/test/>

SELECT *
WHERE {
        VALUES (?item ?task) {
                (test:anon:0b345f1e-4ea4-4ee0-a34c-e691c20ac74c test:normalizeItem)
        } .
        ?taskRun task:uri ?task .
        ?taskRun task:item ?item .
} LIMIT 10

... is considerably slower than I'd expect:

Node {
  name: "Limit(10)",
  number of results: 1,
  duration in seconds: 5.545069196 s,
  children: [Node {
    name: "Project(?item, ?task, ?taskRun)",
    number of results: 1,
    duration in seconds: 5.545061422 s,
    children: [Node {
      name: "ForLoopJoin",
      number of results: 1,
      duration in seconds: 5.545058637 s,
      children: [Node {
        name: "StaticBindings(?item, ?task)",
        number of results: 1,
        duration in seconds: 1.272 µs
      }, Node {
        name: "ForLoopJoin",
        number of results: 1,
        duration in seconds: 5.545047265 s,
        children: [Node {
          name: "QuadPattern(?taskRun <http://srap.cryto.net/rdf/task/1.0/uri> ?task DEFAULT)",
          number of results: 457325,
          duration in seconds: 497.741566 ms
        }, Node {
          name: "QuadPattern(?taskRun <http://srap.cryto.net/rdf/task/1.0/item> ?item DEFAULT)",
          number of results: 1,
          duration in seconds: 4.425319405 s
        }]
      }]
    }]
  }]
}

If I understand the query plan correctly, it seems that it does a loop join on the two patterns first (with an ungrounded ?item and ?task) and only then filters the result set through the static bindings, while it would be much faster to start with the static bindings and only then apply the patterns.

Oddly, a similarly structured query elsewhere in my codebase:


PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX task: <http://srap.cryto.net/rdf/task/1.0/>
PREFIX error: <http://srap.cryto.net/rdf/error/1.0/>
PREFIX anon: <http://srap.cryto.net/rdf/object/1.0/>
PREFIX item: <http://srap.cryto.net/rdf/item/1.0/>
PREFIX test: <http://seekseek.org/rdf/test/>

SELECT ?tag ?item WHERE {
        VALUES (?item) {
                (test:a)
                (test:b)
                (test:c)
        } .
        ?item item:tag ?tag .
}

... does behave as I would expect:

Node {
  name: "Project(?tag, ?item)",
  number of results: 0,
  duration in seconds: 151.078 µs,
  children: [Node {
    name: "ForLoopJoin",
    number of results: 0,
    duration in seconds: 143.874 µs,
    children: [Node {
      name: "StaticBindings(?item)",
      number of results: 3,
      duration in seconds: 1.384 µs
    }, Node {
      name: "QuadPattern(?item <http://srap.cryto.net/rdf/item/1.0/tag> ?tag DEFAULT)",
      number of results: 0,
      duration in seconds: 126.02 µs
    }]
  }]
}

It appears that the addition of a second pattern somehow knocks the static bindings out of the first step in the query plan?

To Reproduce Steps to reproduce the behavior:

  1. Running oxigraph_server 0.3.22, on NixOS
  2. (See above query)
Tpt commented 1 month ago

If I understand the query plan correctly, it seems that it does a loop join on the two patterns first (with an ungrounded ?item and ?task) and only then filters the result set through the static bindings, while it would be much faster to start with the static bindings and only then apply the patterns.

I am not sure this is correct, the way Node { name: "ForLoopJoin", children: [A, B]} is evaluated is:

for t in A:
   yield from B(t)

So, ?taskRun task:uri ?task ; task:item ?item . should be evaluated with the variables in the VALUES clause bounded.

joepie91 commented 1 month ago

Thanks, that bit of information helped me figure out what's actually going on here: the query is unexpectedly slow, but not for the reason I thought. Rephrasing the structure of how the query originally got planned:

For each static binding, select all task runs with a matching task, then filter those down to only the ones with a matching item.

Because in my dataset of task runs there are vastly more matching tasks for a given task, than there are matching items for a given item (each item will usually have a single-digit amount of tasks associated with it, but those tasks are similar across all items ), that means that it is considering a lot of potential taskRuns for entirely different items, and that's why it's slow.

Swapping around the task and item patterns, the query time drops down from ~6s to 0.008s - as now it immediately selects only the handful of relevant items:

http: //seekseek.org/rdf/test/anon:0b345f1e-4ea4-4ee0-a34c-e691c20ac74c,http://seekseek.org/rdf/test/normalizeItem,http://srap.cryto.net/rdf/object/1.0/02019284-4233-4d3d-96e9-75bf7eb7be53
  Node {
    name: "Limit(10)",
    number of results: 1,
    duration in seconds: 309.771 µs,
    children: [Node {
      name: "Project(?item, ?task, ?taskRun)",
      number of results: 1,
      duration in seconds: 302.447 µs,
      children: [Node {
        name: "ForLoopJoin",
        number of results: 1,
        duration in seconds: 301.102 µs,
        children: [Node {
          name: "StaticBindings(?item, ?task)",
          number of results: 1,
          duration in seconds: 992n s
        }, Node {
          name: "ForLoopJoin",
          number of results: 1,
          duration in seconds: 298.216 µs,
          children: [Node {
            name: "QuadPattern(?taskRun <http://srap.cryto.net/rdf/task/1.0/item> ?item DEFAULT)",
            number of results: 2,
            duration in seconds: 214.158 µs
          }, Node {
            name: "QuadPattern(?taskRun <http://srap.cryto.net/rdf/task/1.0/uri> ?task DEFAULT)",
            number of results: 1,
            duration in seconds: 75.253 µs
          }]
        }]
      }]
    }]
  }

So, concretely, this issue can be resolved in my code by swapping the two patterns, to prefer the more selective pattern first.

From an Oxigraph perspective, I'm not sure what the goals are for the query planner design, and how clever it is meant to be - is it supposed to catch cases like this, and reorder the patterns so that it puts the more selective pattern first, or is that meant to be the job of the query author? I guess that depending on the answer to that, this may or may not be a query planner bug.

(As an aside, it may be helpful to document a bit more about how to read the query planner output, and how it internally schedules different operations - I was unable to find this information in the documentation, and it would have probably helped to catch this sooner :p)

joepie91 commented 1 month ago

On a possibly(?) related note, a single-triple INSERT that uses the aforementioned query (the fast version) as its WHERE clause still continues to take about 3 seconds to evaluate, eating a CPU core during that time, and I'm unable to find a way to profile this.

The query in question:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX task: <http://srap.cryto.net/rdf/task/1.0/>
PREFIX error: <http://srap.cryto.net/rdf/error/1.0/>
PREFIX anon: <http://srap.cryto.net/rdf/object/1.0/>
PREFIX item: <http://srap.cryto.net/rdf/item/1.0/>
PREFIX test: <http://seekseek.org/rdf/test/>

INSERT {
    ?item task:invalidated true
}
WHERE {
    VALUES (?item ?task) {
        (test:anon:1c4ddd25-672f-44a3-80ed-e2aeba4e9c03 test:normalizeItem)
    } .
    ?taskRun task:item ?item .
    ?taskRun task:uri ?task .
}

(The evaluation time is constant regardless of the specific ?item as far as I can tell)

This may be a different issue entirely, but I'm not sure anything meaningful can be said about it without a way to profile it.

Tpt commented 1 month ago

From an Oxigraph perspective, I'm not sure what the goals are for the query planner design, and how clever it is meant to be - is it supposed to catch cases like this, and reorder the patterns so that it puts the more selective pattern first, or is that meant to be the job of the query author? I guess that depending on the answer to that, this may or may not be a query planner bug.

On the long term, the query planner should be able to figure what is the most selective pattern and execute it first. However, Oxigraph is far from being there yet, the query planner is very basic. Currently, it executes patterns with more variable bound first and on a tie relies on the query order. For example "?s a ex:Foo" is going to be evaluated before "?s ex:p ?o" but "?s ex:p1 ?o" and "?s ex:p2 ?o" are considered similar and not reordered. I hope to fix that in the future using statistics but it's not done yet.

(As an aside, it may be helpful to document a bit more about how to read the query planner output, and how it internally schedules different operations - I was unable to find this information in the documentation, and it would have probably helped to catch this sooner :p)

Indeed. We should document that more.

On the update: the performance is weird. Maybe it's because of transactional locking.

This may be a different issue entirely, but I'm not sure anything meaningful can be said about it without a way to profile it.

Good point. I should add explanations to update (more work to do...).

joepie91 commented 1 month ago

(more work to do...)

On that note, if I wanted to contribute to Oxigraph, are there any good first issues to work on, so that I can become familiar with the codebase? The 'good first issues' tag is currently looking pretty empty :)

(Can't hard-commit to contributing yet at this time, will have to give it a shot first with a relatively simple task and see whether my brain cooperates with me on that)

Tpt commented 1 month ago

On that note, if I wanted to contribute to Oxigraph, are there any good first issues to work on, so that I can become familiar with the codebase? The 'good first issues' tag is currently looking pretty empty :)

Amazing! I do not have much easy task for beginners. Based on your interest I see:

  1. Contribute to the NodeJS bindings. The current ones are very basic and rely on WASM. Something with Neon/NAPI.rs might be very welcome! I am not a big user of them so they have been left quite unmaintained.
  2. Contribute some optimization to sparopt. The optimizer is pretty naive and improving it should not be too hard. There are also already a few tests covering it.
  3. Add a explain_update_opt function to get explanations of SPARQL updates