openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com

Poor query performance after INSERT #1023

Open RyanOattes opened 2 years ago

RyanOattes commented 2 years ago

I have an application loading data into a new graph in Virtuoso using INSERT DATA queries for a total of about 8M triples. After that is done, I run a test query that spans 5 classes, does a COUNT, and returns about 3000 rows. (The details of the test query don't seem to matter, but this is the minimum level of complexity where the issue I'm describing below is obvious.) This is on a Virtuoso node with 4x16 Virtuoso node.

What I'm seeing is poor query performance (~15 seconds) until I restart the Virtuoso server, at which point performance reliably improves to ~1 second. This is reliably reproducible, and simply waiting (12 hours) after the INSERTs doesn't improve it. A restart is required.

I dug into it and found that the same query, on the same graph, produces wildly different query plans before and after the server restart. I suspect this is an issue of statistics generation guiding the query planner, but I can't find anything in the docs other than a reference to dropping and rebuilding indices. When I restart the server, it starts very quickly, so I assume whatever it's doing doesn't take long. I was hoping there was a "recalculate statistics" stored procedure I could run, but so far I'm not finding it.

Am I missing a crucial step after inserting data?

TallTed commented 2 years ago

This is on a Virtuoso node with 4x16 Virtuoso node.

I don't understand the above sentence... Could you rephrase it?

The exact steps of your data load, and of your Virtuoso restart, may provide an explanation of your experience, which is not typical. Examination of Virtuoso's active INI and LOG files, along with key details of the runtime environment (such as total system RAM; RAM reserved for Virtuoso; storage device types [e.g., HDD vs SSD], interfaces [e.g., SCSI vs IDE/PATA vs SATA], and devices-per-controller count) may also reveal clues.

RyanOattes commented 2 years ago

Sorry, mangled that sentence. I meant to indicate that Virtuoso has 4 CPUs and 16 GB RAM.

Virtuoso Detail:

Virtuoso is running on Kubernetes with this image: openlink/virtuoso-opensource-7:latest. The pod has 4 CPUs and 14000Mi of memory assigned. Storage is an EBS persistent volume. The underlying EC2 compute node has 8 CPUs and 32 GB RAM. virtuoso.ini is attached.

Process Details:

To load the data, we are calling the /sparql-auth interface with INSERT DATA queries in batches of 5000 triples (each service call contains 100 INSERT DATA statements) until we've ingested 8 million triples. The test queries are also run via the /sparql-auth interface. When I say I restarted Virtuoso, I mean restarting the Kubernetes pod it runs in, by scaling its Kubernetes deployment from 1 to 0 and back to 1. Logs from the last two restarts are pasted below.
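As a rough sketch of that batching scheme (graph IRI, sizes, and helper names here are illustrative placeholders, not our actual code):

```python
# Sketch of the batching described above: 50 triples per INSERT DATA
# statement, 100 statements per service call (5000 triples per call).
# Graph IRI, sizes, and helper names are illustrative placeholders.

def insert_data_stmt(triples, graph):
    """Build one INSERT DATA statement from N-Triples-style lines."""
    body = "\n".join(triples)
    return f"INSERT DATA {{ GRAPH <{graph}> {{\n{body}\n}} }}"

def build_batches(triples, graph, triples_per_stmt=50, stmts_per_call=100):
    """Group triples into statements, then statements into request bodies.

    Each returned body is one SPARQL 1.1 Update request: multiple update
    operations joined with ';' (defaults give 5000 triples per body).
    """
    stmts = [insert_data_stmt(triples[i:i + triples_per_stmt], graph)
             for i in range(0, len(triples), triples_per_stmt)]
    return [" ;\n".join(stmts[i:i + stmts_per_call])
            for i in range(0, len(stmts), stmts_per_call)]
```

Each body is then POSTed to the /sparql-auth endpoint (which uses HTTP digest authentication).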

        Tue Apr 05 2022
14:59:45 { Loading plugin 1: Type `plain', file `wikiv' in `/opt/virtuoso-opensource/hosting'
14:59:45   FAILED  plugin 1: Unable to locate file }
14:59:45 { Loading plugin 2: Type `plain', file `mediawiki' in `/opt/virtuoso-opensource/hosting'
14:59:45   FAILED  plugin 2: Unable to locate file }
14:59:45 { Loading plugin 3: Type `plain', file `creolewiki' in `/opt/virtuoso-opensource/hosting'
14:59:45   FAILED  plugin 3: Unable to locate file }
14:59:45 OpenLink Virtuoso Universal Server
14:59:45 Version 07.20.3233-pthreads for Linux as of Jun 22 2021
14:59:45 uses OpenSSL 1.0.2u  20 Dec 2019
14:59:45 uses parts of PCRE, Html Tidy
14:59:48 Database version 3126
14:59:48 Unlinked the temp db file /data/virtuoso/virtuoso-temp.db as its size (12MB) was greater than TempDBSize INI (10MB)
14:59:48 SQL Optimizer enabled (max 1000 layouts)
14:59:49 Compiler unit is timed at 0.000125 msec
14:59:51 Roll forward started
14:59:51 Roll forward complete
14:59:52 Checkpoint started
14:59:52 Checkpoint finished, log reused
14:59:52 HTTP/WebDAV server online at 8890
14:59:52 Server online at 1111 (pid 1)
15:00:05 * Monitor: Locks are held for a long time
15:00:40 Can't open file /opt/virtuoso-opensource/database//sesn7WJ6i, error 30
15:00:41 Can't open file /opt/virtuoso-opensource/database//ses0T2bjR, error 30
15:01:33 Can't open file /opt/virtuoso-opensource/database//sesBNpiSG, error 30
15:30:08 Server received signal 2
15:30:08 Initiating normal shutdown
15:30:09 Checkpoint started
15:30:09 Checkpoint finished, log reused
15:30:09 Server shutdown complete

        Tue Apr 05 2022
15:30:41 { Loading plugin 1: Type `plain', file `wikiv' in `/opt/virtuoso-opensource/hosting'
15:30:41   FAILED  plugin 1: Unable to locate file }
15:30:41 { Loading plugin 2: Type `plain', file `mediawiki' in `/opt/virtuoso-opensource/hosting'
15:30:41   FAILED  plugin 2: Unable to locate file }
15:30:41 { Loading plugin 3: Type `plain', file `creolewiki' in `/opt/virtuoso-opensource/hosting'
15:30:41   FAILED  plugin 3: Unable to locate file }
15:30:41 OpenLink Virtuoso Universal Server
15:30:41 Version 07.20.3233-pthreads for Linux as of Jun 22 2021
15:30:41 uses OpenSSL 1.0.2u  20 Dec 2019
15:30:41 uses parts of PCRE, Html Tidy
15:30:44 Database version 3126
15:30:44 Unlinked the temp db file /data/virtuoso/virtuoso-temp.db as its size (82MB) was greater than TempDBSize INI (10MB)
15:30:44 SQL Optimizer enabled (max 1000 layouts)
15:30:45 Compiler unit is timed at 0.000124 msec
15:30:46 Roll forward started
15:30:46 Roll forward complete
15:30:47 Checkpoint started
15:30:47 Checkpoint finished, log reused
15:30:47 HTTP/WebDAV server online at 8890
15:30:47 Server online at 1111 (pid 1)

virtuoso.ini.txt

TallTed commented 2 years ago

From a quick look over things, I note you're running from the latest stable/7 codebase, which is about 9 months old. There are a raft of improvements in the latest develop/7 codebase, which is being prepped for a new stable/7 release, but that's still some days away at best. I'm not aware of any commits that are directly related to your experience, but you may want to test with a newer build.

Next, I'm wondering whether you're running a CHECKPOINT(); after the data inserts. It appears not -- and that could explain your experience, though it would be a bit surprising, especially as you've said that waiting 12 hours doesn't make a difference, and your INI file has CheckpointInterval = 60 (minutes), so you should see 11 checkpoints in such a log. This call cannot currently be executed through any of the /sparql interfaces; it must be run through a SQL-based tool, like the iSQL implementations, or your preferred ODBC-, JDBC-, or ADO.NET-based tool. It may be worth a quick manual execution after data load, just as a test.
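For instance, assuming the default port and credentials are reachable from inside the pod, a quick manual run through isql would look like:

SQL> checkpoint;

(The statement blocks until the checkpoint completes; the server log should then show a matching "Checkpoint started" / "Checkpoint finished" pair.)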

I also notice a few errors in the first log, like Can't open file /opt/virtuoso-opensource/database//sesn7WJ6i, error 30. I'm guessing these are related to your data load? Probably not directly related to this issue, but possibly something you'll want to look into.

Please let me know if any of the above resolves the issue.

RyanOattes commented 2 years ago

No luck so far. The checkpoint didn't make a difference. I haven't tried the develop/7 codebase, but I'll do that next.

I did, however, confirm my earlier observation about the query plans being different. I ingested one graph, restarted Virtuoso, then ingested another graph. Running the same query against both (modified for IRIs) yielded totally different SQL and query plans. The plans are Greek to me, but I'm attaching them here in case they point to anything for you.

In Conductor, I looked up the stats for both graphs, and they are identical (triples, classes, entities, etc.). Is there some other piece of metadata I should be looking for? Is it possible the indexes on RDF_QUAD/RDF_IRI aren't getting updated?
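For reference, headline numbers of that kind can also be cross-checked with plain SPARQL via isql (the graph IRI below is a placeholder for one of the two graphs):

SQL> sparql SELECT (COUNT(*) AS ?triples) (COUNT(DISTINCT ?class) AS ?classes)
     WHERE { GRAPH <http://example.com/ingest6> { ?s ?p ?o . OPTIONAL { ?s a ?class } } };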

Ingest7Good.txt Ingest6Bad.txt

TallTed commented 2 years ago

Interesting. There does appear to be something weird going on. I don't think it's the indexes, as they aren't automatically rebuilt/updated on shutdown nor on startup. If your test with a build from the current develop/7 codebase doesn't resolve this, I think it will need a deeper look by Development.

RyanOattes commented 2 years ago

Found a few more details, but no answers yet. Some background: my app can begin an ingestion job in a couple of ways, based on what kind of configuration changes it detects since the last run. If it has a clear idea of which data needs to be removed before ingesting, it runs some targeted DELETE SPARQL queries first, then INSERT DATA queries. If there are changes it doesn't fully support, it runs CLEAR GRAPH, then INSERT DATA. The cases where I'm experiencing this issue are all new ingests (no graph exists in Virtuoso to begin with), but due to various configurations some start with DELETE and some start with CLEAR.

I've discovered that the CLEAR cases all work fine, and the DELETE cases all work poorly. In both cases, the graph is created in Virtuoso and contains identical data. Counts, stats, etc, all match. I also found today that if I take a poorly performing graph URI (created with the DELETE, then INSERT method), delete the graph in Conductor, then rerun the process using the CLEAR method it STILL doesn't work.

Re-stating the above without the app-specific details: Two methods produce the same graph, but with different query performance. If I take a poorly performing graph, delete it, and recreate it with the good method, it still doesn't work well. It seems something "sticky" happens when the graph URI is created the first time. And, of course, all of this is resolved by bouncing the server.
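Schematically, the two paths look like this (the graph IRI and triples are placeholders, and the real deletes are more targeted patterns):

# "DELETE path": pattern deletes, then inserts
SQL> sparql DELETE WHERE { GRAPH <http://example.com/g> { ?s ?p ?o } };
SQL> sparql INSERT DATA { GRAPH <http://example.com/g> { <urn:s> <urn:p> <urn:o> } };

# "CLEAR path": drop the graph wholesale, then inserts
SQL> sparql CLEAR GRAPH <http://example.com/g>;
SQL> sparql INSERT DATA { GRAPH <http://example.com/g> { <urn:s> <urn:p> <urn:o> } };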

As for the develop/7 code — I don't see any builds on the develop branches in dockerhub. Are there images available?

TallTed commented 2 years ago

We do not currently publish develop/7-based docker images, nor pre-built develop/7 binaries, so I think our team will need to do further testing with this locally.

Would you mind getting a test license and trying with Enterprise Edition v8? We do publish Enterprise Edition v8 docker images; the main difference in what you need to do to make things work is to put that license file in the database/ directory alongside the virtuoso.ini on your local filesystem, so the docker image can pick it up on the next startup. The Enterprise Edition v8 codebase is substantially ahead of the develop/7 branch, and if this issue is not seen there, we clearly have a fix, and just need to backport it to the develop/7 branch; if this issue persists there, the importance of its resolution increases substantially.

The specifics of the data you're working with may matter, as surprising as this may be. Is there any chance you can provide some sample data for us to work with? (If you can reproduce the issue with a public dataset that we can also acquire, that would also serve the purpose.) Basically, the more specific the data and action sequence you can provide, the faster we can test locally, and thence find a resolution.

RyanOattes commented 2 years ago

Sorry for the delay. We have a release happening this week and I haven't been able to try v8 yet. Will do that and let you know.

Whether or not that resolves the issue, I can provide the data we're using. It's a set we use for performance testing our product at customer installations, so there's nothing private in it. I could provide you access to a Virtuoso instance with the data already ingested, but if you'd rather have it as TTL files or as files of INSERT DATA queries, I'll have to script something to generate that.

imitko commented 2 years ago

@RyanOattes

In addition to the query execution plan, you can do PROFILE([query_text]) to get timings on each step of the plan. An explanation and example can be found in this article.

Another way to profile a SQL/SPARQL query is to connect to the database with the command-line isql tool, then run the following sequence:

SQL> set profile on;
SQL> sparql [sparql query here];
SQL> set profile off;

Some statistics about the datasets would also help to identify the problem.

Note that DELETE does not completely erase the RDF graph structure; to remove it completely, you need to use CLEAR GRAPH.