openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com
Other
852 stars 212 forks source link

Virtuoso hangs after bulk load #670

Open borntyping opened 7 years ago

borntyping commented 7 years ago

Hi,

I've been having some problems using the bulk loader. After finishing a load (or sometimes before finishing), the virtuoso-t process hangs and stops responding to all clients and signals. This happens around 10-30 minutes after starting the load (seems to usually be 10, 20 or 30 minutes, but I've not timed it accurately enough). Following the process with strace shows virtuoso-t and every child process waiting on a futex call.

I've tested this across various configurations (built from releases 7.2.1, 7.2.2, and 7.2.4; on CentOS 6 and 7; with set isolation='uncommitted'; and checkpoint_interval(-1);). The datasets being loaded are across multiple graphs, with mix of small and large .ttl and .nt.gz files totaling ~1.6GB. The main version I've tested is 7.2.1 on CentOS 6, which we have previously had success with:

$ virtuoso-t -?
Virtuoso Open Source Edition (Column Store) (multi threaded)
Version 7.2.1.3214-pthreads as of Jul 16 2015
Compiled for Linux (x86_64-unknown-linux-gnu)
Copyright (C) 1998-2015 OpenLink Software
$ cat /etc/redhat-release 
CentOS release 6.6 (Final)
$ uname -a
Linux virtuoso-centos6.gb.local 2.6.32-504.el6.x86_64 #1 SMP Wed Oct 15 04:27:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Enabling TraceOn output hasn't resulted in any useful output at the time it hangs. Monitoring CPU usage shows that it stops doing any work on the CPU at the time it hangs (when it happens during the load), or has a short CPU spike before hanging (when the load has already finished).

I've tried both using the rdf_loader_run methods following the documentation (method 1 below), and using rdf_loader_run to load files one-by-one (method 2 below).

Load method 1 ### `data.isql` ```bash ld_dir('/data/chembl', '*s.nt.gz', 'http://example.com/chembl'); ld_dir('/data/chembl', 'chembl-shapes.ttl', 'http://example.com/shapes'); ld_dir('/data/chembl', 'version.nt.gz', 'http://example.com/shapes'); ld_dir('/data/GeneOntology', 'gene-ontology-from-obo.nt.gz', 'http://example.com/ontology/go'); ld_dir('/data/GeneOntology', 'version.nt.gz', 'http://example.com/shapes'); ld_dir('/data/PDB', '*.nt.gz', 'http://example.com/pdb'); ld_dir('/data/PDB', 'pdb-shapes.ttl', 'http://example.com/shapes'); ld_dir('/data/taxonomy', 'tax*.nt.gz', 'http://example.com/ontology/taxon'); ld_dir('/data/taxonomy', 'version.nt.gz', 'http://example.com/shapes'); ``` ### `load.sh` ```bash isql-v localhost dba dba "data.isql" isql-v localhost dba dba "exec=SELECT ll_file FROM DB.DBA.load_list WHERE ll_state <> 2;" for i in {1..2}; do isql-v localhost dba dba "exec=rdf_loader_run(log_enable => 2);" & done wait isql-v localhost dba dba "exec=checkpoint;" ```
Load method 2 ### `load.isql` ```bash ld_dir('/data/chembl', '*s.nt.gz', 'http://example.com/chembl'); SET isolation='uncomitted'; rdf_loader_run(log_enable => 2); checkpoint; COMMIT WORK; checkpoint; ld_dir('/data/chembl', 'chembl-shapes.ttl', 'http://example.com/shapes'); SET isolation='uncomitted'; rdf_loader_run(log_enable => 2); checkpoint; COMMIT WORK; checkpoint; ld_dir('/data/chembl', 'version.nt.gz', 'http://example.com/shapes'); SET isolation='uncomitted'; rdf_loader_run(log_enable => 2); checkpoint; COMMIT WORK; checkpoint; ld_dir('/data/GeneOntology', 'gene-ontology-from-obo.nt.gz', 'http://example.com/ontology/go'); SET isolation='uncomitted'; rdf_loader_run(log_enable => 2); checkpoint; COMMIT WORK; checkpoint; ld_dir('/data/GeneOntology', 'version.nt.gz', 'http://example.com/shapes'); SET isolation='uncomitted'; rdf_loader_run(log_enable => 2); checkpoint; COMMIT WORK; checkpoint; ld_dir('/data/PDB', '*.nt.gz', 'http://example.com/pdb'); SET isolation='uncomitted'; rdf_loader_run(log_enable => 2); checkpoint; COMMIT WORK; checkpoint; ld_dir('/data/PDB', 'pdb-shapes.ttl', 'http://example.com/shapes'); SET isolation='uncomitted'; rdf_loader_run(log_enable => 2); checkpoint; COMMIT WORK; checkpoint; ld_dir('/data/taxonomy', 'tax*.nt.gz', 'http://example.com/ontology/taxon'); SET isolation='uncomitted'; rdf_loader_run(log_enable => 2); checkpoint; COMMIT WORK; checkpoint; ld_dir('/data/taxonomy', 'version.nt.gz', 'http://example.com/shapes'); SET isolation='uncomitted'; rdf_loader_run(log_enable => 2); checkpoint; COMMIT WORK; checkpoint; ```

I'm sorry this is so vague - I've had very little success working out why this is happening, so any advice on what I might be doing wrong or could look at next would be appreciated. I can try and collect more data if needed.

Thanks :sweat_smile:

HughWilliams commented 7 years ago

I assume you have performance tuned your Virtuoso instance as detailed at http://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFPerformanceTuning ?

When the Virtuoso server is still accessible what does the output of running the "status();" command report as to the status of the server at that point ?

What is is the "virtuoso.log" file from the point the load start to the hang ?

In your second load method why do you 'SET isolation='uncomitted';" as we normally run with the default of read committed ?

Note also that rdf_loader_run() run with log_enable => 2 by default ...

borntyping commented 7 years ago

I assume you have performance tuned your Virtuoso instance as detailed at http://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFPerformanceTuning ?

I've been following http://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFPerformanceTuning and http://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader, using the recommended NumberOfBuffers for VMs with 16GB and 64GB of memory as appropriate.

When the Virtuoso server is still accessible what does the output of running the "status();" command report as to the status of the server at that point ? What is is the "virtuoso.log" file from the point the load start to the hang ?

Last successful status(); output and virtuoso.log from a load of ~157MB data (compressed), where the load completed and virtuoso then hung: https://gist.github.com/borntyping/292af84d32389f2621dc575dd7fb8aec

No output at or after the time it stopped responding (around ~13:38). Includes a lot of non-notable output from queries intended to collect data on the database's status (i.e. calls to status and queries of the DB.DBA.load_list table.

In your second load method why do you 'SET isolation='uncomitted';" as we normally run with the default of read committed ? Note also that rdf_loader_run() run with log_enable => 2 by default ...

Alongside following the documentation I've been following the steps taken by a legacy internal codebase in case I missed something. I can probably discard that now.

HughWilliams commented 7 years ago

The virtuoso log and status() output do not indicate possible cause of the problem ...

Is the dataset being loaded a public dataset, as if this can be made available we could attempt a local recreation of the issue ?

borntyping commented 6 years ago

We've finally found a workaround for the issue we've been having, though we don't have any idea why we're seeing Virtuoso behave like this: if we run the "status();" function via ISQL during a load, Virtuoso becomes unresponsive at some later point in the load (i.e. after "status();" has returned it's data successfully). This bug seems to be consistent for us, with loads usually failing quite quickly (within an hour).

We aren't easily able to provide a dataset to recreate this with, but have had this problem with multiple different datasets. As we've found a workaround for this, feel free to close the ticket if you can't recreate this.

HughWilliams commented 6 years ago

Hi Sam,

Are you saying the hang only occurs sometime after the "status();" command is run, and if you thus don't run the status() command the load is consistently successful ?

borntyping commented 6 years ago

Yes, that's exactly it. We're not sure why but this has been consistent across all our recent/recorded attempts to load data.