openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com

How to export all nodes/edges of the RDF graph? #809

Open szarnyasg opened 5 years ago

szarnyasg commented 5 years ago

I am working on graph database benchmarking and graph analytics as a member of the Linked Data Benchmark Council (cc @mirkospasic, who has been a member), and I collaborate with @saleem-muhammad of the AKSW group.

I tried to export all nodes and edges for graph analysis from benchmark datasets that are available as Virtuoso instances. I used the isql interface via the following Bash script:

echo "SPARQL SELECT DISTINCT ?v WHERE { { ?s ?p ?v . FILTER (isIRI(?v) && ?p != <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>) } UNION { ?v ?p ?o . FILTER (isIRI(?v) && ?p != <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>) } };" | ./isql -S $PORT | tail -n +9 | head -n -3 > ../../$DB-nodes.csv
echo "SPARQL SELECT DISTINCT ?s ?p ?o WHERE { ?s ?p ?o . FILTER (isIRI(?o) && ?p != <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>)};" | ./isql -S $PORT | tail -n +9 | head -n -3 | sed "s/\s\+/\t/g" > ../../$DB-edges.csv

Here are the SPARQL queries formatted for readability.

Query for extracting nodes:

SPARQL SELECT DISTINCT ?v
WHERE {
  {
    ?s ?p ?v .
    FILTER (isIRI(?v) && ?p != <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>)
  } UNION {
    ?v ?p ?o .
    FILTER (isIRI(?v) && ?p != <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>)
  }
};

Query for extracting edges:

SPARQL SELECT DISTINCT ?s ?p ?o
WHERE {
  ?s ?p ?o .
  FILTER (isIRI(?o) && ?p != <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>)
};

I ran into out-of-memory issues when exporting edges for the BioBench dataset, which consists of approximately 1.45B triples. I used two cloud VMs for my experiments, one with 240 GB and one with 750 GB of system RAM.

In both cases, I increased the number of buffers in virtuoso.ini by scaling up the recommended values to the available system memory.

For 240 GB system RAM:

NumberOfBuffers           = 15000000
MaxDirtyBuffers           = 11500000

For 750 GB system RAM:

NumberOfBuffers           = 40000000
MaxDirtyBuffers           = 30000000
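As a rough sanity check on this scaling (a sketch only, using the rule of thumb from the comments in the stock virtuoso.ini, which suggest dedicating roughly 2/3 of system RAM to buffers at about 8700 bytes per buffer):

# 240 GB machine: 2/3 of RAM / ~8700 bytes per buffer ≈ 19.7M, so 15M buffers stays below that ceiling
echo $(( 240 * 1024 * 1024 * 1024 * 2 / 3 / 8700 ))
# 750 GB machine: the same rule gives ≈ 61.7M, so 40M buffers is also within range
echo $(( 750 * 1024 * 1024 * 1024 * 2 / 3 / 8700 ))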

Are there any ways to get these queries working? Alternatively, if I could dump the entire database to an .nt file, that would work perfectly well for my use case: I could simply pipe the file through grep and get the desired triples.
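On the dump-to-file idea: the Virtuoso documentation on RDF dump and reload describes a dump_one_graph stored procedure that writes a named graph out as a set of .ttl files. A minimal sketch of invoking it through isql, assuming the procedure body has already been created from the docs; the graph IRI, output prefix, and 1 GB file-size cap below are placeholders, and the output directory has to be listed in DirsAllowed in virtuoso.ini:

echo "dump_one_graph ('http://example.com/biobench', './dumps/biobench_', 1000000000);" | ./isql -S $PORT

The resulting files could then be filtered with grep as described above, with one call per graph IRI that needs exporting.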

The log for the crash is below:

                Sun Nov 18 2018         
09:44:24 INFO: { Loading plugin 1: Type `plain', file `wikiv' in `../lib/virtuoso/hosting'
09:44:24 ERROR:   FAILED  plugin 1: Unable to locate file }
09:44:24 INFO: { Loading plugin 2: Type `plain', file `mediawiki' in `../lib/virtuoso/hosting'
09:44:24 ERROR:   FAILED  plugin 2: Unable to locate file }
09:44:24 INFO: { Loading plugin 3: Type `plain', file `creolewiki' in `../lib/virtuoso/hosting'
09:44:24 ERROR:   FAILED  plugin 3: Unable to locate file }
09:44:24 INFO: OpenLink Virtuoso Universal Server
09:44:24 INFO: Version 07.20.3217-pthreads for Linux as of Dec 15 2017
09:44:24 INFO: uses parts of OpenSSL, PCRE, Html Tidy                         
09:45:00 INFO: Database version 3126                                       
09:45:00 INFO: SQL Optimizer enabled (max 1000 layouts)
09:45:01 INFO: Compiler unit is timed at 0.000154 msec
09:45:29 INFO: Roll forward started
09:45:29 INFO: Roll forward complete
09:45:34 INFO: Checkpoint started
09:45:35 INFO: Checkpoint finished, log reused
09:45:35 INFO: HTTP/WebDAV server online at 8887
09:45:35 INFO: Server online at 1109 (pid 19518)
10:45:37 INFO: Write load very high relative to disk write throughput.  Flushing at     7e+03 MB/s while application is making dirty pages at     7e+03 MB/s. To checkpoint the database, will now pause the workload with 5536 MB unflushed.
10:45:42 INFO: Checkpoint started
10:45:43 INFO: Checkpoint finished, log reused
11:45:45 INFO: Write load very high relative to disk write throughput.  Flushing at   6.9e+03 MB/s while application is making dirty pages at   6.9e+03 MB/s. To checkpoint the database, will now pause the workload with 5754 MB unflushed.
11:45:50 INFO: Checkpoint started
11:45:51 INFO: Checkpoint finished, log reused
12:45:53 INFO: Write load very high relative to disk write throughput.  Flushing at   1.6e+04 MB/s while application is making dirty pages at   1.6e+04 MB/s. To checkpoint the database, will now pause the workload with 14063 MB unflushed.
12:46:04 INFO: Checkpoint started
12:46:05 INFO: Checkpoint finished, log reused
13:46:08 INFO: Write load very high relative to disk write throughput.  Flushing at     1e+04 MB/s while application is making dirty pages at     1e+04 MB/s. To checkpoint the database, will now pause the workload with 8673 MB unflushed.
13:46:15 INFO: Checkpoint started
13:46:16 INFO: Checkpoint finished, log reused
14:46:19 INFO: Write load very high relative to disk write throughput.  Flushing at   3.6e+04 MB/s while application is making dirty pages at   3.6e+04 MB/s. To checkpoint the database, will now pause the workload with 30873 MB unflushed.
14:47:23 ERROR: Memory low! Using memory reserve to terminate current activities properly
14:47:23 ERROR: Current location of the program break 123369119744
14:47:23 ERROR: Current location of the program break 123369119744
14:47:23 INFO: ./virtuoso-t() [0x8d4b6d]
14:47:23 INFO: ./virtuoso-t() [0x8d4be8]
14:47:23 INFO: ./virtuoso-t() [0x8dc135]
14:47:23 INFO: ./virtuoso-t() [0x8c1776]
14:47:23 INFO: ./virtuoso-t() [0x8c2a7e]
14:47:23 INFO: ./virtuoso-t() [0x492706]
14:47:23 INFO: ./virtuoso-t() [0x493319]
14:47:23 INFO: ./virtuoso-t() [0x515c54]
14:47:23 INFO: ./virtuoso-t(setp_node_input+0x13) [0x5162b3]
14:47:23 INFO: ./virtuoso-t() [0x5c5823]
14:47:23 INFO: ./virtuoso-t() [0x5c5823]
14:47:23 INFO: ./virtuoso-t() [0x5c5baf]
14:47:23 INFO: ./virtuoso-t() [0x5cc559]
14:47:23 INFO: ./virtuoso-t() [0x5987da]
14:47:23 INFO: ./virtuoso-t() [0x5cf75a]
14:47:23 INFO: ./virtuoso-t() [0x5d907f]
14:47:23 INFO: ./virtuoso-t() [0x5d92f9]
14:47:23 INFO: ./virtuoso-t() [0x8d8f80]
14:47:23 INFO: ./virtuoso-t() [0x8df733]
14:47:23 INFO: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fd6aad3a6db]
14:47:23 INFO: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fd6aa4c188f]
14:47:23 ERROR: GPF: Dkernel.c:5674 Out of memory
GPF: Dkernel.c:5674 Out of memory
Segmentation fault (core dumped)
HughWilliams commented 5 years ago

@szarnyasg: your binary version (07.20.3217-pthreads for Linux as of Dec 15 2017) is old. There was a new 7.2.5.1 (07.20.3229) git stable/7 release made in August 2018 with many memory and resource consumption improvements; see the release notes. Thus I would recommend upgrading to this latest binary version as a starting point.
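A quick way to confirm which version the running server actually reports (a sketch, reusing the isql connection from the scripts above; the status() report includes the server banner and version among other details):

echo "status();" | ./isql -S $PORT | head -n 20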

szarnyasg commented 5 years ago

@HughWilliams thanks for the quick reply. I bumped the version to 7.20.3229 and re-ran it on the 240 GB machine with the following configuration:

NumberOfBuffers           = 15000000
MaxDirtyBuffers           = 11500000

It crashed after exporting 83 661 586 024 bytes of edges, which is pretty much the same point as before. The log follows:

                Fri Nov 23 2018                                                                                                                                                                                    
19:11:04 INFO: { Loading plugin 1: Type `plain', file `wikiv' in `../lib/virtuoso/hosting'                                                                                                                         
19:11:04 ERROR:   FAILED  plugin 1: Unable to locate file }                                                                                                                                                        
19:11:04 INFO: { Loading plugin 2: Type `plain', file `mediawiki' in `../lib/virtuoso/hosting'                                                                                                                     
19:11:04 ERROR:   FAILED  plugin 2: Unable to locate file }                                                                                                                                                        
19:11:04 INFO: { Loading plugin 3: Type `plain', file `creolewiki' in `../lib/virtuoso/hosting'                                                                                                                    
19:11:04 ERROR:   FAILED  plugin 3: Unable to locate file }                                                                                                                                                        
19:11:04 INFO: OpenLink Virtuoso Universal Server                                                                                                                                                                  
19:11:04 INFO: Version 07.20.3229-pthreads for Linux as of Aug 15 2018                                                                                                                                             
19:11:04 INFO: uses parts of OpenSSL, PCRE, Html Tidy                                                                                                                                                              
19:11:40 INFO: Database version 3126                                                                                                                                                                               
19:11:40 INFO: SQL Optimizer enabled (max 1000 layouts)                                                                                                                                                            
19:11:41 INFO: Compiler unit is timed at 0.000157 msec                                                                                                                                                             
19:11:59 INFO: Roll forward started                                                                                                                                                                                
19:11:59 INFO: Roll forward complete                                                                                                                                                                               
19:12:02 INFO: PL LOG: Can't get list of vad packages in ../share/virtuoso/vad/                                          
19:12:03 INFO: Checkpoint started
19:12:04 INFO: Checkpoint finished, log reused
19:12:04 INFO: HTTP/WebDAV server online at 8887                                          
19:12:04 INFO: Server online at 1109 (pid 16378)           
20:12:06 INFO: Write load very high relative to disk write throughput.  Flushing at   6.4e+03 MB/s while application is making dirty pages at   6.4e+03 MB/s. To checkpoint the database, will now pause the workload with 5394 MB unflushed.
20:12:10 INFO: Checkpoint started                                                              
20:12:11 INFO: Checkpoint finished, log reused             
21:12:13 INFO: Write load very high relative to disk write throughput.  Flushing at   6.4e+03 MB/s while application is making dirty pages at   6.4e+03 MB/s. To checkpoint the database, will now pause the workload with 5672 MB unflushed.
21:12:18 INFO: Checkpoint started                    
21:12:19 INFO: Checkpoint finished, log reused
22:12:22 INFO: Write load very high relative to disk write throughput.  Flushing at   1.6e+04 MB/s while application is making dirty pages at   1.6e+04 MB/s. To checkpoint the database, will now pause the workload with 14181 MB unflushed.
22:12:33 INFO: Checkpoint started  
22:12:34 INFO: Checkpoint finished, log reused                                                                                                                                                                    
23:12:37 INFO: Write load very high relative to disk write throughput.  Flushing at   9.8e+03 MB/s while application is making dirty pages at   9.8e+03 MB/s. To checkpoint the database, will now pause the workload with 8868 MB unflushed.
23:12:44 INFO: Checkpoint started                                                      
23:13:02 INFO: Checkpoint finished, log reused                        

                Sat Nov 24 2018
00:10:20 ERROR: Memory low! Using memory reserve to terminate current activities properly
00:10:20 ERROR: Current location of the program break 123371114496
00:10:20 ERROR: Current location of the program break 123371114496
00:10:20 INFO: ./virtuoso-t() [0x90f94a]
00:10:20 INFO: ./virtuoso-t() [0x90f9d5]
00:10:20 INFO: ./virtuoso-t() [0x9101f1]
00:10:20 INFO: ./virtuoso-t() [0x8fc176]
00:10:20 INFO: ./virtuoso-t() [0x8fe7a5]
00:10:20 INFO: ./virtuoso-t() [0x4d640c]
00:10:20 INFO: ./virtuoso-t() [0x4d91ef]
00:10:20 INFO: ./virtuoso-t() [0x55924c]
00:10:20 INFO: ./virtuoso-t(setp_node_input+0x1b) [0x5594eb]
00:10:20 INFO: ./virtuoso-t() [0x6087de]
00:10:20 INFO: ./virtuoso-t() [0x6087de]
00:10:20 INFO: ./virtuoso-t() [0x609072]
00:10:20 INFO: ./virtuoso-t(table_source_input+0x1cd) [0x60f8ed]
00:10:20 INFO: ./virtuoso-t() [0x5d8a57]
00:10:20 INFO: ./virtuoso-t() [0x60ba00]
00:10:20 INFO: ./virtuoso-t() [0x613b1f]
00:10:20 INFO: ./virtuoso-t(sf_sql_fetch_w+0x6f) [0x613daf]
00:10:20 INFO: ./virtuoso-t() [0x9164ec]
00:10:20 INFO: ./virtuoso-t() [0x91a433]
00:10:20 INFO: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fe27dccd6db]
00:10:20 INFO: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fe27d45488f]
00:10:20 ERROR: GPF: Dkernel.c:5684 Out of memory
GPF: Dkernel.c:5684 Out of memory
Segmentation fault (core dumped)

FWIW, these are the line counts of the outputs:

$ wc -l
767 999 498 biobench-edges.csv
290 289 525 biobench-nodes.csv
szarnyasg commented 5 years ago

Okay, I had a thought: I dropped DISTINCT, and this time the export finished without issues in about 3 hours. So it seems the DB crashes when trying to produce the unique result set.

The resulting file is 86 664 284 899 bytes, which is about 3 GB more than what the previous failed attempts produced. It has 795 722 454 lines, 28 million more than the crashed attempts. The exported tuples are already unique, as confirmed with the sort -u Unix utility (the sort itself was very costly, as it had to use the swap file; it ran out of space and crashed on my first attempt, and only finished once it got more swap space).
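For the record, the dedup step can also be done entirely with an external sort on a scratch disk instead of swap. A sketch with GNU coreutils sort, where the input/output names, memory cap, scratch path, and thread count are illustrative:

# Byte-wise comparisons (LC_ALL=C) are faster; -S caps the in-RAM buffer,
# -T sends the spill files to a volume with enough free space.
LC_ALL=C sort -u -S 100G -T /mnt/scratch --parallel=8 biobench-edges-raw.csv > biobench-edges.csv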