szarnyasg opened this issue 5 years ago
@szarnyasg: The version of your binary (07.20.3217-pthreads for Linux as of Dec 15 2017) is old ... there was a new 7.2.5.1 (07.20.3229) git stable/7 release made in August 2018, with many memory and resource consumption improvements; see the release notes. Thus I would recommend upgrading to this latest binary version as a starting point ...
@HughWilliams thanks for the quick reply. I bumped the version to 7.20.3229
and re-ran it on the 240 GB machine with the following configuration:
NumberOfBuffers = 15000000
MaxDirtyBuffers = 11500000
It crashed after exporting 83 661 586 024 bytes of edges, which is pretty much the same as before. The log follows below:
Fri Nov 23 2018
19:11:04 INFO: { Loading plugin 1: Type `plain', file `wikiv' in `../lib/virtuoso/hosting'
19:11:04 ERROR: FAILED plugin 1: Unable to locate file }
19:11:04 INFO: { Loading plugin 2: Type `plain', file `mediawiki' in `../lib/virtuoso/hosting'
19:11:04 ERROR: FAILED plugin 2: Unable to locate file }
19:11:04 INFO: { Loading plugin 3: Type `plain', file `creolewiki' in `../lib/virtuoso/hosting'
19:11:04 ERROR: FAILED plugin 3: Unable to locate file }
19:11:04 INFO: OpenLink Virtuoso Universal Server
19:11:04 INFO: Version 07.20.3229-pthreads for Linux as of Aug 15 2018
19:11:04 INFO: uses parts of OpenSSL, PCRE, Html Tidy
19:11:40 INFO: Database version 3126
19:11:40 INFO: SQL Optimizer enabled (max 1000 layouts)
19:11:41 INFO: Compiler unit is timed at 0.000157 msec
19:11:59 INFO: Roll forward started
19:11:59 INFO: Roll forward complete
19:12:02 INFO: PL LOG: Can't get list of vad packages in ../share/virtuoso/vad/
19:12:03 INFO: Checkpoint started
19:12:04 INFO: Checkpoint finished, log reused
19:12:04 INFO: HTTP/WebDAV server online at 8887
19:12:04 INFO: Server online at 1109 (pid 16378)
20:12:06 INFO: Write load very high relative to disk write throughput. Flushing at 6.4e+03 MB/s while application is making dirty pages at 6.4e+03 MB/s. To checkpoint the database, will now pause the workload with 5394 MB unflushed.
20:12:10 INFO: Checkpoint started
20:12:11 INFO: Checkpoint finished, log reused
21:12:13 INFO: Write load very high relative to disk write throughput. Flushing at 6.4e+03 MB/s while application is making dirty pages at 6.4e+03 MB/s. To checkpoint the database, will now pause the workload with 5672 MB unflushed.
21:12:18 INFO: Checkpoint started
21:12:19 INFO: Checkpoint finished, log reused
22:12:22 INFO: Write load very high relative to disk write throughput. Flushing at 1.6e+04 MB/s while application is making dirty pages at 1.6e+04 MB/s. To checkpoint the database, will now pause the workload with 14181 MB unflushed.
22:12:33 INFO: Checkpoint started
22:12:34 INFO: Checkpoint finished, log reused
23:12:37 INFO: Write load very high relative to disk write throughput. Flushing at 9.8e+03 MB/s while application is making dirty pages at 9.8e+03 MB/s. To checkpoint the database, will now pause the workload with 8868 MB unflushed.
23:12:44 INFO: Checkpoint started
23:13:02 INFO: Checkpoint finished, log reused
Sat Nov 24 2018
00:10:20 ERROR: Memory low! Using memory reserve to terminate current activities properly
00:10:20 ERROR: Current location of the program break 123371114496
00:10:20 ERROR: Current location of the program break 123371114496
00:10:20 INFO: ./virtuoso-t() [0x90f94a]
00:10:20 INFO: ./virtuoso-t() [0x90f9d5]
00:10:20 INFO: ./virtuoso-t() [0x9101f1]
00:10:20 INFO: ./virtuoso-t() [0x8fc176]
00:10:20 INFO: ./virtuoso-t() [0x8fe7a5]
00:10:20 INFO: ./virtuoso-t() [0x4d640c]
00:10:20 INFO: ./virtuoso-t() [0x4d91ef]
00:10:20 INFO: ./virtuoso-t() [0x55924c]
00:10:20 INFO: ./virtuoso-t(setp_node_input+0x1b) [0x5594eb]
00:10:20 INFO: ./virtuoso-t() [0x6087de]
00:10:20 INFO: ./virtuoso-t() [0x6087de]
00:10:20 INFO: ./virtuoso-t() [0x609072]
00:10:20 INFO: ./virtuoso-t(table_source_input+0x1cd) [0x60f8ed]
00:10:20 INFO: ./virtuoso-t() [0x5d8a57]
00:10:20 INFO: ./virtuoso-t() [0x60ba00]
00:10:20 INFO: ./virtuoso-t() [0x613b1f]
00:10:20 INFO: ./virtuoso-t(sf_sql_fetch_w+0x6f) [0x613daf]
00:10:20 INFO: ./virtuoso-t() [0x9164ec]
00:10:20 INFO: ./virtuoso-t() [0x91a433]
00:10:20 INFO: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fe27dccd6db]
00:10:20 INFO: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fe27d45488f]
00:10:20 ERROR: GPF: Dkernel.c:5684 Out of memory
GPF: Dkernel.c:5684 Out of memory
Segmentation fault (core dumped)
FWIW, these are the line counts of the outputs:
$ wc -l
767 999 498 biobench-edges.csv
290 289 525 biobench-nodes.csv
Okay, I had a thought: I dropped DISTINCT, and this time the export finished without issues in about 3 hours. So it seems the DB crashes when trying to produce unique elements. The resulting file is 86 664 284 899 bytes, which is about 3 GB more than what the previous failed attempts produced. It has 795 722 454 lines, 28 million more than the crashed attempts.
The exported tuples are already unique, as confirmed with the sort -u Unix utility (the sort process itself was very costly: it had to use swap space, it ran out of space and crashed on my first attempt, and it only finished once it got more swap space).
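The check itself is reproducible as a one-liner; a sketch, assuming a scratch volume with enough temporary space (the path is hypothetical, the file name is from the wc output above):

```bash
# Verify that the export contains no duplicate lines: the unique line
# count should match the raw line count from `wc -l` above.
# sort needs temp space on the order of the input size, so point -T at
# a volume with enough room (hypothetical path).
sort -u -T /mnt/scratch biobench-edges.csv | wc -l
```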
I am working on graph database benchmarking and graph analytics as a member of the Linked Data Benchmark Council (cc @mirkospasic, who has also been a member), and I also collaborate with @saleem-muhammad of the AKSW group.
I tried to export all nodes and edges for graph analysis from benchmark datasets that are available as Virtuoso instances. I used the isql interface through the following Bash script:
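A minimal sketch of such an isql-driven export (not the original script: the port number is taken from the log above, while the credentials, query files, and output names are hypothetical):

```bash
#!/bin/bash
# Sketch only -- not the original script from this issue.
# Assumes a Virtuoso server on port 1109 (as in the log above) with
# default dba/dba credentials; file names are hypothetical.
PORT=1109

run_query() {
    local query_file="$1" out_file="$2"
    # isql's exec= argument runs the given statement; the leading SPARQL
    # keyword switches from SQL to SPARQL. The output is isql's default
    # result table, so a real export needs further cleanup to get CSV.
    isql "$PORT" dba dba exec="SPARQL $(cat "$query_file");" > "$out_file"
}

run_query nodes.sparql biobench-nodes.csv
run_query edges.sparql biobench-edges.csv
```

Here are the SPARQL queries, formatted for readability: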
Query for extracting nodes:
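The query is not quoted in this thread; a hypothetical reconstruction of a node-extraction query of this general shape:

```sparql
# Hypothetical sketch: collect every IRI that occurs as the subject or
# the object of some triple, i.e. the vertices of the graph.
SELECT DISTINCT ?node
WHERE {
  { ?node ?p ?o }
  UNION
  { ?s ?p ?node . FILTER(isIRI(?node)) }
}
```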
Query for extracting edges:
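Again a hypothetical reconstruction, not the original query:

```sparql
# Hypothetical sketch: every triple whose object is an IRI, i.e. the
# edges of the graph. The DISTINCT here is the clause whose removal,
# per the comment above, let the export finish.
SELECT DISTINCT ?s ?p ?o
WHERE {
  ?s ?p ?o .
  FILTER(isIRI(?o))
}
```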
I ran into out-of-memory issues when exporting edges for the BioBench dataset, which consists of approx 1.45B triples. I used two cloud VMs for my experiments:
In both cases, I increased the number of buffers in virtuoso.ini by scaling up the recommended values to the available system memory.

For 240 GB system RAM:
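A sketch of where these settings live, using the 240 GB values quoted earlier in this thread (both keys sit in the [Parameters] section; everything else in virtuoso.ini is omitted):

```ini
[Parameters]
; Scaled up from the stock recommendations to match 240 GB of system RAM;
; values as quoted earlier in this thread.
NumberOfBuffers = 15000000
MaxDirtyBuffers = 11500000
```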
For 750 GB system RAM:
Are there any ways to get these queries working? Alternatively, if I could dump the entire database to an nt file, that'd work perfectly well for my use case: I could simply pipe the file through grep and get the desired triples. The log for the crash is below: