Closed cbuil closed 1 year ago
First, to be sure: to use -Xmx500G, you need to put it in the javaenv.sh file, not in the rdf2hdt.sh parameters.
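For reference, a minimal sketch of what that edit could look like (assuming your checkout's bin/javaenv.sh reads a JAVA_OPTIONS variable before launching the JVM; the exact variable name may differ between versions, so check your copy of the script):

```shell
# In bin/javaenv.sh (variable name is an assumption; verify in your version):
JAVA_OPTIONS="-Xmx500G"
export JAVA_OPTIONS
```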
That said, from this paper [1] you can see the speed of the default HDT generation algorithm with 128GB of RAM:
You can see it's really bad with a lot of triples. For now, no algorithm to create HDT using a lot of memory is implemented in RDF-HDT Java, but if you have a 500GB SSD, you may want to use a disk-based version; I wrote a wiki page a while ago on how to do it here, if you want.
TL;DR:
Create a file named option.hdtspec with the following content:
loader.cattree.futureHDTLocation=cfuture.hdt
loader.cattree.loadertype=disk
loader.cattree.location=cattree
loader.cattree.memoryFaultFactor=1
loader.disk.futureHDTLocation=future.hdt
loader.disk.location=gen
loader.type=cat
parser.ntSimpleParser=true
loader.disk.compressWorker=3
profiler=true
profiler.output=prof.opt
loader.cattree.kcat=20
hdtcat.location=catgen
hdtcat.location.future=catgen.hdt
You can set loader.cattree.loadertype=memory instead of disk, but I've never tried it with that much memory.
Then run this command with the CLI:
rdf2hdt.sh -multithread -config option.hdtspec latest-all.nt.gz wikidata.hdt
"latest-all.nt.gz" is the name of your dataset, for the truthy statements 500GB is fine, but I think you need around 600GB for all the statements.
[1] Diefenbach, D., & Giménez-García, J. M. (2020). HDTCat: let’s make HDT generation scale. In The Semantic Web–ISWC 2020: 19th International Semantic Web Conference, Athens, Greece, November 2–6, 2020, Proceedings, Part II 19 (pp. 18-33). Springer International Publishing.
Hi,
you can also find an already compressed file here:
https://qanswer-svc4.univ-st-etienne.fr
If you need a SPARQL endpoint over Wikidata that uses HDT you can check out this:
https://hub.docker.com/r/qacompany/qendpoint-wikidata
Hi D063520,
Thanks a lot, just trying your solution right now.
Dear,
I just loaded my Wikidata file; however, I got a Maven error, which I guess is not important, but I would like to double-check that no data was lost. The following contains the process output.
Thanks!
File converted in ..... 4 hour 46 min 38 sec 965 ms 749 us
Total Triples ......... 1253567798
Different subjects .... 92498623
Different predicates .. 8604
Different objects ..... 305877616
Common Subject/Object . 33580954
HDT saved to file in .. 21 sec 352 ms 372 us
[WARNING] thread Thread[ForkJoinPool.commonPool-worker-49,5,org.rdfhdt.hdt.tools.RDF2HDT] was interrupted but is still alive after waiting at least 15000msecs
[WARNING] thread Thread[ForkJoinPool.commonPool-worker-49,5,org.rdfhdt.hdt.tools.RDF2HDT] will linger despite being asked to die via interruption
[WARNING] NOTE: 1 thread(s) did not finish despite being asked to via interruption. This is not a problem with exec:java, it is a problem with the running code
. Although not serious, it should be remedied.
[WARNING] Couldn't destroy threadgroup org.codehaus.mojo.exec.ExecJavaMojo$IsolatedThreadGroup[name=org.rdfhdt.hdt.tools.RDF2HDT,maxpri=10]
java.lang.IllegalThreadStateException
at java.lang.ThreadGroup.destroy (ThreadGroup.java:776)
at org.codehaus.mojo.exec.ExecJavaMojo.execute (ExecJavaMojo.java:321)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
You can check the HDT structure using hdtVerify.sh -unicode <hdt-file> (with -progress -color if you want a progress bar), but there is no built-in way to check the dataset itself, so if you really need that, you can write a program to do it.
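If you go that route, one rough sketch of such a check is to recount the triples in the source file and compare against the "Total Triples" figure rdf2hdt.sh prints (assuming a gzipped N-Triples input where every non-empty, non-comment line is exactly one triple):

```python
import gzip

def count_nt_triples(path):
    """Count triples in a gzipped N-Triples file.

    Assumes one triple per line; blank lines and comment
    lines starting with '#' are skipped.
    """
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count

# Compare the result against the "Total Triples" line printed by
# rdf2hdt.sh, or the void#triples value in the HDT header.
```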
Looking at your stats, I'm assuming you're using the dataset from WDBench (only the WD direct properties), because I have the same information in my HDT version. (You can run head -n 20 <your-hdt.hdt> to get the header of the HDT with the stats.)
My header (it can differ by a few bytes due to the base URI / date):
> head -n 20 wdb.hdt
$HDT☺<http://purl.org/HDT/hdt#HDTv1>v5$HDT☻ntripleslength=1869;S}<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/HDT/hdt#Dataset> .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/ns/void#Dataset> .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://rdfs.org/ns/void#triples> "1253567798" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://rdfs.org/ns/void#properties> "8604" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://rdfs.org/ns/void#distinctSubjects> "92498623" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://rdfs.org/ns/void#distinctObjects> "305877616" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://purl.org/HDT/hdt#formatInformation> "_:format" .
_:format <http://purl.org/HDT/hdt#dictionary> "_:dictionary" .
_:format <http://purl.org/HDT/hdt#triples> "_:triples" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://purl.org/HDT/hdt#statisticalInformation> "_:statistics" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://purl.org/HDT/hdt#publicationInformation> "_:publicationInformation" .
_:dictionary <http://purl.org/dc/terms/format> <http://purl.org/HDT/hdt#dictionaryFour> .
_:dictionary <http://purl.org/HDT/hdt#dictionarynumSharedSubjectObject> "33580954" .
_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "364803889" .
_:triples <http://purl.org/dc/terms/format> <http://purl.org/HDT/hdt#triplesBitmap> .
_:triples <http://purl.org/HDT/hdt#triplesnumTriples> "1253567798" .
_:triples <http://purl.org/HDT/hdt#triplesOrder> "SPO" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "5379075081" .
_:publicationInformation <http://purl.org/dc/terms/issued> "2023-01-25T21:04Z" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "156232317951" .
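As a side note, since the header is plain N-Triples, pulling those stats out can be scripted; here is a small regex-based sketch, assuming the predicate/value layout shown above (numeric values in quoted literals):

```python
import re

# Matches a predicate IRI followed by a quoted numeric literal,
# e.g.:  _:triples <http://purl.org/HDT/hdt#triplesnumTriples> "1253567798" .
STAT_LINE = re.compile(r'\S+\s+<([^>]+)>\s+"(\d+)"')

def header_stats(header_text):
    """Extract numeric statistics (void#triples, distinctSubjects, ...)
    from the N-Triples header printed by `head -n 20 file.hdt`."""
    stats = {}
    for match in STAT_LINE.finditer(header_text):
        predicate, value = match.groups()
        # Keep only the local name of the predicate as the key.
        key = predicate.rsplit("#", 1)[-1].rsplit("/", 1)[-1]
        stats[key] = int(value)
    return stats
```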
And if you want to compare the sizes:
> ls .\wdb.hdt
Directory: N:\qendpoint-store\hdt-store
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 25/01/2023 22:07 13142391087 wdb.hdt
Thanks a lot, it worked flawlessly. And yes, I am using the WDBench Wikidata file.
Best
Dear all,
I am trying to load Wikidata truthy but I'm getting the exception java.lang.OutOfMemoryError: Requested array size exceeds VM limit (more details below).
I ran the script ./bin/rdf2hdt.sh with -Xmx500G on a server with 136GB of RAM, using swap on a 500GB SSD. Any idea how to fix that error?
Thanks
[WARNING] to BitmapTriples 99.9946
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf (Arrays.java:3745)
at java.io.ByteArrayOutputStream.grow (ByteArrayOutputStream.java:120)
at java.io.ByteArrayOutputStream.ensureCapacity (ByteArrayOutputStream.java:95)
at java.io.ByteArrayOutputStream.write (ByteArrayOutputStream.java:156)
at org.rdfhdt.hdt.util.string.ByteStringUtil.append (ByteStringUtil.java:369)
at org.rdfhdt.hdt.util.string.ByteStringUtil.append (ByteStringUtil.java:346)
at org.rdfhdt.hdt.dictionary.impl.section.PFCDictionarySection.load (PFCDictionarySection.java:124)
at org.rdfhdt.hdt.dictionary.impl.section.PFCDictionarySection.load (PFCDictionarySection.java:88)
at org.rdfhdt.hdt.dictionary.impl.FourSectionDictionary.load (FourSectionDictionary.java:86)
at org.rdfhdt.hdt.hdt.impl.HDTImpl.loadFromModifiableHDT (HDTImpl.java:360)
at org.rdfhdt.hdt.hdt.HDTManagerImpl.doGenerateHDT (HDTManagerImpl.java:173)
at org.rdfhdt.hdt.hdt.HDTManager.generateHDT (HDTManager.java:441)
at org.rdfhdt.hdt.tools.RDF2HDT.execute (RDF2HDT.java:242)
at org.rdfhdt.hdt.tools.RDF2HDT.main (RDF2HDT.java:344)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:829)