rdfhdt / hdt-java

HDT Java library and tools.

Problem loading wikidata: java.lang.OutOfMemoryError: Requested array size exceeds VM limit #194

Closed. cbuil closed this issue 1 year ago

cbuil commented 1 year ago

Dear all,

I am trying to load the Wikidata truthy dump, but I'm getting the exception java.lang.OutOfMemoryError: Requested array size exceeds VM limit (more details below).

I run the script ./bin/rdf2hdt.sh with -Xmx500G on a server with 136GB of RAM, plus swap on a 500GB SSD. Any idea how to fix this error?

Thanks

[WARNING] to BitmapTriples 99.9946
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf (Arrays.java:3745)
at java.io.ByteArrayOutputStream.grow (ByteArrayOutputStream.java:120)
at java.io.ByteArrayOutputStream.ensureCapacity (ByteArrayOutputStream.java:95)
at java.io.ByteArrayOutputStream.write (ByteArrayOutputStream.java:156)
at org.rdfhdt.hdt.util.string.ByteStringUtil.append (ByteStringUtil.java:369)
at org.rdfhdt.hdt.util.string.ByteStringUtil.append (ByteStringUtil.java:346)
at org.rdfhdt.hdt.dictionary.impl.section.PFCDictionarySection.load (PFCDictionarySection.java:124)
at org.rdfhdt.hdt.dictionary.impl.section.PFCDictionarySection.load (PFCDictionarySection.java:88)
at org.rdfhdt.hdt.dictionary.impl.FourSectionDictionary.load (FourSectionDictionary.java:86)
at org.rdfhdt.hdt.hdt.impl.HDTImpl.loadFromModifiableHDT (HDTImpl.java:360)
at org.rdfhdt.hdt.hdt.HDTManagerImpl.doGenerateHDT (HDTManagerImpl.java:173)
at org.rdfhdt.hdt.hdt.HDTManager.generateHDT (HDTManager.java:441)
at org.rdfhdt.hdt.tools.RDF2HDT.execute (RDF2HDT.java:242)
at org.rdfhdt.hdt.tools.RDF2HDT.main (RDF2HDT.java:344)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:566)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:829)

ate47 commented 1 year ago

First, to be sure: to use -Xmx500G, you need to put it in the javaenv.sh file, not pass it as a parameter to rdf2hdt.sh.

That said, from this paper [1] you can see the speed of the default HDT generation algorithm with 128GB of RAM:

[figure from [1]: benchmark of the default HDT generation algorithm with 128GB of RAM]

You can see it's really bad with a lot of triples. For now, no algorithm to create HDT with a lot of memory is implemented in RDF-HDT Java, but if you have a 500GB SSD, you may want to use a disk-based version; I wrote a wiki page a while ago on how to do it here if you want.

TL;DR:

Create a file named option.hdtspec with this inside:

loader.cattree.futureHDTLocation=cfuture.hdt
loader.cattree.loadertype=disk
loader.cattree.location=cattree
loader.cattree.memoryFaultFactor=1
loader.disk.futureHDTLocation=future.hdt
loader.disk.location=gen
loader.type=cat
parser.ntSimpleParser=true
loader.disk.compressWorker=3
profiler=true
profiler.output=prof.opt
loader.cattree.kcat=20
hdtcat.location=catgen
hdtcat.location.future=catgen.hdt

You can set loader.cattree.loadertype=memory instead of disk, but I've never tried it with that much memory.

and run this command with the CLI:

rdf2hdt.sh -multithread -config option.hdtspec latest-all.nt.gz wikidata.hdt

"latest-all.nt.gz" is the name of your dataset, for the truthy statements 500GB is fine, but I think you need around 600GB for all the statements.

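If you prefer to drive this from Java instead of the CLI, a minimal sketch along these lines should work (assuming the same option.hdtspec file; the file names and base URI are placeholders, and the exact HDTManager/HDTSpecification methods can differ slightly between versions):

import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;

public class GenerateWikidataHDT {
    public static void main(String[] args) throws Exception {
        // Load the same options used by "rdf2hdt.sh -config option.hdtspec"
        HDTSpecification spec = new HDTSpecification();
        spec.load("option.hdtspec");

        // Generate the HDT from the (gzipped) N-Triples dump
        try (HDT hdt = HDTManager.generateHDT("latest-all.nt.gz", "http://wikidata.org/",
                RDFNotation.NTRIPLES, spec, null)) {
            // Write the result, as the CLI tool does after generation
            hdt.saveToHDT("wikidata.hdt", null);
        }
    }
}
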
[1] Diefenbach, D., & Giménez-García, J. M. (2020). HDTCat: let’s make HDT generation scale. In The Semantic Web–ISWC 2020: 19th International Semantic Web Conference, Athens, Greece, November 2–6, 2020, Proceedings, Part II 19 (pp. 18-33). Springer International Publishing.

D063520 commented 1 year ago

Hi,

you can also find an already compressed file here:

https://qanswer-svc4.univ-st-etienne.fr

If you need a SPARQL endpoint over Wikidata that uses HDT, you can check out this:

https://hub.docker.com/r/qacompany/qendpoint-wikidata

Cheers, D063520

cbuil commented 1 year ago

Thanks a lot, just trying your solution right now.

cbuil commented 1 year ago

Dear,

I just loaded my Wikidata file; however, I got a Maven error, which I guess is not important, but I would like to double-check that no data was lost. The process output follows.

Thanks!

File converted in ..... 4 hour 46 min 38 sec 965 ms 749 us
Total Triples ......... 1253567798
Different subjects .... 92498623
Different predicates .. 8604
Different objects ..... 305877616
Common Subject/Object . 33580954
HDT saved to file in .. 21 sec 352 ms 372 us
[WARNING] thread Thread[ForkJoinPool.commonPool-worker-49,5,org.rdfhdt.hdt.tools.RDF2HDT] was interrupted but is still alive after waiting at least 15000msecs
[WARNING] thread Thread[ForkJoinPool.commonPool-worker-49,5,org.rdfhdt.hdt.tools.RDF2HDT] will linger despite being asked to die via interruption
[WARNING] NOTE: 1 thread(s) did not finish despite being asked to via interruption. This is not a problem with exec:java, it is a problem with the running code . Although not serious, it should be remedied.
[WARNING] Couldn't destroy threadgroup org.codehaus.mojo.exec.ExecJavaMojo$IsolatedThreadGroup[name=org.rdfhdt.hdt.tools.RDF2HDT,maxpri=10]
java.lang.IllegalThreadStateException
at java.lang.ThreadGroup.destroy (ThreadGroup.java:776)
at org.codehaus.mojo.exec.ExecJavaMojo.execute (ExecJavaMojo.java:321)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)

ate47 commented 1 year ago

You can check the HDT structure using hdtVerify.sh -unicode <hdt-file> (with -progress -color if you want a progress bar), but there is no built-in way to check the dataset itself, so if you really need that, you can write a small program to do it, as in the sketch below.
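
A minimal sketch of such a check (assuming the org.rdfhdt.hdt API of this library; the file name and the Q42 entity used for the spot check are just placeholder examples):

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleString;

public class SpotCheckHDT {
    public static void main(String[] args) throws Exception {
        // Map the HDT file (no full load into RAM) and build/load the search index
        try (HDT hdt = HDTManager.mapIndexedHDT("wikidata.hdt", null)) {
            System.out.println("Triples:    " + hdt.getTriples().getNumberOfElements());
            System.out.println("Subjects:   " + hdt.getDictionary().getNsubjects());
            System.out.println("Predicates: " + hdt.getDictionary().getNpredicates());
            System.out.println("Objects:    " + hdt.getDictionary().getNobjects());

            // Spot check: list the triples of an entity you know is in the source dump
            IteratorTripleString it = hdt.search("http://www.wikidata.org/entity/Q42", "", "");
            while (it.hasNext()) {
                System.out.println(it.next());
            }
        }
    }
}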

Looking at your stats, I'm assuming you're using the dataset from WDBench (only the WD direct properties), because I have the same numbers in my HDT version.

(You can run head -n 20 <your-hdt.hdt> to get the header of the HDT with the stats.)
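
The same header triples can also be read through the API instead of head, roughly like this (a sketch; the file name is a placeholder):

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleString;

public class DumpHDTHeader {
    public static void main(String[] args) throws Exception {
        // mapHDT is enough here: the header can be read without indexing the triples
        try (HDT hdt = HDTManager.mapHDT("wdb.hdt", null)) {
            IteratorTripleString header = hdt.getHeader().search("", "", "");
            while (header.hasNext()) {
                System.out.println(header.next());
            }
        }
    }
}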

My header (it can differ by a few bytes due to the base URI / date):

> head -n 20 wdb.hdt
$HDT☺<http://purl.org/HDT/hdt#HDTv1>v5$HDT☻ntripleslength=1869;S}<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/HDT/hdt#Dataset> .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/ns/void#Dataset> .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://rdfs.org/ns/void#triples> "1253567798" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://rdfs.org/ns/void#properties> "8604" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://rdfs.org/ns/void#distinctSubjects> "92498623" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://rdfs.org/ns/void#distinctObjects> "305877616" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://purl.org/HDT/hdt#formatInformation> "_:format" .
_:format <http://purl.org/HDT/hdt#dictionary> "_:dictionary" .
_:format <http://purl.org/HDT/hdt#triples> "_:triples" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://purl.org/HDT/hdt#statisticalInformation> "_:statistics" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://purl.org/HDT/hdt#publicationInformation> "_:publicationInformation" .
_:dictionary <http://purl.org/dc/terms/format> <http://purl.org/HDT/hdt#dictionaryFour> .
_:dictionary <http://purl.org/HDT/hdt#dictionarynumSharedSubjectObject> "33580954" .
_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "364803889" .
_:triples <http://purl.org/dc/terms/format> <http://purl.org/HDT/hdt#triplesBitmap> .
_:triples <http://purl.org/HDT/hdt#triplesnumTriples> "1253567798" .
_:triples <http://purl.org/HDT/hdt#triplesOrder> "SPO" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "5379075081" .
_:publicationInformation <http://purl.org/dc/terms/issued> "2023-01-25T21:04Z" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "156232317951" .

And if you want to compare the sizes:

> ls .\wdb.hdt

        Directory: N:\qendpoint-store\hdt-store

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a---        25/01/2023     22:07    13142391087 wdb.hdt

cbuil commented 1 year ago

Thanks a lot, it worked flawlessly. And yes, I am using the WDBench Wikidata file.

Best