Problem loading wikidata: java.lang.OutOfMemoryError: Requested array size exceeds VM limit #194

cbuil commented 1 year ago

Dear all,

I am trying to load Wikidata truthy but I'm getting the exception java.lang.OutOfMemoryError: Requested array size exceeds VM limit (more details below).

I run the script ./bin/rdf2hdt.sh with -Xmx500G on a server with 136GB of RAM, with swap on a 500GB SSD. Any idea how to fix this error?


ate47 commented 1 year ago

First, just to be sure: to use -Xmx500G, you need to set it in the javaenv.sh file, not pass it as an rdf2hdt.sh parameter.

That said, from this paper [1] you can see the speed of the default HDT generation algorithm with 128GB of RAM:


You can see it performs really badly with a large number of triples. For now, no algorithm for creating an HDT with a large amount of memory is implemented in RDF-HDT Java, but since you have a 500GB SSD, you may want to use a disk-based version. I wrote a wiki page a while ago on how to do it here, if you want.


Create a file named option.hdtspec with the following content:


You can set loader.cattree.loadertype=memory instead of disk, but I've never tried it with that much memory.

Then run this command with the CLI:

rdf2hdt.sh -multithread -config option.hdtspec latest-all.nt.gz wikidata.hdt

"latest-all.nt.gz" is the name of your dataset. For the truthy statements 500GB is fine, but I think you need around 600GB for all the statements.

[1] Diefenbach, D., & Giménez-García, J. M. (2020). HDTCat: let’s make HDT generation scale. In The Semantic Web–ISWC 2020: 19th International Semantic Web Conference, Athens, Greece, November 2–6, 2020, Proceedings, Part II 19 (pp. 18-33). Springer International Publishing.

D063520 commented 1 year ago


You can also find an already compressed file here:


If you need a SPARQL endpoint over Wikidata that uses HDT you can check out this:


Salut D063520

cbuil commented 1 year ago

Thanks a lot, just trying your solution right now.

cbuil commented 1 year ago


I just loaded my Wikidata file. However, I got a Maven error, which I guess is not important, but I would like to double-check that no data was lost. The process output follows.


File converted in ..... 4 hour 46 min 38 sec 965 ms 749 us
Total Triples ......... 1253567798
Different subjects .... 92498623
Different predicates .. 8604
Different objects ..... 305877616
Common Subject/Object . 33580954
HDT saved to file in .. 21 sec 352 ms 372 us
[WARNING] thread Thread[ForkJoinPool.commonPool-worker-49,5,org.rdfhdt.hdt.tools.RDF2HDT] was interrupted but is still alive after waiting at least 15000msecs
[WARNING] thread Thread[ForkJoinPool.commonPool-worker-49,5,org.rdfhdt.hdt.tools.RDF2HDT] will linger despite being asked to die via interruption
[WARNING] NOTE: 1 thread(s) did not finish despite being asked to via interruption. This is not a problem with exec:java, it is a problem with the running code . Although not serious, it should be remedied.
[WARNING] Couldn't destroy threadgroup org.codehaus.mojo.exec.ExecJavaMojo$IsolatedThreadGroup[name=org.rdfhdt.hdt.tools.RDF2HDT,maxpri=10]

ate47 commented 1 year ago

You can check the HDT structure using hdtVerify.sh -unicode <hdt-file> (add -progress -color if you want a progress bar). There is no tool to check the dataset itself, though, so if you really need that, you will have to write a program to do it.
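
One simple check of this kind is to count the statements in the source file and compare the result with the "Total Triples" value printed by rdf2hdt.sh. The following is a minimal Python sketch, not part of the HDT tools. The function names open_ntriples and count_statements are hypothetical, and it assumes the input is N-Triples with one statement per line, optionally compressed with gzip or bzip2.

```python
import bz2
import gzip


def open_ntriples(path):
    """Open a plain, .gz or .bz2 N-Triples file as a binary stream."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    return open(path, "rb")


def count_statements(path):
    """Count the lines that are neither blank nor comments.

    Assumption: the file is N-Triples, so each such line is one statement.
    """
    count = 0
    with open_ntriples(path) as stream:
        for line in stream:
            stripped = line.strip()
            if stripped and not stripped.startswith(b"#"):
                count += 1
    return count
```

Calling count_statements("truthy_direct_properties.nt.bz2") gives a number to compare with the triple count of the HDT. A source count higher than the HDT count is not necessarily data loss: as far as I know, an HDT stores each distinct triple once, so duplicate statements in the input would explain the gap.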

Looking at your stats, I'm assuming you're using the dataset from WDBench (only the Wikidata direct properties), because I have the same numbers in my HDT version.

(You can run head -n 20 <your-hdt.hdt> to get the header of the hdt with the stats)

My header (yours can differ by a few bytes due to the base URI / date):

> head -n 20 wdb.hdt
$HDT☺<http://purl.org/HDT/hdt#HDTv1>v5$HDT☻ntripleslength=1869;S}<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/HDT/hdt#Dataset> .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/ns/void#Dataset> .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://rdfs.org/ns/void#triples> "1253567798" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://rdfs.org/ns/void#properties> "8604" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://rdfs.org/ns/void#distinctSubjects> "92498623" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://rdfs.org/ns/void#distinctObjects> "305877616" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://purl.org/HDT/hdt#formatInformation> "_:format" .
_:format <http://purl.org/HDT/hdt#dictionary> "_:dictionary" .
_:format <http://purl.org/HDT/hdt#triples> "_:triples" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://purl.org/HDT/hdt#statisticalInformation> "_:statistics" .
<file:///N:/WDBench/./truthy_direct_properties.nt.bz2> <http://purl.org/HDT/hdt#publicationInformation> "_:publicationInformation" .
_:dictionary <http://purl.org/dc/terms/format> <http://purl.org/HDT/hdt#dictionaryFour> .
_:dictionary <http://purl.org/HDT/hdt#dictionarynumSharedSubjectObject> "33580954" .
_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "364803889" .
_:triples <http://purl.org/dc/terms/format> <http://purl.org/HDT/hdt#triplesBitmap> .
_:triples <http://purl.org/HDT/hdt#triplesnumTriples> "1253567798" .
_:triples <http://purl.org/HDT/hdt#triplesOrder> "SPO" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "5379075081" .
_:publicationInformation <http://purl.org/dc/terms/issued> "2023-01-25T21:04Z" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "156232317951" .
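
To compare two headers without reading them by eye, a small script can pull out the counts. The following is a minimal Python sketch, not part of the HDT tools. The function names read_header_stats and diff_stats and the 64 KiB read window are assumptions made for illustration, and it relies on the header being plain N-Triples text near the start of the file, as in the dump above.

```python
import re

# Predicates as they appear in the header dump above, mapped to short labels.
STAT_PREDICATES = {
    "triples": "http://rdfs.org/ns/void#triples",
    "properties": "http://rdfs.org/ns/void#properties",
    "distinctSubjects": "http://rdfs.org/ns/void#distinctSubjects",
    "distinctObjects": "http://rdfs.org/ns/void#distinctObjects",
    "sharedSubjectObject": "http://purl.org/HDT/hdt#dictionarynumSharedSubjectObject",
}


def read_header_stats(path, max_bytes=64 * 1024):
    """Return the integer statistics found in the first max_bytes of an HDT file."""
    with open(path, "rb") as f:
        # The header text is surrounded by binary control bytes, so decode
        # leniently instead of failing on bytes that are not valid UTF-8.
        text = f.read(max_bytes).decode("utf-8", errors="replace")
    stats = {}
    for label, predicate in STAT_PREDICATES.items():
        match = re.search(r"<" + re.escape(predicate) + r'>\s+"(\d+)"', text)
        if match:
            stats[label] = int(match.group(1))
    return stats


def diff_stats(a, b):
    """Return {label: (a_value, b_value)} for every statistic that differs."""
    labels = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in labels if a.get(k) != b.get(k)}
```

For example, diff_stats(read_header_stats("wikidata.hdt"), read_header_stats("wdb.hdt")) returns an empty dict when all the counts found in both headers match.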

And if you want to compare the sizes:

> ls .\wdb.hdt

        Directory: N:\qendpoint-store\hdt-store

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a---        25/01/2023     22:07    13142391087 wdb.hdt

cbuil commented 1 year ago

Thanks a lot, it worked flawlessly. And yes, I am using the WDBench Wikidata file.
