Open · balhoff opened this issue 4 months ago
Most of the memory implementations are old and not really reliable for large datasets (at least 1B triples). I suggest using only the disk implementation for this kind of workload.
To enable disk indexing, you can use these configs:
# use disk implementation
bitmaptriples.indexmethod=disk
# directory to compute the index
bitmaptriples.sequence.disk.location=disk-work-dir
# use disk locations and indexes
bitmaptriples.sequence.disk=true
bitmaptriples.sequence.disk.subindex=true
It can be done with the -config or -options params.
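For example, a minimal sketch of the -config route, assuming the four options above are saved to a properties file and that qepSearch.sh takes the HDT file as its last argument (the filename disk-index.properties and the argument order are assumptions, not confirmed in this thread):
# disk-index.properties holds the four bitmaptriples.* options shown above
./qepSearch.sh -config disk-index.properties mytriples.hdt
The -options param should accept the same key/value pairs inline instead of from a file; the exact separator syntax may vary between versions, so only the file-based form is sketched here.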
@ate47 thank you! Your suggestion worked perfectly.
Part of the endpoint? (leave empty if you don't know)
Description of the issue
I'm trying to create an index for a huge HDT file (29,773,033,292 triples). I'm doing this by starting qepSearch.sh.
Expected behavior
I expect a file mytriples.hdt.index.v1-1 to be generated, and then to be able to search for triples.
Obtained behavior
After about 20 minutes, I get this output:
How to reproduce
Using JDK 17.0.2, export JAVA_OPTIONS="-Xmx500G -XX:+UseParallelGC". Then:
The file mytriples.hdt is 344 GB. I can provide it somehow if it is helpful.
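Roughly, the setup looks like the sketch below; the exact qepSearch.sh invocation was omitted above, so the assumption here is that the tool is simply pointed at the HDT file and builds the .index.v1-1 file on first load:
# assumed reproduction sketch: memory settings from above, then start the search tool on the HDT file
export JAVA_OPTIONS="-Xmx500G -XX:+UseParallelGC"
./qepSearch.sh mytriples.hdt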
Endpoint version
1.16.1
Do I want to contribute to fix it?
Maybe
Something else?
No response