marklogic / marklogic-rdf4j

Support for MarkLogic exposed in the rdf4j idiom.

Adding large trig files via connection.add causes errors #10

Closed elky87 closed 7 years ago

elky87 commented 7 years ago

Hi, when I try to add a large TriG file (163 MB) to a MarkLogic repository, I get an exception saying that the MarkLogic server cannot parse the file.

The method I use:

connection.add(myLargeTrigFile, null, RDFFormat.TRIG);


Exception:

[ERROR] MarkLogicClientImpl Local message: failed to apply resource at graphs: Bad Request.
Server Message: RESTAPI-INVALIDCONTENT: (err:FOER0000) Invalid content: XDMP-DOCUNEXPECTED:
    xdmp:turtle($in, map:get($om, "passthru")) -- memory exhausted
    at /MarkLogic/semantics.xqy:4091:167
    in sem:rdf-parse(document{text{" <https://test-remotestore.semantic-web.at/meshdoublesize..."}}, "trig")
    in semmod:extract-triples-from-body(...) at /MarkLogic/rest-api/models/semantics-model.xqy:468:13
    in semmod:graph-insert(..., eput:config-callback#2, fn:true()) at /MarkLogic/rest-api/models/semantics-model.xqy:731:54
    in semmod:graph-insert(..., eput:config-callback#2) at /MarkLogic/rest-api/models/semantics-model.xqy:623:4
    at /MarkLogic/rest-api/endpoints/graphstore-update.xqy:44:9

2017-08-02 10:58:05 [FATAL] SnapshotService Restore of snapshot 'Marklogic_-_MeSH_double_size~1DF1729E-F1B0-0001-5F47-14601A306B20~20170801184949324~system' FAILED! Rolling back transaction.
org.eclipse.rdf4j.rio.RDFParseException: Request to MarkLogic server failed, check file and format.
    at com.marklogic.semantics.sesame.client.MarkLogicClientImpl.performAdd(MarkLogicClientImpl.java:295)
    at com.marklogic.semantics.sesame.client.MarkLogicClient.sendAdd(MarkLogicClient.java:303)
    at com.marklogic.semantics.sesame.MarkLogicRepositoryConnection.add(MarkLogicRepositoryConnection.java:959)


From the exception it looks like the server may be running out of memory, given the "-- memory exhausted at :4091:167" in the message.

But I also read here: https://docs.marklogic.com/guide/ingestion/formats#id_33599 that there are some file size limitations for text files, XML files, etc.

Are there any file size limitations for this method? Or is this maybe just a configuration issue on our side?
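(As a stopgap while this is investigated, one client-side guard is to check the payload size before handing it to connection.add. This is a minimal sketch using only the Java standard library; the PreflightCheck class name and the 100 MB threshold are hypothetical, since MarkLogic does not document a hard limit for TriG payloads.)

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PreflightCheck {
    // Hypothetical warning threshold (100 MB). MarkLogic documents no hard
    // limit for TriG, but server-side parsing holds the whole document in
    // memory, so very large single payloads are risky.
    static final long WARN_BYTES = 100L * 1024 * 1024;

    // Returns true if the file is large enough that a single
    // connection.add() call may strain server memory.
    static boolean isLargeForSingleAdd(Path file) throws IOException {
        return Files.size(file) > WARN_BYTES;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("sample", ".trig");
        Files.writeString(f, "<urn:s> <urn:p> <urn:o> .\n");
        System.out.println(isLargeForSingleAdd(f)); // tiny temp file -> false
    }
}
```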

grechaw commented 7 years ago

Hi @elky87 , @akshaysonvane and I talked over this issue earlier today. He added a test to verify loading a TriG file that is about 275 MB. It takes a while to run, but it does pass. I know that the server-side parsing of TriG (how do those capitals go, I wonder?) is probably very memory-intensive, as the server has to read the whole file in order to parse it (it can't stream it in). I'd probably look for workarounds such as using N-Quads.
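(One reason N-Quads is a friendlier format here: it is one statement per line, so a client can split a large file into batches and send each batch as its own connection.add() call instead of one huge request. A minimal standard-library sketch of that idea, with no rdf4j dependency; the NQuadsBatcher name and the batch size are illustrative, not part of the MarkLogic API.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class NQuadsBatcher {
    // Split a line-oriented N-Quads file into batches of at most batchSize
    // statements. Each batch could then be uploaded separately, keeping any
    // single request small. (TriG can't be split this way, since graph
    // blocks span multiple lines.)
    static List<List<String>> batches(Path nquads, int batchSize) throws IOException {
        List<List<String>> out = new ArrayList<>();
        List<String> current = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(nquads)) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.isBlank()) continue; // skip blank lines
                current.add(line);
                if (current.size() == batchSize) {
                    out.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) out.add(current);
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("sample", ".nq");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++)
            sb.append("<urn:s").append(i).append("> <urn:p> <urn:o> <urn:g> .\n");
        Files.writeString(f, sb.toString());
        System.out.println(batches(f, 4).size()); // 10 quads in batches of 4 -> prints 3
    }
}
```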

We do want to understand the limitations of ingestion though, so the story on this issue is not over.

We will try to reproduce -- it looks like it's related to your available memory, not to file size limitations. But it could also be that this limitation is a bug, so we'll investigate further.

elky87 commented 7 years ago

If you want I can share with you the file that causes the issue on our side. Maybe the file has some internals that cause the memory consumption to blow up. To whom should I send the file, to @grechaw or @akshaysonvane ?

(According to the standard it's TriG, but I have no idea what it stands for :D triples graph, maybe? Who knows.)

grechaw commented 7 years ago

If you put it on swc's wiki, I think @akshaysonvane can get access and download it. Perhaps it is already there.

elky87 commented 7 years ago

I put it on the wiki. You'll find it on the page "rdf4j integration", and the file is linked under "causing file".

grechaw commented 7 years ago

Thanks @elky87

grechaw commented 7 years ago

From an internal comment --

Akshay Sonvane (asonvane) added a comment - 10 minutes ago

Tested with the new file (~171 MB); it gets ingested without any hiccups. The problem must be with the available system memory. Will run the test in a VM with less memory.

grechaw commented 7 years ago

Hi @elky87 , we're still not able to reproduce this particular issue. I'm not sure whether something has been fixed in the versions we're working with (doesn't seem likely). What version of the server are you using here?

elky87 commented 7 years ago

We are using version 8.0-6.4 Enterprise Edition

grechaw commented 7 years ago

Thanks, we'll check. I think that the memory profile may have improved in the versions we've been using.

grechaw commented 7 years ago

Hi @elky87 , @akshaysonvane was able to reproduce with 8.0-6.4. So a fix I made for 9.0-2 and 8.0-7 happens to fix this issue as well. 8.0-7 is in the stabilization phase and will ship within the next couple of weeks.

I'll leave this issue open until you can confirm. Note that I think you'll get better performance from N-Quads than from TriG.

elky87 commented 7 years ago

Hi @grechaw and @akshaysonvane, thanks for the thorough testing to reproduce this issue. We plan to upgrade to 9.0 soon; after we've done that I will test this again.

elky87 commented 7 years ago

Hi, it took us quite some time to make the switch to 9.0, but I just tested it and it works now. Thanks for the effort!