mff-uk / odcs

ODCleanStore

Running big RDF file #112

Closed tomas-knap closed 11 years ago

tomas-knap commented 11 years ago

I tried Scenario 1: TED - Big file; the file is approximately 2.3 GB.

The extraction was executed successfully (at least based on the DPU Extractor events). However, when looking at the backend working directory, the generated file (which should not be there at all when Virtuoso is used, see bug #110) has 5 GB+, which is twice the input size. This could be caused by accidentally using a named graph already used before by a different pipeline, but I think that bug was already solved.

screen shot 2013-07-15 at 4 42 36 pm

Also, the loader was not successful, resulting in:

screen shot 2013-07-15 at 4 41 11 pm

If you think that Petr can advise on the second issue (with the loader), ask him or assign it to him.

Jirka, please try to run the big file as well. In my case, it took about an hour. And write a report here.

tomesj commented 11 years ago

Petr, please try to solve this problem with the context for DPURecord and the DPU merge, then assign it back to me :-)

tomas-knap commented 11 years ago

Jirko, what is the status? Did you try running the big file? Was it successful?

tomesj commented 11 years ago

I tried to run the BIG file (last night), but after about 3 hours I did not get a result. The extraction is very slow on my computer, and I also have too little free disk space, which may be causing the problem.

But the extraction finishes without errors (I now always try it in GUI mode - twice in debug mode, once in normal mode => it was practically the same).

I believe that if you run it on your computer, you will probably have no problems.

Here is one example of a run: [screenshots: big, big]

A potential problem I found is that all my pipelines with the BIG file stay in the "running" state (after I kill them), and when I then run some pipeline, the pipeline's date and time are automatically changed (updated) to match my last pipeline.

tomas-knap commented 11 years ago

Jirka, the figure above just shows that the extraction started, which we know has been working since iteration 1. If you are using Virtuoso, you do not need much memory to process big files, only space on your hard drive.

Jirka, you have to be able to run this, it is your responsibility. And if certain issues occur, how would you react to them? So clean up your HDD a bit or use a school computer, for example, but we cannot close iteration 3 before you test it successfully.

tomas-knap commented 11 years ago

Jirka, you are also not able to solve this issue if you cannot run the extractor. Please work on that today.

tomesj commented 11 years ago

I can run the extractor, but my extraction speed is about 110 000 triples/hour (I kept statistics for myself and that is the number I got). I think this file has more than 1 000 000 triples, so it would take about 8 hours on my computer. My free space is about 3 GB and that seems not to be enough.

I will try to run it one last time - if it finishes badly, I will have to install Tomcat, GitHub, NetBeans, ... on our home computer and try to run it there. That may be better.

tomas-knap commented 11 years ago

Well, then increase the free space and test again. Maybe there is the error that generates the 5 GB file (see the top of this issue), so I would suggest having 6 GB+ of free space.

tomesj commented 11 years ago

I cleaned up my disk space and the speed was much better - about 1 million triples per 10 minutes.

The extraction of data from the file completed successfully, but it took about 5.5 hours. Altogether 22 407 098 triples were extracted :-) I didn't expect that.

Then I got the same error as you: "Failed to prepare Context for DPURecord because of exception: Can't merge data units."

That says that the problem is somewhere in the context (Petr is responsible for that). I checked my merging method just to be sure, and it looks fine :-)

[screenshot]

skodapetr commented 11 years ago

I'm sorry for this but I'm afraid that this will not be completely on me:

This exception comes from the class backend.context.impl.PrimitiveDataUnitMerger, from:

            // and copy the data
            try {
                newDataUnit.merge(rightDataUnit);
            } catch (IllegalArgumentException e) {
                throw new ContextException(
                        "Can't merge data units, type miss match.", e);
            } catch (Exception e) {
                throw new ContextException("Can't merge data units.", e);
            }

This means that the exception is thrown during the execution of newDataUnit.merge, which is not my code. The code is the same for large and small DataUnits, so it looks fine to me as well.

But it's a generic Exception ... maybe something with memory? It would be nice to have more info here.

Jirka, do you have a test with such a huge file that covers this functionality?

Unfortunately I have only 2 GB of RAM and probably not an exactly up-to-date machine ... but I can still try to add more logging functionality, or write it into a file to get more information about the original exception, and at the end of the day run this scenario on my computer as well.

tomas-knap commented 11 years ago

When looking at the newDataUnit.merge(rightDataUnit); implementation, I would suggest implementing a Virtuoso-specific mergeRepositoryData() method, which would simply execute a query like: INSERT INTO <target_graph> { ?s ?p ?o } WHERE { GRAPH <source_graph> { ?s ?p ?o } }

This will be much more efficient than:

    for (Statement nextStatement : sourceStatemens) {
        if (graph != null) {
            targetConnection.add(nextStatement, graph);
        } else {
            targetConnection.add(nextStatement);
        }
    }
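
If it helps, here is a minimal sketch of that idea - my assumptions, not the project's actual code: it uses the OpenRDF/Sesame RepositoryConnection API already used above, standard SPARQL 1.1 INSERT/WHERE syntax instead of Virtuoso's INSERT INTO shorthand, and an illustrative class name (only the mergeRepositoryData name comes from the suggestion above):

    import org.openrdf.OpenRDFException;
    import org.openrdf.query.QueryLanguage;
    import org.openrdf.repository.RepositoryConnection;

    public class VirtuosoMergeSketch {

        /**
         * Copies all triples from sourceGraph into targetGraph with a single
         * SPARQL update, so the data never leaves the server.
         */
        public static void mergeRepositoryData(RepositoryConnection connection,
                String sourceGraph, String targetGraph) throws OpenRDFException {
            String update =
                    "INSERT { GRAPH <" + targetGraph + "> { ?s ?p ?o } } "
                  + "WHERE  { GRAPH <" + sourceGraph + "> { ?s ?p ?o } }";
            connection.prepareUpdate(QueryLanguage.SPARQL, update).execute();
        }
    }

The point is that the copy runs entirely on the server, so no statements are streamed through the client as in the loop above.
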
tomas-knap commented 11 years ago

I guess that the problem is with the memory.

tomas-knap commented 11 years ago

Jirka, also do not forget to turn on autocommit on a row-by-row basis and switch off logging - the command is log_enable(2), I think, but check the Virtuoso manual.
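
For reference, a rough sketch of what switching that on from code could look like - assuming (my assumption, please verify against the Virtuoso manual) that log_enable can be executed as a plain statement over the Virtuoso JDBC connection, just as in isql; the class and method names are illustrative:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class LogEnableSketch {

        /** log_enable(2): disables transaction logging and enables row-by-row autocommit. */
        public static void disableTransactionLog(Connection jdbcConnection) throws SQLException {
            try (Statement statement = jdbcConnection.createStatement()) {
                statement.execute("log_enable(2)");
            }
        }
    }
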

tomesj commented 11 years ago

I tried to merge the DPUs manually (as a test) - the problem is indeed with memory - there is not enough of it for the merge operation (error on Java heap space) :-)

I successfully tried to use something like:

    RDFInserter inserter = new RDFInserter(targetConnection);
    inserter.enforceContext(getDataGraph());
    sourceConnection.export(inserter, second.getDataGraph());

I will compare the speed of this with your proposed solution, Tomas, and we will use the better implementation. Is that OK?

tomas-knap commented 11 years ago

Ok, thanks, try it and let me know - put the results here.

tomesj commented 11 years ago

I added the merge implementation for Virtuoso.

The disadvantage is that it is only as fast (or slow) as adding data to Virtuoso normally is.

I also tested the ADD syntax (see http://www.w3.org/TR/2012/PR-sparql11-update-20121108/#add):

ADD <http://graph1> TO <http://graph2> (this adds all triples from graph1 to graph2 if they are not there yet = effectively a merge)

But I always got: Virtuoso 40005 Error SR325: Transaction aborted because it's log after image size went above the limit

I tried to set the log_enable value to level 3 [log_enable(3) and log_enable(3,1)] and ran it again, but the result was the same.

I used log_enable as described at: http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtTipsAndTricksGuideDeleteLargeGraphs

tomas-knap commented 11 years ago

Jirko, can you please provide the numbers, if you have any?

Regarding the logging, it should be log_enable(2), I think, which disables logging and enables row-by-row autocommit: http://docs.openlinksw.com/virtuoso/fn_log_enable.html

tomesj commented 11 years ago

What number did you mean? I always set log_enable(2) in the constructor and use it as the default :-)

Via:

    final String JDBC = "jdbc:virtuoso://" + hostName + ":" + port + "/charset=UTF-8/log_enable=2";
    VirtuosoRDFRepo virtuosoRepo = new VirtuosoRDFRepo(JDBC, user, password, defaultGraph, dataUnitName);

tomas-knap commented 11 years ago

I thought you said you would measure how each of the variants performs, so that is why I am expecting some numbers, in terms of the time needed to perform the task, memory consumed, etc. At least the time would be good.

tomesj commented 11 years ago

The average speed of copying the data was about 280 000 triples/minute (I measured the triple count every 10 minutes and then computed the average). The total time was 80 minutes = 1 hour 20 minutes.

I will add a more detailed description to the collab page :-)

tomas-knap commented 11 years ago

Ok, and the other approach we were considering?

tomas-knap commented 11 years ago

Also, please do not forget to attach the virtuoso.ini and information about your hardware. It is enough to put this into the report on collab. In that report, please compare the two or more approaches we were discussing, and also insert the fragment of code (it is a few lines) so that it is obvious how it was implemented.

tomesj commented 11 years ago

OK, the approach I described was using isql from the console as "sparql add <http://source_graph> to <http://target_graph>".

I then tested the approach using an RDFHandler and 2 Virtuoso connections (source and target) - there the speed was about 14 000 triples/minute (that is 20 times slower).

tomas-knap commented 11 years ago

Ok, great. Please also test the latest version (sparql add) with the statistical handler enabled, so that we can see the decrease.

tomas-knap commented 11 years ago

Jirka, I was trying to run the big TED file in normal mode and it failed silently; there is no error, but the extractor did not finish and nothing is happening in the backend, see below. I will try again in debug mode.

screen shot 2013-07-24 at 6 29 17 pm

tomesj commented 11 years ago

It is hard to say where the problem is if there is no error.

I let it run in debug mode (the result was as I described) - the extraction finishes in about 5.5 hours (on my machine). I corrected the merge implementation, but currently we use the version with 2 connections (the ADD version with graphs is 20 times faster, but it works for me only from the command line) - I am debugging it so it can be used in code. If I am successful with that, I will change it.

The current version works - you can try it, but it takes a lot of time :-)

janvojt commented 11 years ago

I tried running the TED-1 pipeline normally (not in debug mode). After about 20 minutes, the Virtuoso DB already had about 3 GB. After 45 minutes, the extractor finished and this appeared in the log. After 1 hour and ten minutes, the Virtuoso DB still had about 3 GB. The processors were running at about 5-20%. After 2 and a half hours the Virtuoso DB had about 4 GB and nothing else changed. The processors had no load. Virtuoso became unresponsive (I could log in, but whenever I issued a query it just froze). I restarted Virtuoso and found out the pipeline was still in the running state. The output file was not created at all. BTW, I chose the TTL format for the output file.

tomas-knap commented 11 years ago

Jirko, I thought you had already pushed the changes, so let us know when your update is available so that we can try it! Where is the problem? Are you still experimenting with the adjusted merge? Jirka, what do you mean by "it works for me only from the command line"? Jan, thanks for trying.

tomesj commented 11 years ago

"As only in command" I mean using as concrete isql execution from cmd. But when I try to connect to Virtuoso SPARQL endpoint (both - with/without authentisation) or something like that (using Vituoso connection - is OK, but very slowly), adding/merging triples of big graph fault. I seach how it use without cause no errors (something like using isql :-)

kukharm commented 11 years ago

I ran it yesterday at 24.07.2013 21:48:59, and at 25.07.2013 3:33:32 the extraction completed. Now the pipeline is still running, and there is no record that the loader started. I don't know if it makes sense to wait for it to finish. The pipeline has now been running for about 12 hours. The output file was not created.

tomas-knap commented 11 years ago

Jirka, as I already said, you can take inspiration from ODCleanStore: GraphLoaderUtils.insertRdfFromFile(con, dTTLFile, SerializationLanguage.N3, dataGraphURI, dataBaseUrl);

Excerpt from their code, important steps:
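
(The excerpt itself did not make it into this comment.) Purely as an illustration of the general shape - this is not the ODCleanStore code, just a minimal sketch assuming the Sesame RepositoryConnection API and illustrative names - streaming an N3/TTL file into a named graph could look like:

    import java.io.File;
    import java.io.IOException;
    import org.openrdf.OpenRDFException;
    import org.openrdf.model.URI;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.rio.RDFFormat;

    public class FileLoadSketch {

        /** Streams an N3/Turtle file from disk into the given named graph. */
        public static void insertRdfFromFile(RepositoryConnection connection,
                File rdfFile, String baseUri, URI dataGraph)
                throws IOException, OpenRDFException {
            connection.add(rdfFile, baseUri, RDFFormat.N3, dataGraph);
        }
    }
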

tomas-knap commented 11 years ago

Please work on that today, so that in the evening we can try it out.