gkellogg / rdf_context

Ruby RDF package with contextual graphs, memory and persistent datastores and compliant RDF/XML, RDFa and N3 parsers. (Deprecated, please see RDF.rb https://github.org/gkellogg/rdf and my other related gems)

Parsing very slow on larger files #3

Open ijdickinson opened 14 years ago

ijdickinson commented 14 years ago

I'm reading in a bunch of RDF files, each into its own RdfContext::Graph. The results below show the timings I'm getting. Small files load just fine; larger files take disproportionately long. One file takes 8.5 minutes to load 38k triples. I'm running on a quad-core 64-bit Ubuntu system with 8 GB of memory and using Ruby 1.9.1, so I don't think the raw performance of the machine is an issue.

log file output:

loading concept definitions...
Initializing coins_concept with target/def/sector.nt ... parsing complete in 0.1s producing 39 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/sector
Initializing coins_concept with target/def/data-type.nt ... parsing complete in 1.6s producing 487 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/data-type
Initializing coins_concept with target/def/programme-admin.nt ... parsing complete in 0.2s producing 47 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/programme-admin
Initializing coins_concept with target/def/cga-body-type.nt ... parsing complete in 0.2s producing 47 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/cga-body-type
Initializing coins_concept with target/def/resource-capital.nt ... parsing complete in 0.1s producing 39 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/resource-capital
Initializing coins_concept with target/def/pesa-transfer.nt ... parsing complete in 0.3s producing 87 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-transfer
Initializing coins_concept with target/def/account-code.nt ... parsing complete in 20.2s producing 4711 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/account-code
Initializing coins_concept with target/def/estimate-number.nt ... parsing complete in 2.5s producing 503 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-number
Initializing coins_concept with target/def/cofog.nt ... parsing complete in 4.5s producing 1271 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/cofog
Initializing coins_concept with target/def/department-code.nt ... parsing complete in 3.3s producing 847 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/department-code
Initializing coins_concept with target/def/budget-capital-current.nt ... parsing complete in 0.3s producing 47 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/budget-capital-current
Initializing coins_concept with target/def/request-for-resources-next-year.nt ... parsing complete in 0.2s producing 63 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/request-for-resources-next-year
Initializing coins_concept with target/def/counterparty-code.nt ... parsing complete in 1.7s producing 431 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/counterparty-code
Initializing coins_concept with target/def/pesa-delivery.nt ... parsing complete in 0.1s producing 31 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-delivery
Initializing coins_concept with target/def/income-category.nt ... parsing complete in 0.5s producing 111 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/income-category
Initializing coins_concept with target/def/estimate-line.nt ... parsing complete in 2.1s producing 615 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-line
Initializing coins_concept with target/def/programme-object-group-code.nt ... parsing complete in 125.7s producing 15895 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/programme-object-group-code
Initializing coins_concept with target/def/estimates-aina.nt ... parsing complete in 0.1s producing 39 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimates-aina
Initializing coins_concept with target/def/estimates-capital-current.nt ... parsing complete in 2.1s producing 63 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimates-capital-current
Initializing coins_concept with target/def/activity-code.nt ... parsing complete in 6.0s producing 1375 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/activity-code
Initializing coins_concept with target/def/estimate-number-next-year.nt ... parsing complete in 2.4s producing 503 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-number-next-year
Initializing coins_concept with target/def/accounting-authority.nt ... parsing complete in 0.9s producing 159 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/accounting-authority
Initializing coins_concept with target/def/pesa-current-grants.nt ... parsing complete in 1.0s producing 215 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-current-grants
Initializing coins_concept with target/def/estimate-line-next-year.nt ... parsing complete in 2.8s producing 615 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-line-next-year
Initializing coins_concept with target/def/request-for-resources.nt ... parsing complete in 0.2s producing 63 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/request-for-resources
Initializing coins_concept with target/def/pesa-services.nt ... parsing complete in 0.4s producing 39 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-services
Initializing coins_concept with target/def/estimate-line-last-year.nt ... parsing complete in 2.6s producing 575 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-line-last-year
Initializing coins_concept with target/def/nac.nt ... parsing complete in 4.0s producing 951 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/nac
Initializing coins_concept with target/def/estimate-number-last-year.nt ... parsing complete in 2.5s producing 495 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-number-last-year
Initializing coins_concept with target/def/budget-boundary.nt ... parsing complete in 0.1s producing 39 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/budget-boundary
Initializing coins_concept with target/def/pesa-1.1.nt ... parsing complete in 0.1s producing 31 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-1.1
Initializing coins_concept with target/def/esa.nt ... parsing complete in 2.6s producing 543 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/esa
Initializing coins_concept with target/def/territory.nt ... parsing complete in 0.2s producing 71 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/territory
Initializing coins_concept with target/def/data-subtype.nt ... parsing complete in 2.3s producing 471 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/data-subtype
Initializing coins_concept with target/def/department-group.nt ... parsing complete in 2.1s producing 439 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/department-group
Initializing coins_concept with target/def/signage.nt ... parsing complete in 0.1s producing 31 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/signage
Initializing coins_concept with target/def/request-for-resources-last-year.nt ... parsing complete in 0.4s producing 63 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/request-for-resources-last-year
Initializing coins_concept with target/def/programme-object-code.nt ... parsing complete in 513.3s producing 38855 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/programme-object-code
Initializing coins_concept with target/def/sbi.nt ... parsing complete in 8.1s producing 455 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/sbi
Initializing coins_concept with target/def/time.nt ... parsing complete in 0.8s producing 119 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/time
Total time taken 720.9s

The files are in N-Triples format. I also tried Turtle input but gave up after waiting too long! I've tried both :list_store and :memory_store, and it doesn't make much difference.

My guess is that something in the parser loop is not scaling linearly with the size of the input file, but that's just a guess. I don't think there's anything special about the input files themselves, but am happy to provide copies if that helps with debugging.
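As a rough check of the non-linear scaling guess, the three largest entries in the log can be compared directly. This is only a sketch over figures copied from the log above; the constant and method names (`SAMPLES`, `scaling_exponent`, `EXPONENTS`) are made up for illustration and are not part of RdfContext.

```ruby
# Timings copied from the log above: [triples, seconds]
SAMPLES = [
  [4711, 20.2],    # account-code.nt
  [15895, 125.7],  # programme-object-group-code.nt
  [38855, 513.3]   # programme-object-code.nt
].freeze

# Slope of the log-log line between two samples.
# A value of 1.0 would mean parse time grows linearly with triple count.
def scaling_exponent(a, b)
  Math.log(b[1] / a[1]) / Math.log(b[0].to_f / a[0])
end

# Throughput per file: it should stay flat if parsing were linear.
SAMPLES.each do |triples, seconds|
  puts format("%6d triples: %7.1f triples/s", triples, triples / seconds)
end

# Exponent estimated between each pair of neighbouring samples.
EXPONENTS = SAMPLES.each_cons(2).map { |a, b| scaling_exponent(a, b) }
EXPONENTS.each { |e| puts format("estimated exponent: %.2f", e) }
```

Throughput drops as the files grow, and both estimated exponents come out at roughly 1.5, which is consistent with something worse than linear but does not by itself say where in the parser or store the extra cost comes from.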

Ian

gkellogg commented 14 years ago

The SQLite3 store will provide persistent storage, and may scale better for even larger graphs, but it is slower for smaller graphs. That would be :store => SQLite3.new(:path => "store.db"). You may also have found a memory leak within the Parser. The N-Triples parser is the same one used for Turtle/N3, so that could be the issue. Do you have the same problem parsing large files in other serializations?

If you have a script to run through these, I'll check it out.
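A minimal sketch of the kind of timing script that would help reproduce this, assuming the files live under target/def/ as in the log. The `parser` argument is a placeholder callable standing in for the real RdfContext parse call, which is not shown here; `time_parses`, `report` and `LINE_COUNTER` are hypothetical names.

```ruby
require "benchmark"

# Times one parse call per file. `parser` is any callable that takes a path
# and returns the number of triples produced (a stand-in for the real parser).
def time_parses(paths, parser)
  paths.map do |path|
    triples = nil
    seconds = Benchmark.realtime { triples = parser.call(path) }
    { path: path, triples: triples, seconds: seconds }
  end
end

# Prints results ordered by size so a drop in throughput is easy to spot.
def report(results)
  results.sort_by { |r| r[:triples] }.each do |r|
    rate = r[:seconds] > 0 ? r[:triples] / r[:seconds] : Float::INFINITY
    puts format("%-45s %7d triples %8.2fs %10.1f triples/s",
                r[:path], r[:triples], r[:seconds], rate)
  end
end

# Baseline "parser" that only counts statement lines (N-Triples has one
# statement per line). It measures raw file reading, not RDF parsing.
LINE_COUNTER = lambda do |path|
  File.foreach(path).count { |l| !l.strip.empty? && !l.start_with?("#") }
end

if __FILE__ == $PROGRAM_NAME
  report(time_parses(Dir.glob("target/def/*.nt"), LINE_COUNTER))
end
```

Swapping `LINE_COUNTER` for a lambda that calls the actual parser, once per store type, would show whether the slowdown follows the parser or the store.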

Also, note that the same parsers and serializers in RdfContext are also available through RDF.rb as rdf-rdfa, rdf-n3 and rdf-rdfxml. RDF.rb has a richer infrastructure for graph storage than RdfContext. I've also noticed that RDF/XML parsing is substantially faster, due to some underlying optimizations in that implementation.