Performance issues parsing the Freebase data dump

artob commented 11 years ago

The Turtle parser has significant performance issues parsing the latest Freebase data dump. This is reproducible with the following minimal test program:

#!/usr/bin/env ruby
require 'rdf'
require 'rdf/turtle'

count = 0

RDF::Turtle::Reader.open('freebase-rdf-2013-03-24-00-00.ttl') do |reader|
  reader.each_statement do |statement|
    count += 1
    p count if (count % 1_000).zero?
    exit if (count % 100_000).zero?
  end
end

On a Sandy Bridge-EP Xeon E5-1620 3.6GHz processor, it parses about 1,000 triples per second (clearly CPU bound, not I/O bound). At that rate, it would take about 14 days to parse and count the triples in this data dump. That's an issue in and of itself.

The real problem, though, is that when it gets to the 55,000 check point, no more visible progress is made for about 8-9 minutes despite the CPU continuing to run at 100%. At first I thought it had gone into an infinite loop, but it did eventually resume progress.

The next similar "halting problem" came after the 67,000 check point. At that point it had run for 10 minutes in total, and I proceeded to interrupt it; here's the stack trace:

$ time ./parse.rb
1000
...
55000
...
67000
^C
/var/lib/gems/1.9.1/gems/ebnf-0.2.2/lib/ebnf/ll1/parser.rb:466:in `block in skip_until_valid': Interrupt
    from /var/lib/gems/1.9.1/gems/ebnf-0.2.2/lib/ebnf/ll1/parser.rb:464:in `each'
    from /var/lib/gems/1.9.1/gems/ebnf-0.2.2/lib/ebnf/ll1/parser.rb:464:in `inject'
    from /var/lib/gems/1.9.1/gems/ebnf-0.2.2/lib/ebnf/ll1/parser.rb:464:in `skip_until_valid'
    from /var/lib/gems/1.9.1/gems/ebnf-0.2.2/lib/ebnf/ll1/parser.rb:233:in `parse'
    from /var/lib/gems/1.9.1/gems/rdf-turtle-1.0.2/lib/rdf/turtle/reader.rb:272:in `each_statement'
    from ./parse.rb:8:in `block in <main>'
    from /var/lib/gems/1.9.1/gems/rdf-turtle-1.0.2/lib/rdf/turtle/reader.rb:253:in `call'
    from /var/lib/gems/1.9.1/gems/rdf-turtle-1.0.2/lib/rdf/turtle/reader.rb:253:in `block in initialize'
    from /var/lib/gems/1.9.1/gems/rdf-1.0.4/lib/rdf/reader.rb:200:in `instance_eval'
    from /var/lib/gems/1.9.1/gems/rdf-1.0.4/lib/rdf/reader.rb:200:in `initialize'
    from /var/lib/gems/1.9.1/gems/rdf-turtle-1.0.2/lib/rdf/turtle/reader.rb:236:in `initialize'
    from /var/lib/gems/1.9.1/gems/rdf-1.0.4/lib/rdf/reader.rb:139:in `new'
    from /var/lib/gems/1.9.1/gems/rdf-1.0.4/lib/rdf/reader.rb:139:in `block in open'
    from /usr/lib/ruby/1.9.1/open-uri.rb:35:in `open'
    from /usr/lib/ruby/1.9.1/open-uri.rb:35:in `open'
    from /var/lib/gems/1.9.1/gems/rdf-1.0.4/lib/rdf/util/file.rb:43:in `open_file'
    from /var/lib/gems/1.9.1/gems/rdf-1.0.4/lib/rdf/reader.rb:127:in `open'
    from ./parse.rb:7:in `<main>'

real    10m30.760s
user    10m28.883s
sys 0m0.500s

One possible hypothesis is that these are garbage-collection pauses; but if so, they would certainly be the longest GC hickups I've seen. The pauses also happen at the very same check points if the program is re-run.

This was all done with the latest released stable versions of the RDF.rb and RDF::Turtle gems, as shown in the stack trace.

artob commented 11 years ago

Just to continue the report: the test program spent another ~10 minutes burning CPU after each of the 67K, 68K, 69K, and 72K check points as well. It did eventually reach the exit at 100K triples parsed:

...
100000
real    49m2.032s
user    48m50.783s
sys     0m4.520s

danharvey commented 11 years ago

I'm also using this library to parse the freebase dumps, I've not done any benchmarking yet but I've noticed it is slower than I thought it would be. I'll have a try at benchmarking with my code and let you know what I see.

artob commented 11 years ago

I left an improved version of the test program running for an hour. Here's the program:

#!/usr/bin/env ruby
require 'rdf'
require 'rdf/turtle'

STEP = 100
MAX  = 1_000_000

RDF::Turtle::Reader.open('freebase-rdf-2013-03-24-00-00.ttl') do |reader|
  count = 0
  time  = Time.now.to_f
  reader.each_statement do |statement|
    count += 1
    exit if count > MAX
    if (count % STEP).zero?
      time_now = Time.now.to_f
      dt = time_now - time
      puts "%d triples parsed... (%.2fs; %.2fT/s)" % [count, dt, STEP / dt]
      $stdout.flush
      time = time_now
    end
  end
end

The output at the first choke point at around 55,000 triples is as follows:

54900 triples parsed... (0.12s; 828.45T/s) 
55000 triples parsed... (0.12s; 839.07T/s) 
55100 triples parsed... (0.16s; 629.70T/s) 
55200 triples parsed... (0.13s; 772.52T/s) 
55300 triples parsed... (0.16s; 626.27T/s) 
55400 triples parsed... (226.23s; 0.44T/s) 
55500 triples parsed... (0.16s; 629.90T/s) 
55600 triples parsed... (433.82s; 0.23T/s) 
55700 triples parsed... (0.12s; 862.24T/s) 
55800 triples parsed... (0.15s; 660.02T/s) 
55900 triples parsed... (0.12s; 847.14T/s) 
56000 triples parsed... (0.15s; 659.49T/s) 
56100 triples parsed... (0.11s; 888.16T/s)

gkellogg commented 11 years ago

The fact that it's in #skip_until_valid indicates that it is in error recovery mode, having found some illegal syntax. Error recovery is very expensive, and if it does not quickly get to valid turtle, or the inout has a substantial amount of invalid data, this could result in the types of performance issues you're seeing.

My previous tests showed performance to be about 1000 triples/second. Given the regular nature of the dump, this indicates to me that there must be a rather large amount of garbage in the dump, but I'll investigate further.

danharvey commented 11 years ago

I'm using jRuby on a 2.8 Ghz Intel Core i7 MBP, I get the following

50500 triples parsed... (0.11s; 925.93sT/s) 50600 triples parsed... (0.11s; 877.19sT/s) 50700 triples parsed... (12.81s; 7.80sT/s) 50800 triples parsed... (0.18s; 564.97sT/s) 50900 triples parsed... (0.22s; 450.45sT/s) 51000 triples parsed... (18.98s; 5.27sT/s) 51100 triples parsed... (0.17s; 571.43sT/s) 51200 triples parsed... (12.81s; 7.81sT/s) 51300 triples parsed... (0.15s; 684.93sT/s) 51400 triples parsed... (0.13s; 787.40sT/s)

It does seem to be running the JVM GC quite a bit, I've not tried to tune that.

The freebase dump is very messy. They have lots of invalid URIs, literal values and double prefixes. They are aiming for RDF 1.1 compliance but as that's not finalised yet they've got a way to go. Maybe it is just the bad data, I've got scripts to clean it up before I'm using it so I'll see what happens performance wise if I run it through that first.

gkellogg commented 11 years ago

If you could share your cleanup scripts, that would be very useful.

danharvey commented 11 years ago

I can't easily separate out that code right now, but I based it from word done here:

http://people.apache.org/~andy/Freebase20121223/

Which highlights most of the issues with the dump from last year. It keeps changing though, so that's not complete, Google are also actively working on fixing many of these issue.

On 28 March 2013 14:46, Gregg Kellogg notifications@github.com wrote:

If you could share your cleanup scripts, that would be very useful.

— Reply to this email directly or view it on GitHubhttps://github.com/ruby-rdf/rdf-turtle/issues/2#issuecomment-15592311 .

artob commented 11 years ago

@danharvey, thanks for the link to that Freebase data dump clean-up information.

@gkellogg, I'm not sure that there's anything syntactically illegal, per se, in the first 100,000 lines. Raptor hits the first problem only at line 135,351:

$ head -n200000 freebase-rdf-2013-03-24-00-00.ttl | rapper -i turtle -c - http://rdf.freebase.com/ns/
rapper: Parsing file <stdin> with parser turtle and base URI http://rdf.freebase.com/ns/
rapper: Error - URI http://rdf.freebase.com/ns/:135351 - syntax error
rapper: Failed to parse file <stdin> turtle content
rapper: Parsing returned 135343 triples

On the other hand, there are of course the Turtle 2008 vs Turtle 2013 differences to take into account, if your grammar doesn't exactly match whatever Raptor uses.

gkellogg commented 11 years ago

Looking at the dump found an optimization, and turned up a character encoding issue. There are changes to the underlying EBNF gem, which allow the Turtle processor to tell it to reset the parse state at the beginning of each sentence. Otherwise, the LL(1) rules cause it to add a state for each statement.

artob commented 11 years ago

Sounds good, will give this a try once I have a minute. In the meantime, this is how far the test program I left running had progressed:

133600 triples parsed... (0.07s; 1419.42T/s)
133700 triples parsed... (0.20s; 493.41T/s)
133800 triples parsed... (0.07s; 1373.76T/s)
133900 triples parsed... (0.07s; 1413.80T/s)
134000 triples parsed... (0.20s; 497.36T/s)
134100 triples parsed... (0.08s; 1293.71T/s)
134200 triples parsed... (0.07s; 1367.05T/s)
134300 triples parsed... (4778.96s; 0.02T/s)
134400 triples parsed... (9559.31s; 0.01T/s)
134500 triples parsed... (11981.59s; 0.01T/s)
^C
real    765m58.921s
user    763m58.041s
sys     0m17.641s

gkellogg commented 11 years ago

Those numbers are consistent with memory growth slowing down the whole process. After the update, I no longer see long pauses as you experienced before. However, there are some URIs that show up as errors, that Jena passes.

danharvey commented 11 years ago

Thanks this works better for us as well. Freebase also has a HTTP endpoint to this RDF which has different issues, I thought this might help there too but it still seems to break on incorrect lines.

https://www.googleapis.com/freebase/v1/rdf/m/0mwlfn0

I think the issue here is that as it's the short hand turtle format that as soon as it skips the broken line in gets the subject wrong for all preceding lines.

RDF::Statement:0x1902(<ns:m.0mwlfn0 ns:common.topic.article ns:m.0mwryk7 .)> RDF::Statement:0x1904(<ns:m.0mwlfn0 ns:common.topic.description "William Fazan is an actor."@en .)> RDF::Statement:0x1906(<ns:m.0mwlfn0 ns:common.topic.notable_for .)> RDF::Statement:0x1908(<ns:common.topic.notable_types ns:common.topic.notable_types ns:people.person .)> RDF::Statement:0x190a(<ns:common.topic.notable_types ns:common.topic.topic_equivalent_webpage http://www.imdb.com/name/nm0269753/ .)> RDF::Statement:0x190c(<ns:common.topic.notable_types ns:film.actor.film..film.performance.character ns:m.0mwrz3t .)> RDF::Statement:0x190e(<ns:common.topic.notable_types ns:film.actor.film..film.performance.film ns:m.051v_l1 .)> RDF::Statement:0x1910(<ns:common.topic.notable_types ns:film.actor.film ns:m.0mwlfmy .)>

Should I open another ticket for this?

gkellogg commented 11 years ago

It does a best-effort job at error recovery, but it ahouldn't affect things before the error. I found that it does pretty well if you clean up the data first, but it does halt when it should just go into error recovery after abut 500,000 triples. When if finds an incorrect token, it skips to the next token which cab follow the production; eventually, it should hit a "." and start over.

If you can tell me about how far into the live stream it goes south, that would be useful.

Note that this processor may not be the best thing for consuming so much data. It runs at about 1k triples per second, but that's substantially shy of what Jena does, for example. Even the N-Triples processor is only about 10k/s.

The fastest way in Ruby would probably be to adapt Andy's Perl cleanup script into Ruby and output N-Triples directly.

artob commented 11 years ago

It would be a good idea to bump the EBNF dependency to the fixed version, as updating the RDF::Turtle gem by itself didn't yet fix this issue. After updating both gems, I confirm that the choke points are indeed gone. The parsing also appears to be 20-30% faster overall.

gkellogg commented 11 years ago

Actually, this release broke some other stuff, so I'll need to release another :(.

Doing much more on the processing speed will require a fair rewrite, to use a custom, rather than LL(1)-based parser. Something I may do, but it will never reach the speed of N-Triples, and that's not particularly fast either, compared with Redland or Jena. Still, Turtle's pretty important, so it's probably something I'll come back to.

artob commented 11 years ago

I like your EBNF approach, and I'm not sure a hand-rolled Turtle parser in Ruby is worthwhile unless the speed improvement would be at least a couple of orders of magnitude. As you say, Ruby and RDF.rb are not necessarily the right approaches to begin with when dealing with larger volumes of data. But they're damn convenient for everything else, and wouldn't want to compromise the implementation clarity and elegance there.

danharvey commented 11 years ago

@gkellogg the bug goes through to the end of the file from the point it happens.

I agree Ruby isn't a great language to process the whole file, I'm just working on small sections of it at a time which parallelises well. I think a combination of the Jena reader with the rdf.rb API might be a interesting combination for performance and simplicity, I think that's probably quite simple to do with jRuby when I get more time to work on this.

gkellogg commented 11 years ago

As noted in the commit message, there is now a simple Freebase reader which has substantially better performance (I'm measuring about 4K-6K TPS vs 1K TPS). Invoke using :freebase => true as a reader option. If you use the script/parse wrapper within the Gem, add --freebase to the arguments.

This is based on the N-Triples reader, with added patterns for @prefix, PNAME and Boolean.

I have some more error recovery work I want to do (mostly if not entirely in ebnf) before I put out another update.

artob commented 11 years ago

Thanks, Gregg. This does seem to be up to 4x faster parsing the Freebase data dump.

One disparity I can't fully account for is that while the first split (at 10M) and fixed-up input file, xaa.ttl, has 8.28M lines, parsing it with the HEAD version of RDF::Turtle yields only 7.84M statements. That seems like a large diff. I'm assuming invalid statements (e.g. missing the object term) are just silently ignored, and that there must be that many of them per part.

gkellogg commented 11 years ago

That's probably the case; the :validate option shows errors, but terminates, consistent with how N-Triples handles it. Perhaps in both cases, we should log the error and continue. But, logging will consume memory. Printing the error and only raising an exception at the end of processing is probably the way to go.

There could be a large number of marginally non-conforming triples.

danharvey commented 11 years ago

Thanks for adding a separate Freebase reader, quite useful for us to help speed things up a little!

After parsing the whole file using jRuby in Hadoop, I've found that there are some errors using the freebase reader with

expected object in line 20: "208.8." file:rdf-1.0.4/lib/rdf/reader.rb:411

But I think you may be catching all issues now in master?

I was also passing in @prefixes = {...} as I'm processing this in chunks, is it possible to add that as an option for the freebase reader as it was before with the ttl reader?

I'm not sure how much faster it is yet in jRuby with this method, but I'll be able to let you know soon.

gkellogg commented 11 years ago

Yes, I changed behavior on trunk to not abort unless the --validate option is specified (it records to $stderr). I have a bunch of stuff waiting on changes to RDF.rb, and I need to validate against the other gems before I'm ready for a new public release, but I think that RDF::Turtle, itself is pretty stable.

Yes, I can include prefixes passed in through options; I'll do that shortly.

Thanks for the feedback; if you can operate using a bundle Gemfile including the GitHub versions of the various gems, that would be most useful in planning a new release. I'll note this issue on the rdf-turtle commit which adds support for prefix options to the FreebaseReader.

ruby-rdf / rdf-turtle

Performance issues parsing the Freebase data dump #2