ruby-rdf / json-ld

Ruby JSON-LD reader/writer for RDF.rb
The Unlicense
232 stars 27 forks source link

Over a minute to create a graph #46

Closed typhoon2099 closed 5 years ago

typhoon2099 commented 5 years ago

I've installed the latest version of this gem and run it over an HTML, but using JSON::LD::Reader.new takes over a minute to parse the input. I've set logging to false and skip_registry to true (based on other issues that have been closed here) but they don't make any difference to the time it takes to parse the file. By comparison, scanning the same HTML for Microdata takes a fraction of a second.

Am I missing something obvious?

gkellogg commented 5 years ago

Can you provide a reference to the file having the issues? I’ve definitely seen html with thousands of JSON-LD script elements take a long time to load.

typhoon2099 commented 5 years ago

Yes, here you go. The HTML itself has no JSON-LD content, it just takes a long time to realised that.

gkellogg commented 5 years ago

I'm not sure how your using this, but if you call JSON::LD::Reader.new (or JSON::LD::API.expand) referencing the URL, and if it has the Content-Type: text/html header, the API method will treat it as HTML. When I tried this locally (after a small change to script/parse, it parsed the file in 0.036 seconds, and of course returned no content.

I tried the downloaded file using rdf count kdYEZKJx.html and it does take 0.4 seconds to parse 165 statements of RDFa (with several HTML errors noted). That parses using RDF::RDFa::Reader.new.

typhoon2099 commented 5 years ago

Hmm, that's weird. We tried parsing the file here and it took over a minute. We used JSON::LD::Reader.new and added the skip_registry: false option. I'll have to look into this further.

UPDATE: JSON::LD::Reader.new(html) is still taking an age for me. I've stuck timings through our calls and this appears to be the one line that's killing our performance. As I say, the RDF::Reader.for(:microdata).new(html) runs almost instantly, so I can't understand what's happening here.

EDIT: I'm loading the file using file.open and passing it in as a string.

typhoon2099 commented 5 years ago

Right, I've tested a few different things. First, I passed in the HTML as a file, everything went quickly. I then tried passing the HTML in as a StringIO, which started causing JSON::ParserErrors. Finally, I used lib/json/ld/reader.rb:51 exactly to pass into JSON::LD::Reader.new. My tests passed again but were back to being painfully slow.

It seems the second regex on that line is causing some serious slowdown when a string is passed in. When a StringIO is passed in without the regexes then it isn't used and causes a JSON::ParseError, and when a file is loaded in it all runs fine. I'm guessing that the subs are stripping out anything that isn't the LD+JSON lines (it's not the easiest to read), can this be done some other way? The regexes also seem to be catching other CSS and Javascript on the page, causing other JSON::ParserErrors. Would it be more reliable to look for the entire <script type="application/ld+json"> element and strip that down?

gkellogg commented 5 years ago

There is some inconsistency, probably, when passing text, vs. a file descriptor into JSON::LD::Reader#initialize. There was a fairly minimal (probably too minimal) change to support processing HTML documents. When passing a string, it tries to remove everything before and after JSON content, which is no longer called for in the spec. That could be taking some time, although a profile would be needed. Also, when turning into StringIO, there is no content_type available, so no way to know that it's processing HTML rather than JSON. An option to override content_type would be required here. A recent change to script/parse, used for internal testing, sniffs this based on the .html file extension, and creates an instance method for content_type to return text/html in this case.

The parse error on StringIO comes, as I mentioned because of the presumed content_type of application/ld+json. There would be a similar issue with JSON::LD::Reader.open, unless it's retrieved from HTTP, where content_type is set properly.

Note, though, that you'll probably have more success parsing it as RDFa using RDF::RDFa::Reader.new, as RDFa specifically looks for other content embedded in HTML, including JSON-LD. You'd also get RDFa and Microdata triples in addition to the JSON-LD.

Adding a content_type option to Reader.open and Reader.new is probably a good idea, though, as a way to override what it gets or assumes otherwise. This needs to go in RDF.rb RDF::Reader#initialize, though.

typhoon2099 commented 5 years ago

response_container_BBPPID{font-family: initial; font-size:initial; color: initial;}Ah, I didn't realise I could use RDF::RDFa. My current workaround is to parse the file using Nokgiri and look for script elements with the application/ld+json type. Currently quite fast but if I can search for most schema types in one then I'm just going to do that instead.