ruby-rdf / json-ld

Ruby JSON-LD reader/writer for RDF.rb
The Unlicense

Schema data no longer found in a non-deterministic manner #53

Closed typhoon2099 closed 3 years ago

typhoon2099 commented 3 years ago

I have RSpec tests that look for Product data and return the first Product found on a page (and try to merge solutions together to get an Array of images for that Product). This was working on 3.1.8, but after updating to 3.1.9 the tests fail intermittently because the solutions now come back in no consistent order.

Here's some example HTML:

<html>
<script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "The first Product",
    "sku": "12345",
    "url": "https://site.com/first-product",
    "image": "https://site.com/first-image.jpg",
    "offers": {
      "@type": "Offer",
      "price": "10",
      "priceCurrency": "CNY",
      "offeredBy": "Bob"
    }
  }
</script>
<script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "A second product",
    "sku": "67890",
    "url": "https://site.com/second-product",
    "image": [
      "https://site.com/second-image.jpg",
      "https://site.com/third-image.jpg"
    ],
    "offers": {
      "@type": "Offer",
      "price": "100",
      "priceCurrency": "GBP",
      "offeredBy": "Alice"
    }
  }
</script>
<script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "The first Product",
    "sku": "12345",
    "url": "https://site.com/first-product",
    "image": [
      "https://site.com/first-image.jpg",
      "https://site.com/fourth-image.jpg"
    ],
    "offers": {
      "@type": "Offer",
      "price": "10",
      "priceCurrency": "CNY",
      "offeredBy": "Bob"
    }
  }
</script>
<script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "The first Product",
    "sku": "12345",
    "url": "https://site.com/first-product",
    "offers": {
      "@type": "Offer",
      "price": "10",
      "priceCurrency": "CNY",
      "offeredBy": "Bob"
    }
  }
</script>
</html>

One of the tests expects to find a URL of https://site.com/first-product, but now fails half the time, returning https://site.com/second-product instead.

Is this a known issue, and if so, is there a way to ensure that solutions come back deterministically (i.e. in the order they're found in the HTML)?

gkellogg commented 3 years ago

If it’s failing to load consistently, the problem is either with the connection or the source location. Note that you can either override the documentLoader with your own custom loader, or configure with the appropriate gem using the RDF::Util::File extensions.

typhoon2099 commented 3 years ago

The source location is a Nokogiri::XML::Element (unchanged between upgrades). I'm getting my solutions using:

PRODUCT_LINKS_QUERY = %(
      PREFIX rsp: <http://rubygems.org/gems/sparql#>
      PREFIX s: <http://schema.org/>

      SELECT ?url ?image ?description
      WHERE {
        { [] ?p s:OfferCatalog } UNION { [] ?p s:ItemList }
        []    s:itemListElement ?item .
        ?item s:url             ?url

        OPTIONAL {
          ?item  s:image/s:url* ?image
          FILTER (!isBlank(?image))
        }
        OPTIONAL { ?item s:name ?description }
      }
  )

(
    RDF::Graph.new << RDF::RDFa::Reader.new(nokogiri_document, base_uri: base_uri)
).query(SPARQL.parse(PRODUCT_LINKS_QUERY))

The above code is slightly convoluted as I'm merging different methods together to keep the code smaller.

gkellogg commented 3 years ago

RDF does not guarantee any inherent order to the data, and the default Graph/Repository uses a hash structure that is known for not preserving input order. You might add an ORDER BY clause to the query.
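As a rough illustration of the ordering advice, the same idea can be applied on the Ruby side by choosing the result under an explicit sort key instead of taking whatever comes first. The sketch below uses plain hashes as a stand-in for query solutions; the `solutions` array and the `first_product_url` helper are hypothetical and not part of RDF.rb. Shuffling mimics a store that does not preserve insertion order. Note that this gives a stable lexicographic order, not the order of the blocks in the HTML.

```ruby
# Stand-in for query solutions: plain hashes, not RDF::Query::Solution objects.
solutions = [
  { url: "https://site.com/first-product",  image: "https://site.com/first-image.jpg" },
  { url: "https://site.com/second-product", image: "https://site.com/second-image.jpg" },
  { url: "https://site.com/second-product", image: "https://site.com/third-image.jpg" },
  { url: "https://site.com/first-product",  image: "https://site.com/fourth-image.jpg" }
]

# Hypothetical helper: pick the URL that sorts first lexicographically,
# so the answer does not depend on the order the store yields solutions.
def first_product_url(solutions)
  solutions.map { |s| s[:url] }.min
end

# Shuffle several times to mimic an unordered store; the result stays the same.
results = Array.new(5) { first_product_url(solutions.shuffle) }
puts results.uniq.inspect
```

Within the query itself, the equivalent would be an `ORDER BY ?url` clause after the `WHERE` block.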

Although it’s not really an effective solution, there is an rdf-ordered-repo gem that will preserve insertion order.

typhoon2099 commented 3 years ago

Okay. I wasn't actually sure whether this was a bug or not; more likely we were just lucky that the order was preserved in the first place. I'm not sure what changed or why, as the diff looks fairly innocuous. I'll have to think about how to handle these unusual Graphs (if at all).
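For the image-merging case described in the first comment, one order-independent approach is to group solutions by a stable key such as the product URL, then sort the merged values. This is only a minimal sketch: plain hashes stand in for query solutions, `merge_images` is a hypothetical helper, and the data mirrors the example HTML above.

```ruby
# Plain hashes standing in for query solutions. The image is nil for a
# Product block that declares no image (as in the last script block).
solutions = [
  { url: "https://site.com/first-product",  image: "https://site.com/first-image.jpg" },
  { url: "https://site.com/second-product", image: "https://site.com/second-image.jpg" },
  { url: "https://site.com/second-product", image: "https://site.com/third-image.jpg" },
  { url: "https://site.com/first-product",  image: "https://site.com/fourth-image.jpg" },
  { url: "https://site.com/first-product",  image: nil }
]

# Hypothetical helper: group by URL, then collect the unique, sorted images
# for each product so the result is independent of input order.
def merge_images(solutions)
  solutions
    .group_by { |s| s[:url] }
    .transform_values { |rows| rows.map { |r| r[:image] }.compact.uniq.sort }
end

merged = merge_images(solutions.shuffle)
puts merged["https://site.com/first-product"].inspect
```

Looking a product up by its URL (or SKU) in the merged hash avoids depending on which solution happens to come back first.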