ruby-rdf / sparql

Ruby SPARQL library
http://rubygems.org/gems/sparql
The Unlicense

OPTIONAL not behaving as expected #3

Closed ekolvets closed 11 years ago

ekolvets commented 11 years ago

I am running the following sample SPARQL query with an OPTIONAL block, and it appears that the bindings from the outer query are not being passed through to the inner query. The solution that is ultimately returned is correct (/people/1), but because the ?entity binding is not passed to the optional pattern, retrieving results from, say, a SQL backend becomes costly: there is nothing to restrict the lookup to a particular context. Would it be possible to include the bindings in queries to optional patterns?

require 'rdf'
require 'sparql'

class TestRepo

  def query(pattern, &block)
    puts "received pattern #{pattern.inspect}"

    statements = []
    if pattern[:predicate].is_a?(RDF::URI) and pattern[:predicate].path == '/attribute_types/first_name'
      statements << RDF::Statement.new(
        :subject   => RDF::URI.new('http://localhost/people/1'),
        :predicate => RDF::URI.new('http://localhost/attribute_types/first_name'),
        :object    => RDF::Literal.new('joe'))
    elsif pattern[:predicate].is_a?(RDF::URI) and pattern[:predicate].path == '/attribute_types/last_name'
      statements << RDF::Statement.new(
        :subject   => RDF::URI.new('http://localhost/people/1'),
        :predicate => RDF::URI.new('http://localhost/attribute_types/last_name'),
        :object    => RDF::Literal.new('smith'))
    elsif pattern[:predicate].is_a?(RDF::URI) and pattern[:predicate].path == '/attribute_types/middle_name'

      statements << RDF::Statement.new(
        :subject   => RDF::URI.new('http://localhost/people/2'),
        :predicate => RDF::URI.new('http://localhost/attribute_types/middle_name'),
        :object    => RDF::Literal.new('blah'))

      statements << RDF::Statement.new(
        :subject   => RDF::URI.new('http://localhost/people/1'),
        :predicate => RDF::URI.new('http://localhost/attribute_types/middle_name'),
        :object    => RDF::Literal.new('blah'))

    end

    statements.each(&block)
  end
end

query = %q(
PREFIX a: <http://localhost/attribute_types/>
  SELECT ?entity
  WHERE {
    ?entity a:first_name 'joe' .
    ?entity a:last_name 'smith' .
    OPTIONAL {
      ?entity a:middle_name 'blah'
    }
  }
)

rep = TestRepo.new
sse = SPARQL.parse(query)

solutions = sse.execute(rep)

solutions.each_solution do |s|
  puts s.to_hash
end

gkellogg commented 11 years ago

I can't really see what you think is wrong here. The SSE generated is the following:

(prefix
 ((a: <http://localhost/attribute_types/>))
 (project
  (?entity)
  (leftjoin
   (bgp
    (triple ?entity a:first_name "joe")
    (triple ?entity a:last_name "smith")
   )
   (bgp (triple ?entity a:middle_name "blah"))
  )
 )
)

This clearly shows that ?entity is used in both BGP statements. If I run your example, I get the following output:

received pattern {:subject=>?entity, :predicate=>#<http://localhost/attribute_types/first_name>, :object=>"joe"}
received pattern {:subject=><http://localhost/people/1>, :predicate=><http://localhost/attribute_types/last_name>, :object=>"smith"}
received pattern {:subject=>?entity, :predicate=><http://localhost/attribute_types/middle_name>, :object=>"blah", :context=>false}
{:entity=><http://localhost/people/1>}

What's happened is that the first BGP evaluates and binds ?entity to http://localhost/people/1. The second query now uses that bound value to find the middle name. If you had multiple subjects that matched, you'd see the OPTIONAL BGP run with each of them.

You can see more details of the execution path by passing :debug => true to execute:

solutions = sse.execute(rep, :debug => true)

  LeftJoin
  =>(left) [#<RDF::Query::Solution:0x3fdb32f6dedc({:entity=>#<RDF::URI:0x3fdb32f6e558(http://localhost/people/1)>})>]
  =>(right) [#<RDF::Query::Solution:0x3fdb32f70cf4({:entity=>#<RDF::URI:0x3fdb32f6d9dc(http://localhost/people/2)>})>, #<RDF::Query::Solution:0x3fdb32f70c68({:entity=>#<RDF::URI:0x3fdb32f6d34c(http://localhost/people/1)>})>]
  =>(merge s1 s2) #<RDF::Query::Solution:0x3fdb32f7072c({:entity=>#<RDF::URI:0x3fdb32f6e558(http://localhost/people/1)>})>
  => [#<RDF::Query::Solution:0x3fdb32f7072c({:entity=>#<RDF::URI:0x3fdb32f6e558(http://localhost/people/1)>})>]

ekolvets commented 11 years ago

In the third call into #query, pattern[:subject] is an RDF::Query::Variable and not the value bound in the first statement.

gkellogg commented 11 years ago

Okay, I looked into this a bit more. The right side is indeed evaluated without binding variables from the left; however, the returned solutions are checked for compatibility when they are merged. That is, two solutions are compatible iff the elements in their intersection are eql?. For variables, this includes their bindings.
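
To make the compatibility check concrete, here is a minimal sketch in plain Ruby. It assumes solutions are modelled as plain Hashes from variable name to value, which is a simplification for illustration; the library itself uses RDF::Query::Solution, and the helper names compatible? and merge below are hypothetical, not the library's API.

```ruby
# Assumption: a solution is a plain Hash of variable name => bound value.
# This only illustrates the idea; it is not the library's implementation.

# Two solutions are compatible when every variable they share
# is bound to an equal (eql?) value in both.
def compatible?(s1, s2)
  (s1.keys & s2.keys).all? { |var| s1[var].eql?(s2[var]) }
end

# Merging combines the bindings of two compatible solutions.
def merge(s1, s2)
  s1.merge(s2)
end

left    = { entity: 'http://localhost/people/1' }
right_a = { entity: 'http://localhost/people/2' }
right_b = { entity: 'http://localhost/people/1' }

puts compatible?(left, right_a) # false
puts compatible?(left, right_b) # true
```

With the data from this issue, the /people/2 solution from the right side is dropped at merge time, which is why the final result is still correct even though the right-hand pattern was queried with an unbound ?entity.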

Of course, it would be more efficient to bind these before the RHS query is performed, but that is an optimization step. There is a query optimization interface, but it is unimplemented. In this case, it would require re-executing the RHS for each solution returned from the LHS where variables intersect. The current implementation is essentially a straightforward implementation of the documented algebra.

{ merge(μ1, μ2) | μ1 in Ω1 and μ2 in Ω2, and μ1 and μ2 are compatible and expr(merge(μ1, μ2)) is true }
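
The set expression above is the join part of the algebra's LeftJoin; in the SPARQL algebra, OPTIONAL additionally keeps any left-hand solution that has no compatible match on the right. A rough sketch of that evaluation follows, again assuming solutions are plain Hashes and that there is no filter expression (so expr is always true). The names compatible? and left_join are hypothetical helpers for illustration only.

```ruby
# Assumption: solutions are plain Hashes; no FILTER inside the OPTIONAL,
# so expr(merge(s1, s2)) is treated as always true.

def compatible?(s1, s2)
  (s1.keys & s2.keys).all? { |var| s1[var].eql?(s2[var]) }
end

# For each left solution, emit its merge with every compatible right
# solution, or the left solution unchanged when nothing on the right
# is compatible with it.
def left_join(left, right)
  left.flat_map do |s1|
    matches = right.select { |s2| compatible?(s1, s2) }
    matches.empty? ? [s1] : matches.map { |s2| s1.merge(s2) }
  end
end

left  = [{ entity: 'http://localhost/people/1' }]
right = [{ entity: 'http://localhost/people/2' },
         { entity: 'http://localhost/people/1' }]

p left_join(left, right)
```

Both sides are evaluated independently here, which mirrors the (left) and (right) lists in the debug output: the right side contains /people/2 and /people/1, and only the compatible one survives the merge.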

So, in short, it may not be optimal, but I believe that it is correct. If you have a failing use case, I'll look at it further.
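
For comparison, the optimization discussed above (binding variables before the RHS query is performed) could look roughly like the sketch below. It is not part of the library: patterns are assumed to be plain Hashes with :subject, :predicate and :object keys, variables are represented as Symbols, and bind_pattern is a hypothetical helper. It relies on Hash#transform_values, so Ruby 2.4 or later is assumed.

```ruby
# Assumption: a pattern is a Hash of :subject/:predicate/:object, and a
# Symbol in any position stands for a variable. Not the library's API.

# Replace each variable that the solution binds with its bound value;
# unbound variables and constant terms are left as they are.
def bind_pattern(pattern, solution)
  pattern.transform_values do |term|
    term.is_a?(Symbol) && solution.key?(term) ? solution[term] : term
  end
end

optional_pattern = {
  subject:   :entity,
  predicate: 'http://localhost/attribute_types/middle_name',
  object:    'blah'
}

left_solutions = [{ entity: 'http://localhost/people/1' }]

# One bound pattern per left-hand solution; each would be sent to the
# repository as its own query.
bound_patterns = left_solutions.map { |s| bind_pattern(optional_pattern, s) }

puts bound_patterns.first[:subject]
```

The trade-off is the one described above: the backend receives a restricted pattern, but the right-hand side has to be re-executed once per left-hand solution.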