ruby-rdf / sparql

Ruby SPARQL library
http://rubygems.org/gems/sparql
The Unlicense

VALUES position looks wrong #41

Closed manabuishii closed 2 years ago

manabuishii commented 2 years ago

Hello,

I use rdf/sparql.

When I call to_sparql(), the VALUES position looks wrong.

3.2.1 (release version): VALUES position looks wrong.

3.2.1 (git 2216ee3f20ca55db17a56ec6584b53aac9fe8b04): VALUES position looks wrong.

3.2.0 (exactly git 2b484da3affdaca6e867a15b491927d64abdc2f9): VALUES position looks fine 👍

I tested with this SPARQL query:

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX taxon: <http://identifiers.org/taxonomy/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT DISTINCT ?parent ?child ?child_label
FROM <http://rdf.integbio.jp/dataset/togosite/ensembl>
WHERE {
  ?enst obo:SO_transcribed_from ?ensg .
  ?ensg a ?parent ;
        obo:RO_0002162 taxon:9606 ;
        faldo:location ?ensg_location ;
        dc:identifier ?child ;
        rdfs:label ?child_label .
  FILTER(CONTAINS(STR(?parent), "terms/ensembl/"))
  BIND(STRBEFORE(STRAFTER(STR(?ensg_location), "GRCh38/"), ":") AS ?chromosome)
  VALUES ?chromosome {
      "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
      "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"
      "X" "Y" "MT"
  }
}

I tested with this code:

require "sparql"

# SPARQL
endpoint = "https://integbio.jp/togosite/sparql"
rq = <<'SPARQL'.chop
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX taxon: <http://identifiers.org/taxonomy/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT DISTINCT ?parent ?child ?child_label
FROM <http://rdf.integbio.jp/dataset/togosite/ensembl>
WHERE {
  ?enst obo:SO_transcribed_from ?ensg .
  ?ensg a ?parent ;
        obo:RO_0002162 taxon:9606 ;
        faldo:location ?ensg_location ;
        dc:identifier ?child ;
        rdfs:label ?child_label .
  FILTER(CONTAINS(STR(?parent), "terms/ensembl/"))
  BIND(STRBEFORE(STRAFTER(STR(?ensg_location), "GRCh38/"), ":") AS ?chromosome)
  VALUES ?chromosome {
      "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
      "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"
      "X" "Y" "MT"
  }
}
SPARQL

# Parse the query, then serialize the parsed algebra back to SPARQL
parsedobject = SPARQL.parse(rq)
rqfromparsedobject = parsedobject.to_sparql

puts rqfromparsedobject

With 3.2.1, the output looks wrong, like this: the WHERE clause is closed before VALUES. The only difference between release 3.2.1 and git 2216ee3f20ca55db17a56ec6584b53aac9fe8b04 (current latest) is a few spaces and a blank line.

Release 3.2.1

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX taxon: <http://identifiers.org/taxonomy/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT DISTINCT ?parent ?child ?child_label
FROM <http://rdf.integbio.jp/dataset/togosite/ensembl>
WHERE {
{
{?enst obo:SO\_transcribed\_from ?ensg. 
?ensg a ?parent. 
?ensg obo:RO\_0002162 taxon:9606. 
?ensg faldo:location ?ensg_location. 
?ensg dc:identifier ?child. 
?ensg rdfs:label ?child_label
BIND (STRBEFORE(STRAFTER(str(?ensg_location), "GRCh38/"), ":") AS ?chromosome) .}
FILTER (contains(str(?parent), "terms/ensembl/")) .
}
}
VALUES (?chromosome) {
("1")
("2")
("3")
("4")
("5")
("6")
("7")
("8")
("9")
("10")
("11")
("12")
("13")
("14")
("15")
("16")
("17")
("18")
("19")
("20")
("21")
("22")
("X")
("Y")
("MT")
}

git 2216ee3f20ca55db17a56ec6584b53aac9fe8b04 (current latest)

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX taxon: <http://identifiers.org/taxonomy/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT DISTINCT ?parent ?child ?child_label
FROM <http://rdf.integbio.jp/dataset/togosite/ensembl>
WHERE {
{
{?enst obo:SO\_transcribed\_from ?ensg . 
?ensg a ?parent . 
?ensg obo:RO\_0002162 taxon:9606 . 
?ensg faldo:location ?ensg_location . 
?ensg dc:identifier ?child . 
?ensg rdfs:label ?child_label . 

BIND (STRBEFORE(STRAFTER(str(?ensg_location), "GRCh38/"), ":") AS ?chromosome) .}
FILTER (contains(str(?parent), "terms/ensembl/")) .
}
}
VALUES (?chromosome) {
("1")
("2")
("3")
("4")
("5")
("6")
("7")
("8")
("9")
("10")
("11")
("12")
("13")
("14")
("15")
("16")
("17")
("18")
("19")
("20")
("21")
("22")
("X")
("Y")
("MT")
}

With 3.2.0, the output looks fine, like this: the WHERE clause is closed after VALUES, so VALUES is inside it.

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX taxon: <http://identifiers.org/taxonomy/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT DISTINCT ?parent ?child ?child_label
FROM <http://rdf.integbio.jp/dataset/togosite/ensembl>
WHERE {
{
{?enst obo:SO\_transcribed\_from ?ensg .
?ensg a ?parent .
?ensg obo:RO\_0002162 taxon:9606 .
?ensg faldo:location ?ensg_location .
?ensg dc:identifier ?child .
?ensg rdfs:label ?child_label .
BIND (STRBEFORE(STRAFTER(str(?ensg_location), "GRCh38/"), ":") AS ?chromosome) .}
FILTER (contains(str(?parent), "terms/ensembl/")) .
VALUES (?chromosome) {
("1")
("2")
("3")
("4")
("5")
("6")
("7")
("8")
("9")
("10")
("11")
("12")
("13")
("14")
("15")
("16")
("17")
("18")
("19")
("20")
("21")
("22")
("X")
("Y")
("MT")
}

}
}
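
The difference between the outputs above comes down to the brace depth at which VALUES appears. A minimal sketch of a whitespace-insensitive check, using only plain Ruby: `values_position` is a hypothetical helper (not part of the sparql gem), and the two short queries are made-up stand-ins for the full outputs.

```ruby
# Hypothetical helper, not part of the sparql gem.
# Returns :inside if the first VALUES keyword appears within open braces,
# :after if it appears once all braces have closed, or nil if there is none.
# This naive scanner ignores braces or the word VALUES inside string
# literals and IRIs, and would also match a variable named ?values.
def values_position(query)
  depth = 0
  query.scan(/[{}]|\bVALUES\b/i) do |token|
    case token
    when "{" then depth += 1
    when "}" then depth -= 1
    else
      # The token is the VALUES keyword; report where it sits.
      return depth.positive? ? :inside : :after
    end
  end
  nil
end

# Made-up, shortened queries mirroring the two serializations.
inline = <<~SPARQL
  SELECT ?s WHERE {
    ?s ?p ?o .
    VALUES (?o) { ("1") ("X") }
  }
SPARQL

trailing = <<~SPARQL
  SELECT ?s WHERE {
    ?s ?p ?o .
  }
  VALUES (?o) { ("1") ("X") }
SPARQL

puts values_position(inline)   # prints "inside"
puts values_position(trailing) # prints "after"
```

Because it only counts braces, the check gives the same answer for outputs that differ in spaces or blank lines.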
manabuishii commented 2 years ago

Additional information:

I tested with git bisect.

I think fa38ee0 is the first commit that changed the VALUES position.

git bisect start 2216ee3f20ca55db17a56ec6584b53aac9fe8b04 2b484da3affdaca6e867a15b491927d64abdc2f9
git bisect run bundle exec ruby values_position_invalid.rb

values_position_invalid.rb

require_relative "lib/sparql"
require 'digest/md5'

# SPARQL
endpoint = "https://integbio.jp/togosite/sparql"
rq = <<'SPARQL'.chop
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX taxon: <http://identifiers.org/taxonomy/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT DISTINCT ?parent ?child ?child_label
FROM <http://rdf.integbio.jp/dataset/togosite/ensembl>
WHERE {
  ?enst obo:SO_transcribed_from ?ensg .
  ?ensg a ?parent ;
        obo:RO_0002162 taxon:9606 ;
        faldo:location ?ensg_location ;
        dc:identifier ?child ;
        rdfs:label ?child_label .
  FILTER(CONTAINS(STR(?parent), "terms/ensembl/"))
  BIND(STRBEFORE(STRAFTER(STR(?ensg_location), "GRCh38/"), ":") AS ?chromosome)
  VALUES ?chromosome {
      "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
      "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"
      "X" "Y" "MT"
  }
}
SPARQL

# Parse the query, then serialize the parsed algebra back to SPARQL
parsedobject = SPARQL.parse(rq)
rqfromparsedobject = parsedobject.to_sparql

# puts rqfromparsedobject
md5 = Digest::MD5.hexdigest(rqfromparsedobject)
# The f02f... and 34e2... outputs are almost the same; they differ only in whitespace.
# git bisect run treats exit status 0 as good and 1 as bad.
if md5 == "f02f4316024b5383c70ac5e5e62efbca" || md5 == "34e2637fc29d3efe37c077390b850466"
  exit 0
else
  exit 1
end
$ git bisect start 2216ee3f20ca55db17a56ec6584b53aac9fe8b04 2b484da3affdaca6e867a15b491927d64abdc2f9                 
Previous HEAD position was 6ddab97 Fix some escaping use cases in to_sparql with RDF and SXP gem updates.
Switched to branch 'develop'
Your branch is up to date with 'origin/develop'.
Bisecting: 9 revisions left to test after this (roughly 3 steps)
[80334394ec41ae170efcd71c17c7f4ca9eecf211] Give up on complex sub-select with slice use case round-trip.

Run:

$ git bisect run bundle exec ruby values_position_invalid.rb
running bundle exec ruby values_position_invalid.rb
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[c98ecf7614950a91a1e9b50efcc42a4222c531da] More sub-select use cases.
running bundle exec ruby values_position_invalid.rb
Bisecting: 1 revision left to test after this (roughly 1 step)
[6ddab97d9c86813c40ef882079f1108ba1fe2e1b] Fix some escaping use cases in to_sparql with RDF and SXP gem updates.
running bundle exec ruby values_position_invalid.rb
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[fa38ee0e5c425a0562dca528cc8889e1634e9542] Serialize VALUES at the top-level if the join is at the top-level.
running bundle exec ruby values_position_invalid.rb
fa38ee0e5c425a0562dca528cc8889e1634e9542 is the first bad commit
commit fa38ee0e5c425a0562dca528cc8889e1634e9542
Author: Gregg Kellogg <gregg@greggkellogg.net>
Date:   Mon Jan 17 14:45:15 2022 -0800

    Serialize VALUES at the top-level if the join is at the top-level.

 lib/sparql/algebra/operator.rb       |  6 +++
 lib/sparql/algebra/operator/join.rb  |  8 +++-
 lib/sparql/algebra/operator/table.rb |  8 +++-
 spec/algebra/to_sparql_spec.rb       | 81 ++++++++++++++++++------------------
 spec/suite_spec.rb                   | 19 ++++-----
 5 files changed, 69 insertions(+), 53 deletions(-)
bisect run success
gkellogg commented 2 years ago

Although VALUES can appear within a query (where it acts like a join), the grammar shows it at the same level as ORDER BY, GROUP BY, and similar clauses, after the WHERE clause. In fact, it is pretty much impossible to tell from the way the query is parsed into SSE which way it was originally expressed. Generally, the test cases use VALUES after the WHERE clause. It certainly shouldn’t affect the query operation.
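
To illustrate why the position should not change the results for a query like this one (no aggregates or sub-selects), here is a minimal sketch in plain Ruby that treats VALUES as a join. It does not use the sparql gem: `compatible?` and `join` are hypothetical helpers, and the solution rows are invented sample data.

```ruby
# Hypothetical helpers modelling a SPARQL join over solution mappings.
# Two solutions are compatible when every shared variable has the same value.
def compatible?(a, b)
  (a.keys & b.keys).all? { |var| a[var] == b[var] }
end

# Join: merge every compatible pair of solutions.
def join(left, right)
  left.product(right)
      .select { |a, b| compatible?(a, b) }
      .map { |a, b| a.merge(b) }
end

# Invented solutions standing in for the basic graph pattern plus BIND.
pattern = [
  { child: "ENSG01", parent: "http://example.org/terms/ensembl/gene", chromosome: "1" },
  { child: "ENSG02", parent: "http://example.org/other/gene",         chromosome: "X" },
  { child: "ENSG03", parent: "http://example.org/terms/ensembl/gene", chromosome: "CHR_PATCH" }
]

# The VALUES block, as a table of one-variable solutions.
values = [{ chromosome: "1" }, { chromosome: "X" }, { chromosome: "MT" }]

# Stand-in for FILTER(CONTAINS(STR(?parent), "terms/ensembl/")).
filter = ->(solution) { solution[:parent].include?("terms/ensembl/") }

# VALUES inside the group: join first, then the group's FILTER applies.
inline_result = join(pattern, values).select(&filter)

# VALUES after WHERE: the filtered group is joined with the table afterwards.
trailing_result = join(pattern.select(&filter), values)

puts inline_result == trailing_result # prints "true"
```

The filter only looks at ?parent, which the VALUES table does not bind, so filtering before or after the join keeps the same rows. Queries with aggregates or nested scopes may behave differently, and this sketch does not cover them.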