openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com
Other
870 stars 210 forks source link

Some SPARQL queries are not evaluated correctly if the xsd:string datatype is present #1064

Open galgonek opened 2 years ago

galgonek commented 2 years ago

I use Linked Data Views and I observed that some SPARQL queries are not evaluated correctly if they contain string literals with the xsd:string datatype. Without this datatype (i.e., if simple literals are used), the queries work as expected.

For example, the query —

select ?entry where {
  ?entry :gene / :chromosome "13"^^xsd:string.
  ?entry :isoform / :medical / rdf:type :Disease.
}

— returns the empty result. However, the query —

select ?entry where {
  ?entry :gene / :chromosome "13".
  ?entry :isoform / :medical / rdf:type :Disease.
}

— returns results as expected.

On the other hand, the query —

select ?entry where {
  ?entry :gene / :chromosome "13"^^xsd:string.
  ?entry :isoform / :medical / rdf:type ?type.
}

— works correctly even if the xsd:string is present.

I tried to identify a source of the problem and I found that SQL queries generated by Linked Data Views are correct, but they are incorrectly evaluated in some scenarios. When I tested them on a smaller database, the SQL queries works correctly as their execution plan were different in this case.

Through it all, I was able to prepare the following scenario to demonstrate the issue:

In the database —

create table entry_genes
(
    entry varchar not null,
    gene varchar not null,
    primary key(entry, gene)
);

create table gene_bases
(
    id varchar not null,
    chromosome varchar not null,
    primary key(id)
);

create table isoform_bases
(
    id varchar not null,
    entry varchar not null,
    primary key(id)
);

create table isoform_medicals
(
    isoform varchar not null,
    annotation varchar not null,
    primary key(isoform, annotation)
);

create table annotation_bases
(
    id varchar not null,
    type varchar,
    primary key(id)
);

insert into entry_genes(entry, gene) values ('NX_O00287','ENSG00000133111');
insert into gene_bases(id, chromosome) values ('ENSG00000133111','13');
insert into isoform_bases(id, entry) values ('NX_O00287-1','NX_O00287');
insert into isoform_medicals(isoform, annotation) values ('NX_O00287-1','NX_O00287-1-AN_O00287_000831');
insert into annotation_bases(id, type) values ('NX_O00287-1-AN_O00287_000831','Disease');

— the SQL query —

SELECT __spfi ( 'http://nextprot.org/rdf/entry/%U' ,  "s_11_10_t2"."27b61990~entry") AS "entry"
FROM (SELECT  "s_11_10_t0-int~entry_genes"."entry" AS "606f5d07~entry",  "s_11_10_t0-int~entry_genes"."gene" AS "3019f33b~gene" FROM entry_genes AS "s_11_10_t0-int~entry_genes" TABLE OPTION (hash)) AS "s_11_10_t0"
  INNER JOIN (SELECT  "s_11_10_t1-int~gene_bases"."id" AS "4d5ae72f~id",  "s_11_10_t1-int~gene_bases"."chromosome" AS "f551aa~chromosome" FROM gene_bases AS "s_11_10_t1-int~gene_bases" TABLE OPTION (hash)
     WHERE 
    "s_11_10_t1-int~gene_bases"."chromosome" = (case (coalesce ( UNAME'http://www.w3.org/2001/XMLSchema#string' , 'http://www.w3.org/2001/XMLSchema#string')) when 'http://www.w3.org/2001/XMLSchema#string' then  '13'  else NULL end)) AS "s_11_10_t1"
  ON (
    "s_11_10_t1"."4d5ae72f~id" = "s_11_10_t0"."3019f33b~gene")
  INNER JOIN (SELECT  "s_11_10_t2-int~isoform_bases"."entry" AS "27b61990~entry",  "s_11_10_t2-int~isoform_bases"."id" AS "4de0bdca~id" FROM isoform_bases AS "s_11_10_t2-int~isoform_bases" TABLE OPTION (hash)) AS "s_11_10_t2"
  ON (
    "s_11_10_t2"."27b61990~entry" = "s_11_10_t0"."606f5d07~entry")
  INNER JOIN (SELECT  "s_11_10_t3-int~isoform_medicals"."isoform" AS "26c1a83b~isoform",  "s_11_10_t3-int~isoform_medicals"."annotation" AS "39bb391c~annotation" FROM isoform_medicals AS "s_11_10_t3-int~isoform_medicals" TABLE OPTION (hash)) AS "s_11_10_t3"
  ON (
    "s_11_10_t3"."26c1a83b~isoform" = "s_11_10_t2"."4de0bdca~id")
  INNER JOIN (SELECT  "s_11_10_t4-qm0-c5-int~annotation_bases"."id" AS "9eeb23~id",  "s_11_10_t4-qm0-c5-int~annotation_bases"."type" AS "2879857f~type" FROM annotation_bases AS "s_11_10_t4-qm0-c5-int~annotation_bases" TABLE OPTION (hash)
     WHERE 
   "s_11_10_t4-qm0-c5-int~annotation_bases".type is not null
    AND 
    "s_11_10_t4-qm0-c5-int~annotation_bases"."type" = 'Disease' ) AS "s_11_10_t4-qm0-c5"
  ON (
    "s_11_10_t4-qm0-c5"."9eeb23~id" = "s_11_10_t3"."39bb391c~annotation")
OPTION (QUIETCAST, ORDER);

— returns the empty result. However, the result —

http://nextprot.org/rdf/entry/NX_O00287

— is expected.

If the ORDER option is removed from the end of the SQL query, the correct result is returned!

HughWilliams commented 2 years ago

When I load your tables and generate default RDF Views from the tables all the SQL queries return no data:

SQL> sparql prefix : <http://hfw.openlinksw.com:8890/schemas/git1064/> select ?entry where {   ?entry :gene / :chromosome "13"^^xsd:string.   ?entry :isoform / :medical / rdf:type :Disease. };
entry
LONG VARCHAR
_______________________________________________________________________________

0 Rows. -- 133 msec.
SQL>  sparql prefix : <http://hfw.openlinksw.com:8890/schemas/git1064/> select ?entry where {   ?entry :gene / :chromosome "13".   ?entry :isoform / :medical / rdf:type :Disease. };
entry
LONG VARCHAR
_______________________________________________________________________________

0 Rows. -- 162 msec.
SQL> sparql prefix : <http://hfw.openlinksw.com:8890/schemas/git1064/> select ?entry where {   ?entry :gene / :chromosome "13"^^xsd:string.   ?entry :isoform / :medical / rdf:type ?type. };
entry
LONG VARCHAR
_______________________________________________________________________________

0 Rows. -- 118 msec.
SQL> 

Thus how are you creating your RDF Views ?

What is the version number and gitid of the binary being used, which can be obtained from the output of running virtuoso-t -?

galgonek commented 2 years ago

I do not use the automated generation of Linked Data Views, I have created Linked Data Views manually. However, my database has more than 100 GB, and it is not easy to prepare a small example to demonstrate the issue because the issue depends on a selected execution plan that is typically different for small databases.

But I have observed that the issue is related to the SQL query evaluation. Thus I have prepare the SQL query to demonstrate the issue. Please try the SQL commands mentioned after the sentence "I was able to prepare the following scenario to demonstrate the issue", these commands demonstrate the incorrectly evaluated SQL query.

My version of Virtuoso is 7.2.7.3234-pthreads.

TallTed commented 2 years ago

@galgonek -- Please execute the command virtuoso-t -? and provide the full first stanza of output, which includes both the simple version you've provided and the exact gitid of the codebase from which it was built (as there may be dozens if not hundreds of commits made to the /develop/7 branch between, for instance, 7.2.7.3234 and 7.2.7.3235). You can also get the gitid through a SPARQL query. Knowing this gitid allows us to test backward in the codebase, to see whether and where we may have introduced the issue, as well as forward to test our patch.

galgonek commented 2 years ago

As I just checked, the issue is present in both stable and development versions:

Version 7.2.7.3234-pthreads as of Aug 5 2022 (031118cc2)

Version 7.2.8-dev.3234-pthreads as of Aug 5 2022 (4b1e6fa24)

TallTed commented 2 years ago

Thank you! Those details will be helpful to our ongoing testing and analysis.