openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com
Other
853 stars 211 forks source link

`count()` malfunctioning #1053

Open ebremer opened 2 years ago

ebremer commented 2 years ago

I have multiple named graphs loaded into the latest develop Virtuoso. If I do the following query:

prefix : <http://dicom.nema.org/medical/dicom/ns#>
prefix map: <https://map.com/source.tsv/ns/>
select ?MRN where {
graph <https://server.com/a> {?k map:mrn ?MRN} 
graph <https://server.com/b> {?DICOM :00100020 ?MRN; :00080050 ?Accession}
}  limit 100

it will yield plenty of results. just adding count() to the select:

prefix : <http://dicom.nema.org/medical/dicom/ns#>
prefix map: <https://map.com/source.tsv/ns/>
select count(?MRN) where {
graph <https://server.com/a> {?k map:mrn ?MRN} 
graph <https://server.com/b> {?DICOM :00100020 ?MRN; :00080050 ?Accession}
}  limit 100

and i get nothing. Something is not right here.

ebremer commented 2 years ago

Weird, if I drop ":00080050 ?Accession" modifying the query to:

prefix : <http://dicom.nema.org/medical/dicom/ns#>
prefix map: <https://map.com/source.tsv/ns/>
select count(?MRN) where {
graph <https://server.com/a> {?k map:mrn ?MRN} 
graph <https://server.com/b> {?DICOM :00100020 ?MRN}
}  limit 100

I will get a count. This still isn't making much sense.

pkleef commented 2 years ago

@ebremer Is this a publicly accessible dataset, or do you have a public endpoint i can use for a quick experiment?

ebremer commented 2 years ago

Unfortunately, I'm unable to share. Very un-public.

pkleef commented 2 years ago

No problem.

Please follow the steps from the following document:

https://community.openlinksw.com/t/generate-sparql-query-compilation-report-from-a-virtuoso-sparql-endpoint/1046

save the output from the sparql endpoint, check to make sure there is nothing sensitive in the output and then email it to me pkleef@openlinksw.com if you are comfortable.

ebremer commented 2 years ago

@pkleef report has been sent to you.

pkleef commented 2 years ago

@ebremer i received your report. I will discuss the matter within development and let you know the outcome.

ebremer commented 2 years ago

@pkleef any update on this?

ebremer commented 2 years ago

This problem still persists. I have another query which involves a query in the form: select distinct ?DICOM where {complex pattern) and it will yield, say 1.4 million solutions, but adding a count: select count(distinct ?DICOM) where {complex pattern) and it will yield 2.3 millions values. Current running version

virtuoso-t -?
Virtuoso Open Source Edition (Column Store) (multi threaded)
Version 7.2.8-dev.3234-pthreads as of Aug 15 2022 (12b3357a6)
Compiled for Linux (x86_64-pc-linux-gnu)

compilation report for failing query:


Virtuoso SPARQL Compilation Report

Original SPARQL query

The SPARQL query as it is passed by web page to the SPARQL compiler:

/*d9724721426f7595338b388e43473299*/
sparql {
define sql:big-data-const 0
#output-format:text/html
define sql:signal-void-variables 1
prefix : <http://dicom.nema.org/medical/dicom/ns#>
prefix epic: <https://stonybrookmedicine.edu/epic/ns/>
prefix loc: <http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/>
prefix dl: <postgresql://bmi-clinical-analytics-p1.uhmc.sunysb.edu:5432/bmi_clinical/sbm_covid19_pacs/ns/covid_positive_visits_with_basic_metadata/>
select count(distinct ?DICOM)
where {
    graph <https://bmi.stonybrookmedicine.edu/radimages> { ?DICOM loc:md5 ?md5 ;  :00080050 ?Accession; :00100020 ?MRN }
    graph <https://stonybrookmedicine.edu/anonfiles> { ?DICOM epic:hasAnonVersion ?hasAnonVersion }
    graph <https://stonybrookmedicine.edu/radimages/anon> {
        ?DICOM  epic:AnonMRN ?AnonMRN;
                epic:DateShift ?DateShift;
                epic:AnonStudyInstanceUID ?AnonStudyInstanceUID;
                epic:AnonSeriesInstanceUID ?AnonSeriesInstanceUID
    }
}
}

Optimized SPARQL query

The SPARQL query after parsing, optimization and converting back into SPARQL

SELECT ( COUNT( DISTINCT ?DICOM))
 WHERE { 
     GRAPH <https://bmi.stonybrookmedicine.edu/radimages> { ?DICOM <http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/md5> ?md5 ;
             <http://dicom.nema.org/medical/dicom/ns#00080050> ?Accession ;
             <http://dicom.nema.org/medical/dicom/ns#00100020> ?MRN . }
     GRAPH <https://stonybrookmedicine.edu/anonfiles> { ?DICOM <https://stonybrookmedicine.edu/epic/ns/hasAnonVersion> ?hasAnonVersion . }
     GRAPH <https://stonybrookmedicine.edu/radimages/anon> { ?DICOM <https://stonybrookmedicine.edu/epic/ns/AnonMRN> ?AnonMRN ;
             <https://stonybrookmedicine.edu/epic/ns/DateShift> ?DateShift ;
             <https://stonybrookmedicine.edu/epic/ns/AnonStudyInstanceUID> ?AnonStudyInstanceUID ;
             <https://stonybrookmedicine.edu/epic/ns/AnonSeriesInstanceUID> ?AnonSeriesInstanceUID . } }

SPARQL query translated to SQL

For security reasons, code responsible for graph-level security is not generated and some account-specific data are intentionally made wrong.

SELECT  COUNT ( DISTINCT
     "s_18_10_t7"."S") AS "callret-0"
FROM DB.DBA.RDF_QUAD AS "s_11_3_t0"
  INNER JOIN DB.DBA.RDF_QUAD AS "s_11_3_t1"
  ON (
    "s_11_3_t1"."S" = "s_11_3_t0"."S")
  INNER JOIN DB.DBA.RDF_QUAD AS "s_11_3_t2"
  ON (
    "s_11_3_t2"."S" = "s_11_3_t1"."S"
    AND 
    "s_11_3_t2"."S" = "s_11_3_t0"."S")
  INNER JOIN DB.DBA.RDF_QUAD AS "s_12_5_t3"
  ON (
    "s_12_5_t3"."S" = "s_11_3_t2"."S"
    AND 
    "s_12_5_t3"."S" = "s_11_3_t1"."S"
    AND 
    "s_12_5_t3"."S" = "s_11_3_t0"."S")
  INNER JOIN DB.DBA.RDF_QUAD AS "s_18_10_t4"
  ON (
    "s_18_10_t4"."S" = "s_12_5_t3"."S"
    AND 
    "s_18_10_t4"."S" = "s_11_3_t2"."S"
    AND 
    "s_18_10_t4"."S" = "s_11_3_t1"."S"
    AND 
    "s_18_10_t4"."S" = "s_11_3_t0"."S")
  INNER JOIN DB.DBA.RDF_QUAD AS "s_18_10_t5"
  ON (
    "s_18_10_t5"."S" = "s_18_10_t4"."S"
    AND 
    "s_18_10_t5"."S" = "s_12_5_t3"."S"
    AND 
    "s_18_10_t5"."S" = "s_11_3_t2"."S"
    AND 
    "s_18_10_t5"."S" = "s_11_3_t1"."S"
    AND 
    "s_18_10_t5"."S" = "s_11_3_t0"."S")
  INNER JOIN DB.DBA.RDF_QUAD AS "s_18_10_t6"
  ON (
    "s_18_10_t6"."S" = "s_18_10_t5"."S"
    AND 
    "s_18_10_t6"."S" = "s_18_10_t4"."S"
    AND 
    "s_18_10_t6"."S" = "s_12_5_t3"."S"
    AND 
    "s_18_10_t6"."S" = "s_11_3_t2"."S"
    AND 
    "s_18_10_t6"."S" = "s_11_3_t1"."S"
    AND 
    "s_18_10_t6"."S" = "s_11_3_t0"."S")
  INNER JOIN DB.DBA.RDF_QUAD AS "s_18_10_t7"
  ON (
    "s_18_10_t7"."S" = "s_18_10_t6"."S"
    AND 
    "s_18_10_t7"."S" = "s_18_10_t5"."S"
    AND 
    "s_18_10_t7"."S" = "s_18_10_t4"."S"
    AND 
    "s_18_10_t7"."S" = "s_12_5_t3"."S"
    AND 
    "s_18_10_t7"."S" = "s_11_3_t2"."S"
    AND 
    "s_18_10_t7"."S" = "s_11_3_t1"."S"
    AND 
    "s_18_10_t7"."S" = "s_11_3_t0"."S")
WHERE
  "s_11_3_t0"."G" = __i2idn ( __bft( 'https://bmi.stonybrookmedicine.edu/radimages' , 1))
  AND 
  "s_11_3_t0"."P" = __i2idn ( __bft( 'http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/md5' , 1))
  AND 
  "s_11_3_t1"."G" = __i2idn ( __bft( 'https://bmi.stonybrookmedicine.edu/radimages' , 1))
  AND 
  "s_11_3_t1"."P" = __i2idn ( __bft( 'http://dicom.nema.org/medical/dicom/ns#00080050' , 1))
  AND 
  "s_11_3_t2"."G" = __i2idn ( __bft( 'https://bmi.stonybrookmedicine.edu/radimages' , 1))
  AND 
  "s_11_3_t2"."P" = __i2idn ( __bft( 'http://dicom.nema.org/medical/dicom/ns#00100020' , 1))
  AND 
  "s_12_5_t3"."G" = __i2idn ( __bft( 'https://stonybrookmedicine.edu/anonfiles' , 1))
  AND 
  "s_12_5_t3"."P" = __i2idn ( __bft( 'https://stonybrookmedicine.edu/epic/ns/hasAnonVersion' , 1))
  AND 
  "s_18_10_t4"."G" = __i2idn ( __bft( 'https://stonybrookmedicine.edu/radimages/anon' , 1))
  AND 
  "s_18_10_t4"."P" = __i2idn ( __bft( 'https://stonybrookmedicine.edu/epic/ns/AnonMRN' , 1))
  AND 
  "s_18_10_t5"."G" = __i2idn ( __bft( 'https://stonybrookmedicine.edu/radimages/anon' , 1))
  AND 
  "s_18_10_t5"."P" = __i2idn ( __bft( 'https://stonybrookmedicine.edu/epic/ns/DateShift' , 1))
  AND 
  "s_18_10_t6"."G" = __i2idn ( __bft( 'https://stonybrookmedicine.edu/radimages/anon' , 1))
  AND 
  "s_18_10_t6"."P" = __i2idn ( __bft( 'https://stonybrookmedicine.edu/epic/ns/AnonStudyInstanceUID' , 1))
  AND 
  "s_18_10_t7"."G" = __i2idn ( __bft( 'https://stonybrookmedicine.edu/radimages/anon' , 1))
  AND 
  "s_18_10_t7"."P" = __i2idn ( __bft( 'https://stonybrookmedicine.edu/epic/ns/AnonSeriesInstanceUID' , 1))
OPTION (QUIETCAST)

SQL execution plan

  { 
    fork {
        RDF_QUAD   2.4e+06 rows(s_12_5_t3.S$28)
         inlined  P =  IRI_ID"...hasAnonVersion"  G =  IRI_ID"...anonfiles" 
        RDF_QUAD         1 rows(s_11_3_t2.S$31)
         inlined  P =  IRI_ID"...00100020"  ,  S = k_s_12_5_t3.S$81 G =  IRI_ID"...radimages" 
        RDF_QUAD         1 rows(s_11_3_t1.S$34)
         inlined  P =  IRI_ID"...00080050"  ,  S = k_s_11_3_t2.S$91 G =  IRI_ID"...radimages" 
        RDF_QUAD         1 rows(s_11_3_t0.S$37)
         inlined  P =  IRI_ID"...cryptographicHashFunctions/md5"  ,  S = k_s_11_3_t1.S$101 G =  IRI_ID"...radimages" 
        RDF_QUAD         1 rows(s_18_10_t4.S$40)
         inlined  P =  IRI_ID"...ns/AnonMRN"  ,  S = k_s_12_5_t3.S$111 G =  IRI_ID"...radimages/anon" 
        RDF_QUAD         1 rows(s_18_10_t7.S$43)
         inlined  P =  IRI_ID"...AnonSeriesInstanceUID"  ,  S = k_s_18_10_t4.S$121 G =  IRI_ID"...radimages/anon" 
        RDF_QUAD         1 rows(s_18_10_t6.S$46)
         inlined  P =  IRI_ID"...AnonStudyInstanceUID"  ,  S = k_s_18_10_t4.S$131 G =  IRI_ID"...radimages/anon" 
        RDF_QUAD         1 rows()
         inlined  P =  IRI_ID"...DateShift"  ,  S = k_s_18_10_t4.S$141 G =  IRI_ID"...radimages/anon" 

        After code:
        0:  count callret-0$53s_18_10_t7.S$43set no set_ctr$50distinct via hash area DISTINCT HASH$57
        5: BReturn 0
      }
    Select (callret-0$53)
  }

Internal optimizer data

These data are primarily for OpenLink support, to get additional details about the query processing.

    QUERY: (line 19) REQUEST TOP NODE (SELECT result-mode):
  RETVALS: ARRAY OF NODES with 1 children: {
    (line 8) FUNCALL:
      FUNCTION NAME: UNAME `SPECIAL::bif:COUNT'
      AGGREGATE MODE: LONG 324
      ARGUMENT: (line 8) VARIABLE: notNULL reference exported
        NAME: UNAME `DICOM' SELECT ID: STRING `s_19_11' TABLE ID: LONG 0    EQUIV: LONG 11
   }
  RETVALS SELECT ID: STRING `s_19_11'
  SOURCES: EMPTY ARRAY
  PATTERN: (line 19) GRAPH PATTERN:WHERE gp
    MEMBERS: ARRAY OF NODES with 8 children: {
      (line 10) TRIPLE:
        GRAPH: (line 10) QNAME:
          IRI: UNAME `https://bmi.stonybrookmedicine.edu/radimages'
        SUBJECT: (line 10) VARIABLE: notNULL+ reference+ exported (subject)
          NAME: UNAME `DICOM'   SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_11_3_t0'    EQUIV: LONG 11
        PREDICATE: (line 10) QNAME:
          IRI: UNAME `http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/md5'
        OBJECT: (line 10) VARIABLE: notNULL+ (object)
          NAME: UNAME `md5' SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_11_3_t0'    EQUIV: LONG 19
        SELECT ID: STRING `s_19_11'
        TABLE ID: STRING `s_11_3_t0'
      (line 10) TRIPLE:
        GRAPH: (line 10) QNAME:
          IRI: UNAME `https://bmi.stonybrookmedicine.edu/radimages'
        SUBJECT: (line 10) VARIABLE: notNULL+ reference+ exported (subject)
          NAME: UNAME `DICOM'   SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_11_3_t1'    EQUIV: LONG 11
        PREDICATE: (line 10) QNAME:
          IRI: UNAME `http://dicom.nema.org/medical/dicom/ns#00080050'
        OBJECT: (line 10) VARIABLE: notNULL+ (object)
          NAME: UNAME `Accession'   SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_11_3_t1'    EQUIV: LONG 18
        SELECT ID: STRING `s_19_11'
        TABLE ID: STRING `s_11_3_t1'
      (line 10) TRIPLE:
        GRAPH: (line 10) QNAME:
          IRI: UNAME `https://bmi.stonybrookmedicine.edu/radimages'
        SUBJECT: (line 10) VARIABLE: notNULL+ reference+ exported (subject)
          NAME: UNAME `DICOM'   SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_11_3_t2'    EQUIV: LONG 11
        PREDICATE: (line 10) QNAME:
          IRI: UNAME `http://dicom.nema.org/medical/dicom/ns#00100020'
        OBJECT: (line 10) VARIABLE: notNULL+ (object)
          NAME: UNAME `MRN' SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_11_3_t2'    EQUIV: LONG 17
        SELECT ID: STRING `s_19_11'
        TABLE ID: STRING `s_11_3_t2'
      (line 11) TRIPLE:
        GRAPH: (line 11) QNAME:
          IRI: UNAME `https://stonybrookmedicine.edu/anonfiles'
        SUBJECT: (line 11) VARIABLE: notNULL+ reference+ exported (subject)
          NAME: UNAME `DICOM'   SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_12_5_t3'    EQUIV: LONG 11
        PREDICATE: (line 11) QNAME:
          IRI: UNAME `https://stonybrookmedicine.edu/epic/ns/hasAnonVersion'
        OBJECT: (line 11) VARIABLE: notNULL+ (object)
          NAME: UNAME `hasAnonVersion'  SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_12_5_t3'    EQUIV: LONG 16
        SELECT ID: STRING `s_19_11'
        TABLE ID: STRING `s_12_5_t3'
      (line 13) TRIPLE:
        GRAPH: (line 12) QNAME:
          IRI: UNAME `https://stonybrookmedicine.edu/radimages/anon'
        SUBJECT: (line 13) VARIABLE: notNULL+ reference+ exported (subject)
          NAME: UNAME `DICOM'   SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_18_10_t4'   EQUIV: LONG 11
        PREDICATE: (line 13) QNAME:
          IRI: UNAME `https://stonybrookmedicine.edu/epic/ns/AnonMRN'
        OBJECT: (line 13) VARIABLE: notNULL+ (object)
          NAME: UNAME `AnonMRN' SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_18_10_t4'   EQUIV: LONG 15
        SELECT ID: STRING `s_19_11'
        TABLE ID: STRING `s_18_10_t4'
      (line 14) TRIPLE:
        GRAPH: (line 12) QNAME:
          IRI: UNAME `https://stonybrookmedicine.edu/radimages/anon'
        SUBJECT: (line 13) VARIABLE: notNULL+ reference+ exported (subject)
          NAME: UNAME `DICOM'   SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_18_10_t5'   EQUIV: LONG 11
        PREDICATE: (line 14) QNAME:
          IRI: UNAME `https://stonybrookmedicine.edu/epic/ns/DateShift'
        OBJECT: (line 14) VARIABLE: notNULL+ (object)
          NAME: UNAME `DateShift'   SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_18_10_t5'   EQUIV: LONG 14
        SELECT ID: STRING `s_19_11'
        TABLE ID: STRING `s_18_10_t5'
      (line 15) TRIPLE:
        GRAPH: (line 12) QNAME:
          IRI: UNAME `https://stonybrookmedicine.edu/radimages/anon'
        SUBJECT: (line 13) VARIABLE: notNULL+ reference+ exported (subject)
          NAME: UNAME `DICOM'   SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_18_10_t6'   EQUIV: LONG 11
        PREDICATE: (line 15) QNAME:
          IRI: UNAME `https://stonybrookmedicine.edu/epic/ns/AnonStudyInstanceUID'
        OBJECT: (line 15) VARIABLE: notNULL+ (object)
          NAME: UNAME `AnonStudyInstanceUID'    SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_18_10_t6'   EQUIV: LONG 13
        SELECT ID: STRING `s_19_11'
        TABLE ID: STRING `s_18_10_t6'
      (line 17) TRIPLE:
        GRAPH: (line 12) QNAME:
          IRI: UNAME `https://stonybrookmedicine.edu/radimages/anon'
        SUBJECT: (line 13) VARIABLE: notNULL+ reference+ exported (subject)
          NAME: UNAME `DICOM'   SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_18_10_t7'   EQUIV: LONG 11
        PREDICATE: (line 16) QNAME:
          IRI: UNAME `https://stonybrookmedicine.edu/epic/ns/AnonSeriesInstanceUID'
        OBJECT: (line 16) VARIABLE: notNULL+ (object)
          NAME: UNAME `AnonSeriesInstanceUID'   SELECT ID: STRING `s_19_11' TABLE ID: STRING `s_18_10_t7'   EQUIV: LONG 12
        SELECT ID: STRING `s_19_11'
        TABLE ID: STRING `s_18_10_t7'
     }
    FILTERS: EMPTY ARRAY
    SELECT ID: STRING `s_19_11'
    EQUIVS: 11 12 13 14 15 16 17 18 19

EQUIVS:
#0: merged and destroyed
#1: merged and destroyed
#2: merged and destroyed
#3: merged and destroyed
#4: merged and destroyed
#5: merged and destroyed
#6: merged and destroyed
#7: merged and destroyed
#8: merged and destroyed
#9: merged and destroyed
#10: merged and destroyed
#11: ( 0 subv (0 bindings, 0 nest.opt.), 0 recv, 8 gspo, 1 const, 0 opt, 0 subq: DICOM in s_19_11 s_18_10_t7 s_18_10_t6 s_18_10_t5 s_18_10_t4 s_12_5_t3 s_11_3_t2 s_11_3_t1 s_11_3_t0; notNULL reference exported)
#12: ( 0 subv (0 bindings, 0 nest.opt.), 0 recv, 1 gspo, 0 const, 0 opt, 0 subq: AnonSeriesInstanceUID in s_18_10_t7; notNULL)
#13: ( 0 subv (0 bindings, 0 nest.opt.), 0 recv, 1 gspo, 0 const, 0 opt, 0 subq: AnonStudyInstanceUID in s_18_10_t6; notNULL)
#14: ( 0 subv (0 bindings, 0 nest.opt.), 0 recv, 1 gspo, 0 const, 0 opt, 0 subq: DateShift in s_18_10_t5; notNULL)
#15: ( 0 subv (0 bindings, 0 nest.opt.), 0 recv, 1 gspo, 0 const, 0 opt, 0 subq: AnonMRN in s_18_10_t4; notNULL)
#16: ( 0 subv (0 bindings, 0 nest.opt.), 0 recv, 1 gspo, 0 const, 0 opt, 0 subq: hasAnonVersion in s_12_5_t3; notNULL)
#17: ( 0 subv (0 bindings, 0 nest.opt.), 0 recv, 1 gspo, 0 const, 0 opt, 0 subq: MRN in s_11_3_t2; notNULL)
#18: ( 0 subv (0 bindings, 0 nest.opt.), 0 recv, 1 gspo, 0 const, 0 opt, 0 subq: Accession in s_11_3_t1; notNULL)
#19: ( 0 subv (0 bindings, 0 nest.opt.), 0 recv, 1 gspo, 0 const, 0 opt, 0 subq: md5 in s_11_3_t0; notNULL)