openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com
Other
861 stars 210 forks source link

SPARQL: Using a VALUE clause with one value causes query to hang #636

Open drassokhin opened 7 years ago

drassokhin commented 7 years ago

I came across an issue that may actually be related to https://github.com/openlink/virtuoso-opensource/issues/400. When the query attached below is run against an instance of Virtuoso (OpenLink Virtuoso Server Version 07.20.3217-threads for Win64 as of Apr 25 2016) loaded with a large RDF dataset combining the ChEMBL compound pharmacology dataset v.22.1 (see https://www.ebi.ac.uk/rdf/services/chembl/) with our internal datasets, the query hangs (or executes so slowly that it appears to hang). If the VALUES block contains more than one URI (for example, the in the snippet below is un-commented), the same query executes and returns results in under a second. The query will also return results almost immediately even with just one URI in the VALUES block if the '| :hasTargetComponent' alternate matching pattern is removed. As far as I understand, setting the "Enable_joins_only = 1" flag param in the INI file as recommended in responses to issue 400 would not make sense any longer, because it is the default value with 3215+ builds. Please let me know if this is a known issue and whether it is going to be resolved in upcoming releases of Virtuoso server. If not, I will go ahead and collect additional info (db stats, etc) to help pinpoint and resolve the problem. BTW, I think this bug can be reproduced with the Open PHACTS dataset if the chembl graph name in the query is changed to http://www.ebi.ac.uk/chembl, but I have not tried it, since I no longer have an instance of Virtuoso hosting the entire Open PHACTS dataset at hand.

PREFIX : <http://rdf.abcd.com/terms/abcd#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>

select 
?TargetUri
?Name
?TypeName
?TargetComponentNames

from <http://www.ebi.ac.uk/chembl_22_1>
from <http://www.w3.org/2002/07/owl#>
where
{

    ?TargetUri (rdfs:label) ?Name.
    ?TargetUri a ?TypeUri.
    ?TypeUri rdfs:label ?TypeName.

    optional
    {
       select 
       ?TargetUri 
       (group_concat(distinct ?TargetComponentUri; SEPARATOR = ', ') as ?TargetComponentUris)
       (group_concat(distinct ?TargetComponentName; SEPARATOR = ', ') as ?TargetComponentNames)
       where
       {
          ?TargetUri (cco:hasTargetComponent | :hasTargetComponent) ?TargetComponentUri.
          ?TargetComponentUri (rdfs:label | skos:altLabel) ?TargetComponentName.

       } group by ?TargetUri
    }    

    values ?TargetUri
    {
        <http://rdf.ebi.ac.uk/resource/chembl/target/CHEMBL6143>
        #<zzz>
    }
}
HughWilliams commented 7 years ago

@drassokhin: Testing against our local Open PHACTS instance with single values URI or both (i.e. uncommented the value) the query returns one row immediately in both cases:

TargetUri   Name    TypeName    TargetComponentNames
http://rdf.ebi.ac.uk/resource/chembl/target/CHEMBL6143  
"Fumarate hydratase, mitochondrial"
"ChEMBL SingleProtein Class"^^<http://www.w3.org/2001/XMLSchema#string>
CHEMBL_TC_4708, FH, Fumarase, Fumarate hydratase, mitochondrial

Note this instance is running a latest 3218 binary build i.e. develop/7 equivalent from git open source archive, which would be another slight difference to your 3217 build from Apr 2016.

Do you have a Linux instance you can test this on with a latest 3218 develop/7 build, as I seem to recall you build on Linux also now ?

drassokhin commented 7 years ago

@HughWilliams These query optimizer-related issues are often difficult to reproduce. As I have just found out, if the 'from http://www.w3.org/2002/07/owl#' is removed from my query, it also executes fast even with the dummy value commented out. The 'from http://www.w3.org/2002/07/owl#' statement is simply a the query I attached to this incident report is a reduced version of the original, more complicated, query where some assertions from the http://www.w3.org/2002/07/owl# named graph were used. Trying to run the same query against the same dataset in an instance running a more recent build of Virtuoso on a Linux instance would be quite time-consuming, mostly because I have access to just one Linux instance, which is currently being used for other purposes and does not have sufficient hard drive space to accommodate a very large data file containing our dataset (>100GB). I implemented a hack in our software, which inserts a second (dummy non-matching) item into the VALUES block when that block contains exactly one item. This has solved our immediate problem for now, but I will keep my eye on similar issues and report them if I detect certain query patterns that affect query performance. Are you going to have a new version release (with pre-built Windows binaries) in the near future? I guess I can also set up a Windows build environment and experiment with unreleased development versions on Windows machines, since I have a bunch of them at my disposal.

HughWilliams commented 7 years ago

@drassokhin: We are currently preparing to make an open source stable/7 update release from the current develop/7 branch at which point the pre-built Windows binaries will be build and made available for public download. I would expect this to take place in April ...