openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com

kNN queries in Virtuoso [Problematic Geometry Evaluation] #748

Open kotheoha opened 6 years ago

kotheoha commented 6 years ago

Hi,

I have loaded two RDF triple datasets, A & B, into Virtuoso. Both datasets have spatial and non-spatial triples. Regarding spatial triples, A includes POINT, LINESTRING and POLYGON geometries, while B has POINT and MULTIPOINT geometries.

I also have some kNN queries, divided into two classes: C1 (points) and C2 (non-points). The C1 queries ask for the k nearest spatial entities that have a POINT geometry. The C2 queries ask for the k nearest spatial entities that have a non-POINT geometry (e.g., LINESTRING or POLYGON in A, MULTIPOINT in B).
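The C1/C2 split described above can be sketched as a small check on the WKT keyword of each geometry literal. This is only an illustration, assuming the geometries are available as plain WKT strings; the helper names `wkt_type` and `query_class` are hypothetical and do not come from Virtuoso or from the poster's setup.

```python
import re

# Matches the leading geometry keyword of a WKT string, e.g. "POINT" in "POINT(-2.5 52.5)".
_WKT_KEYWORD = re.compile(r"^\s*([A-Za-z]+)")


def wkt_type(wkt: str) -> str:
    """Return the upper-cased geometry keyword of a WKT string."""
    match = _WKT_KEYWORD.match(wkt)
    if match is None:
        raise ValueError(f"not a WKT string: {wkt!r}")
    return match.group(1).upper()


def query_class(wkt: str) -> str:
    """Hypothetical labelling: C1 for POINT geometries, C2 for everything else."""
    return "C1" if wkt_type(wkt) == "POINT" else "C2"
```

Counting entities per class this way before running the queries gives a quick idea of how many candidates each class should be able to return.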

I implemented the kNN queries following the last example at http://docs.openlinksw.com/virtuoso/rdfsparqlgeospat/, except that I use ASC instead of DESC and add a LIMIT k.

During kNN evaluation on A & B in Virtuoso, I have encountered two problems:

1) For C1 kNN queries (points), the results are not in the right order, and several entities that belong to the correct kNN solution are missing. This happens on dataset A; on B I execute only non-point kNN queries.
2) For C2 kNN queries (non-points), on A I get zero results, while on B I get the error: "Function st_distance() expects a geometry of type 1 as argument 0, not geometry of type 4"

Since Virtuoso supports geometries, as you say at http://docs.openlinksw.com/virtuoso/sqlrefgeospatial7enchance/, and I can load geometry triples without problems, it seems that the bif:st_distance function does not work correctly and causes problems (1) & (2) above.

I am awaiting a formal answer on problems (1) & (2), since I will include the kNN evaluation on Virtuoso in a research paper targeted at a very significant conference.

Best, KT.

TallTed commented 6 years ago

First key question... With what version of Virtuoso are you seeing this behavior? Please provide the first "paragraph" of output from the commandline virtuoso-iodbc-t -? or virtuoso-t -?, or the result of a related SPARQL query.

kotheoha commented 6 years ago

The output from virtuoso-t -? is:

Virtuoso Open Source Edition (Column Store) (multi threaded)
Version 7.2.4.2.3217-pthreads as of Jan 30 2018
Compiled for Linux (x86_64-pc-linux-gnu)
Copyright (C) 1998-2016 OpenLink Software

I installed the Virtuoso stable/7 branch.

So, I would like a solution for kNN queries on stable/7, since we have already run a series of time-consuming update experiments on that branch, and we have no time to redo them on another branch that you may propose.

I selected stable/7 because it is the latest stable Virtuoso version.

KT.

TallTed commented 6 years ago

As you will have noticed, the last commit to the stable/7 branch was applied Apr 25, 2016, and this branch is now some 600 commits behind develop/7. We are currently preparing a major update for stable/7 -- which I expect will include most if not all of those 600 develop/7 commits, and possibly a few more. I do not believe we will be able to offer a specific solution for the stable/7 branch as-of-Apr-2016 that does not involve this major update.

I would suggest that you test with a build from the current develop/7 tree, to see whether this issue persists. If not, we may be able to point to a specific commit which resolved this, which you may be able to apply to your local branch for your own build. This is not optimal, because you will no longer be in sync with the main tree, but it may be sufficient for your immediate needs.

kotheoha commented 6 years ago

I have checked the kNN queries on a develop/7 build.

Now, the output from virtuoso-t -? is:

Virtuoso Open Source Edition (Column Store) (multi threaded)
Version 7.2.5-rc1.3217-pthreads as of Jun 16 2018 (000000)
Compiled for Linux (x86_64-pc-linux-gnu)
Copyright (C) 1998-2018 OpenLink Software

The problem persists! The only change is that, instead of the error "Function st_distance() expects a geometry of type 1 as argument 0, not geometry of type 4" (mentioned above for B), I now get zero results, as was already the case on A.

So, as a general conclusion, it seems that the bif:st_distance function (used in the kNN queries) ignores POLYGON, LINESTRING and MULTIPOINT geometries during kNN evaluation. I expect it does the same for all non-point geometries. Meanwhile, even for POINT geometries, the kNN results (as mentioned) are not in the right order.

It would be good to fix this bug and correctly support all Virtuoso geometries in basic spatial query evaluations such as kNN.

KT.

TallTed commented 6 years ago

Thank you for the additional information. It will help speed our efforts, if you can provide us with (the smallest possible subset of) the datasets and the (simplest) queries that demonstrate these issues.

If there's any concern about data confidentiality, you can use our Support Center to create a case, and submit these extra details directly (instead of posting them publicly). We can also process an NDA for this purpose, if appropriate.

kotheoha commented 6 years ago

We took two different samples from the LGD repository (2015-11-02).

This dataset has only POINT and LINESTRING geometries, so we converted some of the LINESTRING geometries into POLYGON and MULTIPOINT geometries. A & B mentioned above are two different variations of LGD.
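The thread does not say how the LINESTRING geometries were converted. One plausible approach, assuming the geometries are plain 2D WKT strings, could look like the sketch below. The function names are hypothetical, and whether Virtuoso accepts exactly these output forms (for example MULTIPOINT with inner parentheses) is an assumption to verify.

```python
import re


def _coords(linestring_wkt: str) -> list:
    """Extract the 'x y' coordinate pairs of a 2D WKT LINESTRING as strings."""
    match = re.fullmatch(r"\s*LINESTRING\s*\((.*)\)\s*", linestring_wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a WKT LINESTRING: {linestring_wkt!r}")
    # Normalise inner whitespace so "  1   0" becomes "1 0".
    return [" ".join(pair.split()) for pair in match.group(1).split(",")]


def linestring_to_multipoint(linestring_wkt: str) -> str:
    """Turn every vertex of the line into a member of a MULTIPOINT."""
    points = ", ".join(f"({pair})" for pair in _coords(linestring_wkt))
    return f"MULTIPOINT({points})"


def linestring_to_polygon(linestring_wkt: str) -> str:
    """Use the line as the outer ring of a POLYGON, closing the ring if needed."""
    ring = _coords(linestring_wkt)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # a polygon ring must end where it starts
    if len(ring) < 4:
        raise ValueError("a polygon ring needs at least 3 distinct vertices")
    return "POLYGON((" + ", ".join(ring) + "))"
```

For example, `linestring_to_polygon("LINESTRING(0 0, 1 0, 1 1)")` yields `POLYGON((0 0, 1 0, 1 1, 0 0))`. Note that a line that crosses itself would produce an invalid polygon; this sketch does not check for that.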

Below, I provide an indicative kNN query:

SPARQL
SELECT DISTINCT ?s (bif:st_distance(?g, bif:st_point(-2.5, 52.5)))
FROM <http://LGD.com>
WHERE
{
?s <http://linkedgeodata.org/ontology/version> ?v .
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://linkedgeodata.org/ontology/Restaurant> .
?s <http://www.w3.org/2003/01/geo/wgs84_pos#geometry> ?g .
FILTER (bif:st_intersects(?g, bif:st_point(-2.5, 52.5), 20))
}
ORDER BY ASC 2
LIMIT 5;

This query asks for the 5 restaurants nearest to bif:st_point(-2.5, 52.5).

http://LGD.com is the graph name of the loaded LGD database. We used http://LGD_A.com and http://LGD_B.com for the loaded A and B datasets, respectively.
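To judge whether the order and completeness of such a result are correct, a brute-force reference for POINT geometries can help. The sketch below makes two assumptions that should be checked against the Virtuoso documentation: that distances for longitude/latitude points are great-circle distances in kilometres, and that the third argument of bif:st_intersects acts as a distance bound, so that entities farther than 20 from the query point never reach the ORDER BY. The names `haversine_km` and `knn` are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; Virtuoso's constant may differ


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def knn(points, query, k, radius_km=None):
    """Brute-force k nearest neighbours.

    points    -- dict mapping an entity name to a (lon, lat) tuple
    query     -- (lon, lat) of the query point
    radius_km -- optional cut-off mimicking the assumed st_intersects bound
    Returns a list of (distance_km, name) tuples, nearest first.
    """
    qlon, qlat = query
    scored = [(haversine_km(lon, lat, qlon, qlat), name) for name, (lon, lat) in points.items()]
    if radius_km is not None:
        scored = [item for item in scored if item[0] <= radius_km]
    return sorted(scored)[:k]
```

Running `knn(points, (-2.5, 52.5), 5, radius_km=20)` over the restaurant points of a small sample gives a list to compare with the query output. If fewer than 5 entities lie within the radius, fewer than 5 rows are expected, which is worth ruling out before treating missing neighbours as a bug.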

KT.

TallTed commented 6 years ago

Forgive me if I was unclear.

I believe I understand your explanation of the issue, but in order to address it properly and quickly, we need a quick path to test -- dataset(s) to load, query(ies) to execute, and an idea of, if not the exact, expected results -- both to reproduce the issue, and to test our fix(es).

Every step you can save us toward that reproduction, will speed the process. That includes being explicit about which samples you took from the LGD repository, and if at all possible providing the dataset you actually loaded, including your conversions from LINESTRING geometries into POLYGON and MULTIPOINT geometries.

Given the current state of work on the upcoming release, I do not think this issue will be resolved therein -- but it is possible, if we have an exact step-by-step "load this, execute that, compare to this" to work from -- and even if this fix is not in the immediate update, such specifics will make this fix come much sooner than if we have to create the dataset, etc.

kotheoha commented 6 years ago

Ok,

I can provide a detailed guide to help you find and fix this bug. However, I have an end-of-June deadline, so I estimate I can send it about 10 days from now.

Apart from the queries, I will also provide the exact datasets. The first is 2.5 GB, the second is 18 GB. Could you provide me with a confidential Virtuoso repository to upload those datasets to?

I will provide my answer in about 10 days. KT.

TallTed commented 6 years ago

I am looking into an appropriate upload destination for these large datasets.

I am sorry to keep asking for extra details, but can you please provide the output of the following SPARQL query against the Virtuoso 7.2.4.2.3217 instance you performed most of your testing on, and again against the newly built 7.2.5-rc1.3217?

SELECT
  ( bif:sys_stat('st_dbms_ver')   AS ?version )
  ( bif:sys_stat('st_build_date') AS ?build_date )
  ( bif:sys_stat('git_head')      AS ?git_head )   
WHERE  {  ?s  ?p  ?o  }  LIMIT 1

kotheoha commented 6 years ago

Version 7.2.4.2.3217-pthreads output:
version: 07.20.3217
build_date: Jan 30 2018
git_head: No system status variable git_head

Version 7.2.5-rc1.3217-pthreads output:
version: 07.20.3217
build_date: Jun 16 2018
git_head: 000000

KT.