ontop / ontop

Ontop is a platform to query relational databases as Virtual RDF Knowledge Graphs using SPARQL
https://ontop-vkg.org
Apache License 2.0
653 stars 163 forks source link

Ontop materialize produces empty result on table without pkey and with null values #395

Open tirrolo opened 3 years ago

tirrolo commented 3 years ago

Description

Whenever a table without primary key is encountered, the Ontop bootstrapper will generate a bnode template comprising all attributes of the table. However, if such table contain null values, then this means that no output rdf triple will be generated by ontop materialize with respect to rows containing nulls (this seems to follow the mechanics adopted for URIs, that cannot be constructed out of null values... however bnodes templates should be treated differently than URI templates, since bnodes are not URIs).

Steps to Reproduce

1) Create the DB "bootstrapper-tests-bnode" (already on obdalin) with the following definition:

create table i (
    a integer,
    b integer
);

insert into i values (1,null);
insert into i values (null,2);

Bootstrap the ontology and mappings. For your convenience, here they are:

Ontology:

<?xml version="1.0"?>
<rdf:RDF xmlns="http://purl.org/bnodes/onto#"
     xml:base="http://purl.org/bnodes/onto"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:xml="http://www.w3.org/XML/1998/namespace"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <owl:Ontology rdf:about="http://purl.org/bnodes/onto"/>

    <!-- 
    ///////////////////////////////////////////////////////////////////////////////////////
    //
    // Data properties
    //
    ///////////////////////////////////////////////////////////////////////////////////////
     -->

    <!-- http://purl.org/bnodes/onto/i#a -->

    <owl:DatatypeProperty rdf:about="http://purl.org/bnodes/onto/i#a"/>

    <!-- http://purl.org/bnodes/onto/i#b -->

    <owl:DatatypeProperty rdf:about="http://purl.org/bnodes/onto/i#b"/>

    <!-- 
    ///////////////////////////////////////////////////////////////////////////////////////
    //
    // Classes
    //
    ///////////////////////////////////////////////////////////////////////////////////////
     -->

    <!-- http://purl.org/bnodes/onto/i -->

    <owl:Class rdf:about="http://purl.org/bnodes/onto/i"/>
</rdf:RDF>

<!-- Generated by the OWL API (version 4.5.9) https://github.com/owlcs/owlapi -->

Mapping:

[PrefixDeclaration]
obda:       https://w3id.org/obda/vocabulary#
xsd:        http://www.w3.org/2001/XMLSchema#
owl:        http://www.w3.org/2002/07/owl#
rdf:        http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:       http://www.w3.org/2000/01/rdf-schema#

[MappingDeclaration] @collection [[
mappingId   MAPPING-ID1
target      _:ontop-bnode-1/{a}/{b} a <http://purl.org/bnodes/onto/i> ; <http://purl.org/bnodes/onto/i#a> {a}^^xsd:integer ; <http://purl.org/bnodes/onto/i#b> {b}^^xsd:integer . 
source      SELECT * FROM "i"
]]

3) Run Ontop materialize:

./ontop materialize -p onto/bnodes/bnodes.properties -m onto/bnodes/bnodes.obda -t onto/bnodes/bnodes.owl -o bnodes.ttl

Expected behavior: Some non-empty output

Actual behavior: Empty RDF file.

Reproduces how often: Always

Attached material

Everything is provided above.

Versions

Ontop CLI 4.0.3

Additional Information

Any additional information, configuration or data that might be necessary to reproduce the issue.

bcogrel commented 3 years ago

Hi @tirrolo ,

That's an interesting problem.

According to R2RML, section 7.3, if an argument of a template is null, the template value becomes null, whether it is an IRI template or something else.

Do you know if the Direct Mapping specification suggests something for this situation?

One first idea from my side would be to use a COALESCE in the source part of the mapping assertion so as to replace the null value by a default value (e.g. the 'null' string). This may be imperfect since multiple rows could produce the same bnode, which I don't know if it is tolerated by the Direct Mapping specification or not.

Best, Benjamin

kontchakov commented 3 years ago

Hi Benjamin, Davide

The Direct Mapping specification (https://www.w3.org/TR/rdb-direct-mapping/#no-pk) does not prescribe any way of creating blank nodes (no template), it merely says that each row of the table should get a fresh blank node assigned to it. This makes perfect sense, but it means that we need some sort of surrogate ID in that case. However, all attributes, as Davide highlights, cannot be used as such an ID - a row number is required instead, but that makes it difficult to support it in general. DBMSs use various ways of producing row IDs, and keeping those IDs consistent for references (https://www.w3.org/TR/rdb-direct-mapping/#ref-no-pk) may even be impossible.

A practical solution could be using COALESCE(TYPE2TXT(column), 'NULL'), but that will be close to the weird treatment of NULLs in early MS SQL Server, where all NULLs in any given column were "equal" (so, a UNIQUE nullable column could contain at most one NULL).

Best Roman

bcogrel commented 3 years ago

Hi,

Thanks Roman for the explanation. We encounter here the same difficulties as with the request for supporting anonymous blank nodes in the mapping (#372, not part of R2RML).

Unfortunately, the practical solution based on COALESCE we are having in mind does not comply with the Direct Mapping specification. The approach of row ids seems to be a better way, but indeed hard to support as DBMSs handle them in very different ways, when they provide this feature.

Best, Benjamin

bcogrel commented 3 years ago

Following our online discussion this morning, the Ontop bootstrapper does and will not align itself with the Direct Mapping spec, but with the default mapping mechanism of R2RML (https://www.w3.org/TR/r2rml/#default-mappings). We should made explicit that duplicate rows are not preserved.

In that setting, it should be OK to replace nulls by default values using coalesce. That will be the responsibility of the bootstrapper, not of the R2RML mapping processor. The choice of default values should be made configurable.

bcogrel commented 3 years ago

One idea of having one single default value to specify: casting all arguments of the template to string before applying the coalesce.

kontchakov commented 3 years ago

That's precisely what I wrote above, with COALESCE(TYPE2TXT(column), 'NULL'). I mentioned it's weird, but actually it's not that bad: the template will contain all columns, so, the only effect of this COALESCE is indeed elimination of duplicate rows.

To properly deal with duplicate rows, we'll need to extend the SQLParser to understand the "pseudo-columns" like ROWID (which will make it more dependant on the SQL dialect, which is probably inevitable at some stage anyway).

tirrolo commented 3 years ago

Yes guys, I can handle this in the way you suggest (i.e., ignore rowids) in the new bootstrapper branch, because I have a configuration file there. However I am not sure at all we will eventually want to merge that into Ontop: that bootstrapper is getting quite complex and for such reason I was in fact thinking of taking it out of the Ontop codebase. For sure, before considering any merging, it requires a substantial refactoring effort (I have spent the latest months in rushing adding new features, and as a result the code looks ugly).