i40-Tools / OntEnrich

Application for knowledge graph enrichment through linking with DBpedia.
MIT License

Cleaning rdf:type range in the enriched STO #1

Open mmaltsev opened 6 years ago

mmaltsev commented 6 years ago

In DBpedia, within our area of interest, the range of the property rdf:type sometimes contains irrelevant data.

An example is sto:BBF_TR-069 -- rdf:type -- dbpcy:Rule106652242. Here the unrelated object dbpcy:Rule106652242 appears only because dbpcy:Protocol106665108 is an rdfs:subClassOf it.

Thus, the full chain dbpcy:Protocol106665108 < dbpcy:Rule106652242 < dbpcy:Direction106786629 < dbpcy:Message106598915 < dbpcy:Communication100033020 < dbpcy:Abstraction100002137 ends up among the rdf:type values of sto:BBF_TR-069.

The question is - should anything from such chains be removed from the enriched ontology?

Another example is sto:SCOR -- rdf:type -- dbpcy:Person100007846.

This case is easier because such a concept is simply wrong and we can exclude the whole chain with dbpcy:Person100007846 in it from the enriched ontology.
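One way to picture "excluding the whole chain" is the minimal Python sketch below. It is only an illustration under assumptions, not the project's code: the class hierarchy is a hand-written table, the direct edge from Person100007846 to PhysicalEntity100001930 collapses whatever intermediate classes DBpedia has, and ex:ReferenceModel is a hypothetical class standing in for a correct type.

```python
def ancestors(cls, parents):
    """Return every direct or indirect superclass of cls."""
    seen, stack = set(), [cls]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def drop_rejected_chains(types, parents, rejected):
    """Remove rejected classes together with the chains they belong to."""
    types, rejected = set(types), set(rejected)
    # Types that are a rejected class or sit below one.
    tainted = {t for t in types if t in rejected or ancestors(t, parents) & rejected}
    # Superclasses of the rejected classes (the upper part of their chains).
    above = set()
    for r in rejected:
        above |= ancestors(r, parents)
    # Types that have nothing to do with a rejected chain.
    unrelated = types - tainted - above
    # A class above a rejected one survives only if an unrelated type also implies it.
    supported = set()
    for t in unrelated:
        supported |= ancestors(t, parents)
    return unrelated | (types & above & supported)


# Hypothetical hierarchy and type list, for illustration only.
SUBCLASS_OF = {
    "Person100007846": ["PhysicalEntity100001930"],
    "ex:ReferenceModel": ["Abstraction100002137"],
}
SCOR_TYPES = {
    "Person100007846",
    "PhysicalEntity100001930",
    "ex:ReferenceModel",
    "Abstraction100002137",
}

cleaned = drop_rejected_chains(SCOR_TYPES, SUBCLASS_OF, {"Person100007846"})
print(sorted(cleaned))
```

The sketch keeps a shared superclass when some unrelated, surviving type still implies it, so rejecting one chain does not silently remove classes that belong to another.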

igrangel commented 6 years ago

To meet this requirement, there has to be something to compare against. The ontology could serve as that reference: we could check whether the instances are correct instantiations of a given class in the ontology. Where they are not, those classes should be removed from the full chain. Still, in the end, we need a ground truth to compare with.

mmaltsev commented 6 years ago

The only solution that came to my mind was to narrow down the classes for each standard. That is - to exclude all super classes and leave only those which are at the bottom level of the "DBpedia class tree". Such an approach was implemented here.

Applying it to OPC UA leads to the following. Before:

sto:IEC_62541 a dbpcy:Abstraction100002137,
                dbpcy:Communication100033020,
                dbpcy:Direction106786629,
                dbpcy:Measure100033615,
                dbpcy:Message106598915,
                dbpcy:Protocol106665108,
                dbpcy:Rule106652242,
                dbpcy:Standard107260623,
                dbpcy:SystemOfMeasurement113577171,
                dbpcy:WikicatComputerStandards,
                dbpcy:WikicatNetworkProtocols .

After:

sto:IEC_62541 a dbpcy:WikicatComputerStandards,
                dbpcy:WikicatNetworkProtocols .

Applying it to the enriched ontology yields this. The process removes 429 triples overall. In addition, some class chains, such as WikicatBusinessModels -> ... -> PhysicalEntity100001930, were excluded entirely.
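The "bottom-level only" filter described in this comment can be sketched in a few lines of Python. This is a hedged illustration, not the linked implementation: the dbpcy: prefix is dropped, the hierarchy is a hand-written table, and only the Protocol < Rule < Direction < Message < Communication < Abstraction chain comes from this thread. Every other rdfs:subClassOf edge in SUBCLASS_OF is an assumption made so the example is complete.

```python
def ancestors(cls, parents):
    """Return every direct or indirect superclass of cls."""
    seen, stack = set(), [cls]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def most_specific(types, parents):
    """Keep only the types that are not a superclass of another asserted type."""
    types = set(types)
    implied = set()
    for t in types:
        implied |= ancestors(t, parents)
    return types - implied


# child -> direct superclasses. Edges outside the Protocol chain are assumed.
SUBCLASS_OF = {
    "WikicatNetworkProtocols": ["Protocol106665108"],      # assumed
    "Protocol106665108": ["Rule106652242"],
    "Rule106652242": ["Direction106786629"],
    "Direction106786629": ["Message106598915"],
    "Message106598915": ["Communication100033020"],
    "Communication100033020": ["Abstraction100002137"],
    "WikicatComputerStandards": ["Standard107260623"],      # assumed
    "Standard107260623": ["SystemOfMeasurement113577171"],  # assumed
    "SystemOfMeasurement113577171": ["Measure100033615"],   # assumed
    "Measure100033615": ["Abstraction100002137"],           # assumed
}

# The eleven rdf:type values of sto:IEC_62541 before cleaning.
BEFORE = {
    "Abstraction100002137", "Communication100033020", "Direction106786629",
    "Measure100033615", "Message106598915", "Protocol106665108",
    "Rule106652242", "Standard107260623", "SystemOfMeasurement113577171",
    "WikicatComputerStandards", "WikicatNetworkProtocols",
}

print(sorted(most_specific(BEFORE, SUBCLASS_OF)))
# ['WikicatComputerStandards', 'WikicatNetworkProtocols']
```

Under the assumed hierarchy, the result matches the two classes shown in the "after" listing.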

igrangel commented 6 years ago

The problem, in this case, is that we may be removing facts that are true. E.g., OPC UA can be considered a dbpcy:Communication100033020 and a dbpcy:Standard107260623. To get this right we need a Gold Standard, or at least access to the ontology. Which criteria did you use to remove the triples? How can this be validated?

mmaltsev commented 6 years ago

I excluded superclasses such as dbpcy:Communication100033020 and dbpcy:Standard107260623 for two reasons: 1) they add no information, because dbpcy:WikicatStandards, dbpcy:WikicatANSIStandards, or any other "bottom-level" class is automatically a dbpcy:Standard107260623; 2) dbpcy:Standard107260623 is just an internal identifier inside DBpedia, and it doesn't always mean "Standard" as we understand it. Moreover, this kind of information gives us no knowledge we can actually use.

Some of the classes were removed because their "top-level" super class was PhysicalEntity100001930 which generally describes people, events, etc.

This solution might not be the best, because it excludes some classes that are true, but at least it narrows the list down to classes that are easy to check and whose origin is easy to understand.
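Point 1 above says the removed superclasses carry no extra information. A minimal Python sketch of that argument: following rdfs:subClassOf upward from the two kept classes brings back every removed one. The hierarchy table is an assumption written by hand for this example (only the Protocol ... Abstraction chain appears in this thread), and the dbpcy: prefix is omitted.

```python
def entailed_types(types, parents):
    """Return the given types plus everything implied via rdfs:subClassOf."""
    result = set(types)
    stack = list(types)
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


# child -> direct superclasses. Edges outside the Protocol chain are assumed.
SUBCLASS_OF = {
    "WikicatNetworkProtocols": ["Protocol106665108"],      # assumed
    "Protocol106665108": ["Rule106652242"],
    "Rule106652242": ["Direction106786629"],
    "Direction106786629": ["Message106598915"],
    "Message106598915": ["Communication100033020"],
    "Communication100033020": ["Abstraction100002137"],
    "WikicatComputerStandards": ["Standard107260623"],      # assumed
    "Standard107260623": ["SystemOfMeasurement113577171"],  # assumed
    "SystemOfMeasurement113577171": ["Measure100033615"],   # assumed
    "Measure100033615": ["Abstraction100002137"],           # assumed
}

# The two rdf:type values of sto:IEC_62541 that remain after cleaning.
AFTER = {"WikicatComputerStandards", "WikicatNetworkProtocols"}

recovered = entailed_types(AFTER, SUBCLASS_OF)
print(len(recovered))                       # 11
print("Standard107260623" in recovered)     # True
```

This only holds while the subclass hierarchy stays available alongside the enriched ontology. Without it, the removed types cannot be rederived.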

igrangel commented 6 years ago

Can you evaluate the precision for just this example, with and without the removal? - Check this

mmaltsev commented 6 years ago

For sto:IEC_62541, if we accept that, from a human perspective, a standard is not a Communication (dbpcy:Communication100033020), Direction (dbpcy:Direction106786629), or Message (dbpcy:Message106598915), then 3 of the 11 classes are wrong, so the precision before the cleaning is p = 8/11 and after it p = 2/2 = 1. Again, this holds only under that human interpretation.

From the perspective of DBpedia as a system, all of these classes, e.g. Communication100033020 or Message106598915, just represent different layers of the abstraction hierarchy for the DBpedia resource. Thus, in this case, p = 1 either way (before and after the cleaning).
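The two readings of precision can be written out as a short Python sketch. Precision here is simply the share of asserted rdf:type values judged correct. The function and variable names are illustrative, and the set of "wrong" classes is the human judgement from the previous paragraph, not a gold standard.

```python
from fractions import Fraction


def precision(asserted, judged_wrong):
    """Fraction of asserted types that are not judged wrong."""
    asserted = set(asserted)
    correct = asserted - set(judged_wrong)
    return Fraction(len(correct), len(asserted))


BEFORE = {
    "Abstraction100002137", "Communication100033020", "Direction106786629",
    "Measure100033615", "Message106598915", "Protocol106665108",
    "Rule106652242", "Standard107260623", "SystemOfMeasurement113577171",
    "WikicatComputerStandards", "WikicatNetworkProtocols",
}
AFTER = {"WikicatComputerStandards", "WikicatNetworkProtocols"}

# Human reading: a standard is not a Communication, Direction, or Message.
HUMAN_WRONG = {"Communication100033020", "Direction106786629", "Message106598915"}

print(precision(BEFORE, HUMAN_WRONG))  # 8/11
print(precision(AFTER, HUMAN_WRONG))   # 1

# DBpedia reading: every class is a valid layer of the hierarchy, so nothing is wrong.
print(precision(BEFORE, set()))        # 1
print(precision(AFTER, set()))         # 1
```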