reconciliation-api / specs

Specifications of the reconciliation API
https://reconciliation-api.github.io/specs/draft/

Entity reconciliation between schemas and ontologies #72

Open agnescameron opened 3 years ago

agnescameron commented 3 years ago

This came up in this month's call, and I wanted to give a full explanation of the use case I was describing (as it's 'a bit meta'), which can be shaped into more of a feature request / mailing list object through discussion. I originally brought this up in relation to the discussion of type hierarchies in #68 -- my impression is that the main distinction between these cases is that here, the entities being resolved are themselves types, as specified by an ontology.

What we're trying to achieve:

Taking a range of datasets, produced by different people working in a similar context (in this case, innovation data), and reconciling the dataset schemas against a common ontology. This could be based on just the string information of the column headers or, more ideally, on a combination of the column header and the data type, or on the relationships between different columns within the schema.

The goal of the work we're doing is to build a graph of relationships between datasets, allowing merging/querying operations across a range of diverse data sources. There's also a version of this where the entities within the dataset also get reconciled (which looks a lot more like the traditional reconciliation API), but it would be interesting to know what's possible with an ontology alone.
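One way the first step above could look in practice: treat each column header as an entity to reconcile and send the headers as a batch of queries to a reconciliation service. This is only a sketch; the target type identifier and the choice of headers are hypothetical, and the exact wire format depends on the API version of the service (here, the batch is serialized into a `queries` form parameter, as in the earlier API versions).

```python
import json

def build_query_batch(column_headers, target_type=None):
    """Build a reconciliation query batch, one query per column header,
    keyed q0, q1, ... as in the reconciliation API's batch shape."""
    batch = {}
    for i, header in enumerate(column_headers):
        query = {"query": header, "limit": 3}
        if target_type is not None:
            # Constrain candidates to a type from the target ontology
            # (identifier below is a made-up example, not a real type id).
            query["type"] = target_type
        batch[f"q{i}"] = query
    return batch

headers = ["patent_id", "pub_date", "applicant_name"]
batch = build_query_batch(headers, target_type="PatentPublicationIdentificationType")

# Older API versions send the batch as a form parameter named "queries":
payload = {"queries": json.dumps(batch)}
print(json.dumps(batch, indent=2))
```

The interesting open question from the thread is what the service behind such a call would match against: header strings alone, or header plus data type and inter-column relationships.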

For example: I know 3 different researchers, all of whom use patent identifiers in their datasets in a different format. The WIPO standards ontology specifies patent identifier formatting as part of a hierarchical ontology.

If I wanted to specify what identification scheme was being used in each instance, I could: each PatentPublicationIdentification has a PatentPublicationIdentificationType, which is composed of a sequence of up to 5 different objects including PublicationLanguageCode, PatentDocumentKindCode and PublicationDate. 2 of the 5 are optional, and many of these also have further possible type specifications (e.g. PublicationLanguageCode can have a different ExtendedISOLanguageCodeType depending on when the identifier was specified).
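To make the composite-type idea concrete, here is an illustrative sketch of that kind of structure as a data class. The component names come from the paragraph above, but the exact composition, the required/optional split, and the extra fields are assumptions for illustration, not the real WIPO ST.96 schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PatentPublicationIdentificationType:
    """Hypothetical model of a composite patent-identifier type:
    a sequence of components, some required, some optional."""
    ip_office_code: str                    # assumed required component
    publication_number: str                # assumed required component
    patent_document_kind_code: str         # e.g. "A1", "B2"
    publication_language_code: Optional[str] = None  # optional in this sketch
    publication_date: Optional[date] = None          # optional in this sketch

# Three researchers' differing formats could each be parsed into
# one normalized structure like this:
epo_style = PatentPublicationIdentificationType(
    ip_office_code="EP",
    publication_number="1000000",
    patent_document_kind_code="A1",
    publication_date=date(2000, 5, 17),
)
```

The crosswalking pain described below is exactly the work of writing the parser from each researcher's ad-hoc format into a shared structure like this.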

While it's possible to go through this process manually (either by going through the WIPO schema, or using guides to patent ID construction), crosswalking column types like this can be a real pain, especially for newer researchers not versed in the foibles of different notation.

It's possible that the entity reconciliation API is not the place for this problem, but it would be interesting to know what would work well -- so many ontologies get specified but then under-used when it comes to actually linking published schemas to their corresponding types. Are there existing workflows that anyone's familiar with for producing this kind of metadata (I'll add them to the census if so)?

thadguidry commented 3 years ago

This is a general problem: in the past, less efficient software for mapping, XML, RDF, etc. has evolved over time to be much better now in 2021, but it depends on the actual use cases... and on whether you want to involve the Semantic Web or not, and publish, or republish, as is often the need. For instance, a lot of schemas such as http://www.wipo.int/standards/XMLSchema/ST96/ are not actually vocabularies in the traditional sense, but real schemas for a particular niche set of domains, where no attempt to map to Linked Open Vocabularies or otherwise was part of the effort. (Closed World vs. Open World)

I think your immediate mapping needs, from Schema <-> Schema or even Any <-> Many, might be best accomplished with a tool and server used quite a bit in the market for that need: Altova MapForce / Server / XMLSpy.

XML

As far as a history lesson of how far we have come, this page goes over a broad set of tools and software, some no longer used or available: https://www.w3.org/wiki/XML_Schema_software

RDF

Gosh, there are so many over the decades depending on the needs, but practically, mapping existing DBs to RDF was very common in academia and enterprise. Here's the dated 2009 state of the art: https://www.w3.org/wiki/Rdb2RdfXG/StateOfTheArt

Linked Open Data

Nowadays, many of the maps are just directly embedded into Wikidata itself through the various SKOS-related properties, as I've done with Schema.org and other ontologies I have loosely mapped into it. One example: https://www.wikidata.org/wiki/Q26907166 Yes, manually. But a general Excel or LibreOffice "lookup" function or OpenRefine's cross() can go a long way to map things "cheaply"... though as I stated, you often need all the power that good tools like Altova's give you, which then allow you to publish or upload those maps to share with the world.
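The "cheap" lookup approach mentioned above can be sketched minimally: a hand-maintained crosswalk table (the kind of thing a spreadsheet VLOOKUP or OpenRefine's cross() would consult) applied to a list of column headers. All the header names and ontology identifiers below are hypothetical examples.

```python
# Hand-maintained crosswalk from local column headers to ontology terms.
# Both sides are made-up identifiers for illustration only.
crosswalk = {
    "patent_no": "wipo:PatentPublicationIdentification",
    "pub_date": "wipo:PublicationDate",
    "kind": "wipo:PatentDocumentKindCode",
}

def map_headers(headers, table):
    """Return (header, ontology term) pairs; unmapped headers get None,
    flagging where manual crosswalking is still needed."""
    return [(h, table.get(h)) for h in headers]

result = map_headers(["patent_no", "kind", "inventor"], crosswalk)
# "inventor" has no entry, so it comes back unmapped.
```

The limits of this approach are exactly what the thread is about: the table only captures exact header strings, not data types or inter-column relationships.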

Semantic Web

Browsing and developing an ontology is totally different from the needs of mapping or linking ontologies. I've lost touch with a lot of the open source world's efforts for mapping or linking ontologies because, to me, there were no standards (everyone was doing their own thing at different levels), and Wikidata evolved into being a common place for doing that mapping and sharing it with the world more easily. Still, here's an older page from 2010 with an updated link: https://www.w3.org/wiki/SemanticWebTools

thadguidry commented 3 years ago

@agnescameron Someone just mentioned to me (offlist) that you would likely be better served by asking folks within the W3C DXWG where interoperability and mapping are exactly their focus: https://www.w3.org/2017/dxwg/wiki/Main_Page

agnescameron commented 3 years ago

@thadguidry thanks for this! I hadn't encountered DXWG before, but they seem really ideal for this instance. The XML Schema software history is also great.

fsteeg commented 3 years ago

@agnescameron: I don't think I fully understand your use case yet, but did I mention Cocoda (https://coli-conc.gbv.de/cocoda/) in the meeting? I have not worked with it myself, but it uses the reconciliation API (acting as a client instead of OpenRefine) to create mappings between different ontologies.

agnescameron commented 3 years ago

@fsteeg Cocoda seems really well-suited to this use-case; will test it out and report back. thanks!

thadguidry commented 3 years ago

Oooo... nice find, Fabian. Didn't know about Cocoda.

Ah, and here's some of the background research that went into and is layered into the Cocoda tool. https://coli-conc.gbv.de/hub/

Interesting. Played with it just now... it's nice that you can stay on the Scope column to keep that context in view, click back into the search input box, click on the next suggestion, and then see the Scope. But they should allow a dual view: seeing the list of results in a real panel (not a dropdown interface) AND showing Scope. Something I'd like us to eventually have in OpenRefine to make recon easier: dedicated recon panel/subpanels.
