linkml / prefixmaps

Semantic prefix map registry
https://linkml.io/prefixmaps/
Apache License 2.0
10 stars 3 forks source link

Add tests for integrity of merged contexts #26

Closed cthoyt closed 1 year ago

cthoyt commented 1 year ago

15 introduced data integrity tests, but did not apply them to the merged contexts, because there seems to be some issues with the logic that generates them.

The merged and merged.oak strings should be removed from the following (effectively leaving skip = set() to go along with any updates that it takes to generate merged contexts with bijectiveness guarantees.

https://github.com/linkml/prefixmaps/blob/433da1f1e74bfa470809eb5640fbeb93bda4c368/tests/test_core/test_integrity.py#L20

cmungall commented 1 year ago

If we don't skip these, then

    def test_namespace_aliases(self):
        """Test that prefix aliases have a valid namespace."""

will fail, but this is expected.

this is because we have:

merged,sdo,https://schema.org/,namespace_alias

from prefix.cc -- which is junk because, the correct semantic URL uses http not https, so there is no direct deterministic join point with the correct canonical prefix or IRI:

merged,schema,http://schema.org/,canonical

so this is effectively orphaned

now, we could extend the test to do a transitive walk over aliases and use the following as a join point:

merged,schema,https://schema.org/,prefix_alias

but this would be overkill, and not necessary, as the fundamental assumption here doesn't hold - the merged context does have integrity, it provides the correct bijective mapping for schema.org, according to the precedence rules in which the merged context is built (with the junky prefix.cc having lowest priority).

The thing to remember here is that the only thing the API exposes is the canonical mappings, the rest is just there for debugging purposes. It could be argued that a cleaner design would be for the CSVs to only include the canonical mappings for that context, and to put additional anciliary metadata that arises from the ETL elsewhere - a separate issue could be made for this if it's a priority (this would only be a priority for making the library easier to understand, as it wouldn't affect the users of the library)