Open nichtich opened 7 months ago
Looks like see_also can be used to hold the URI of the individual mapping, can't it?
I renamed the issue now, and I keep stumbling over the fact that this issue has been discussed many times, in the context of other issues like #91.
Adding `mapping_id` as a required slot is out of the question, I think, because identifier management is too much churn for most users who just want to add a quick table with mappings to their repo.
I see three options:
`mapping_id` exists.
I don't know the answer, but we need to make a decision, so thanks for bringing this up again. There are so many advantages of having `mapping_id`s, including:
Let’s first clarify what needs to be identified here: mappings proper, or mapping records?
For clarity, what I call a mapping proper (or core mapping) is purely a triple {subject, predicate, object} (and optionally a predicate modifier). A mapping record is a mapping proper plus additional metadata (justification, provenance, etc.) – basically, a mapping record is a row in a SSSOM/TSV file.
SSSOM primarily deals with mapping records, not core mappings.
So, again, what needs to have a unique identifier here: the core mapping, or the mapping record?
This needs to be clarified before any further discussion because the answer will dictate what can be done.
Notably, the ticket is about identifying individual rows, so they ask for an identifier for mapping record. But Nico’s second idea (constructing an ID out of the subject-predicate-object triple) is only suitable to identify a core mapping (where several records could have the same identifier because they are about the same core mapping), so it would not be applicable if the goal is to identify records.
For what it’s worth, I think having unique identifiers for core mappings does not make much sense. I believe in most cases it’s records that people will want to uniquely identify.
That is, we don’t need an identifier for the statement “the sky is blue“ (core mapping). We might need an identifier for the fact that John Doe (author), on April 9th 2024 (date), on the basis of careful observation (justification), stated that the sky was blue (core mapping).
Yes I should have clarified this.
For what it’s worth, I think having unique identifiers for core mappings does not make much sense. I believe in most cases it’s records that people will want to uniquely identify.
Hmmm.. So far I had exactly the opposite sense, because I was thinking about:
On the other hand, I can see why you would want to have an identifier for the whole record:
Grrrrr not an easy one to solve.
"Give me all the evidence that my:sky skos:exactMatch blue:sky" requires that all justifications (metadata) can be aggregated onto one and the same mapping, so that a landing page, say in a browser, would collect all justifications for a specific mapping and show them to the user.
I don’t see how this requires an identifier for the core mapping. Given a mapping set, it’s very easy for an application (such as a mapping browser of some sort) to query the set to get all the mapping records that have the same subject, the same predicate, and the same object.
For efficiency purposes, the application could build its own pseudo-identifier for each triple (for example by concatenating and hashing the subject-predicate-object triple) and have a cache map associating records to such pseudo-identifiers (to avoid having to always query the set), but that would be an internal implementation detail that doesn’t need to (and, in my opinion, shouldn’t) appear in the model.
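Such an internal cache could be sketched as follows; the separator, hash function, and dict-based record representation are arbitrary implementation choices for illustration, not anything mandated by the spec:

```python
import hashlib
from collections import defaultdict

def triple_key(subject_id: str, predicate_id: str, object_id: str) -> str:
    # Internal pseudo-identifier: hash of the concatenated S-P-O triple.
    # The '|' separator and sha256 are arbitrary implementation choices.
    raw = "|".join((subject_id, predicate_id, object_id))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Cache associating each pseudo-identifier with the records about that core mapping.
records_by_triple = defaultdict(list)

def index_record(record: dict) -> None:
    key = triple_key(record["subject_id"], record["predicate_id"], record["object_id"])
    records_by_triple[key].append(record)

# Two records about the same core mapping land under the same key.
index_record({"subject_id": "EXA:123", "predicate_id": "skos:exactMatch",
              "object_id": "EXB:456",
              "mapping_justification": "semapv:ManualMappingCuration"})
index_record({"subject_id": "EXA:123", "predicate_id": "skos:exactMatch",
              "object_id": "EXB:456",
              "mapping_justification": "semapv:LexicalMatching"})
```

The pseudo-identifier never leaves the application; querying "all records about this core mapping" is a single dict lookup.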
One reason why such a core mapping identifier shouldn’t be part of the model: it opens the way for possible discrepancies. What’s to prevent someone from misusing the identifier and doing, for example, this:
| core_mapping_id | subject_id | predicate_id | object_id | mapping_justification |
| --- | --- | --- | --- | --- |
| MAP:0001 | EXA:123 | skos:exactMatch | EXB:456 | semapv:ManualMappingCuration |
| MAP:0002 | EXA:123 | skos:exactMatch | EXB:456 | semapv:LexicalMatching |
where both records are about the same core mapping (EXA:123 skos:exactMatch EXB:456) but with a different justification. Yet whoever created the mapping mistakenly attributed two different identifiers, which makes it appear that those two records are about two different core mappings. What should an implementation (or another user, for that matter) do here? Believe the identifiers and treat the two records as being about different core mappings? Or believe the subject-predicate-object triples and treat them as being about the same core mapping?
(You could imagine the exact same problem with two records mistakenly having the same “core mapping identifier” despite having, for example, a different predicate. Again, what to do in such a case?)
And no, mandating in the spec that the core mapping identifier should be constructed in a predictable way from the SPO triple (for example by concatenating the components of the triple, as in your idea 2) would not solve the problem, because you could always get a set that was incorrectly generated by a bogus implementation.
Not having the core mapping identifier in the set at all, but instead letting applications compute such an identifier internally if they need one, avoids all those problems.
I also like an identifier for core mapping / mapping proper / triple + mod, encoded e.g. base64.
This will give us stable retroactive ids.
But for full SSSOM mapping records, I worry about what happens when the data model changes. Can stable ids, retroactive or otherwise, be guaranteed in that case? If not, then maybe prepending the SSSOM version in front of such encoded ids would do the trick.
We can do ids for both: `mapping_id` and `mapping_quad_id`. It's a bit cluttered. Should we?
And no, mandating in the spec that the core mapping identifier should be constructed in a predictable way from the SPO triple (for example by concatenating the components of the triple, as in your idea 2) would not solve the problem, because you could always get a set that were incorrectly generated by a bogus implementation.
I don't agree that it doesn't solve the problem. We can have `sssom-py` generate such ids, and even let users do it on their own by keeping the algorithm simple, like concat and base64/md5, and documenting it on the slot.
Users are always prone to make boo-boos. We can alleviate that by adding this to `sssom validate`.
Finally, I advocate for an official algorithm for this because it makes the ids globally unique / interoperable.
That's another reason to have an id on triples/quads too. Interoperability between mapping specs, though concatenation alone may also suffice.
I also like an identifier for core mapping / mapping proper / triple + mod, encoded e.g. base64.
This will give us stable retroactive ids.
A core mapping is already stably and uniquely identified by its subject ID, its predicate ID, and its object ID. What would we gain by having yet another identifier that is merely a derivation of those three?
We can have sssom-py generate such ids
SSSOM-Py is not the only software in the world that has to deal with SSSOM mapping sets. That’s actually why we try to have a standard specification.
keeping the algo simple, like concat and base64 / md5, and documenting it on the slot.
OK. Simple question then: do we concatenate the IDs in their CURIE form or in their expanded form?
The former will make SSSOM users/developers from the RDF world unhappy, because in RDF there are only IRIs, and having to condense the identifiers to a CURIE form just for the sake of deriving a triple identifier is downright ridiculous. Besides, it means the derived identifier will actually depend on the curie map used to condense the full-length identifier – so much for the stability and the interoperability of the triple identifier.
The latter would make more sense, but it will make the SSSOM users coming from the bioinformatics world unhappy, because they usually prefer to deal with CURIEs and the idea of having to expand CURIEs to full-length IRIs just for the sake of deriving a triple identifier will likely seem equally ridiculous to them.
By letting the applications that need a triple identifier (e.g. for performance reasons) compute their own internally without ever exposing it to the outside world, we avoid those problems entirely.
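The dependency on the chosen form can be shown concretely. In this sketch, the example prefixes, IRIs, and md5-of-concatenation scheme are all made up for illustration:

```python
import hashlib

SKOS = "http://www.w3.org/2004/02/skos/core#"

def derived_id(*components: str) -> str:
    # Hypothetical triple identifier: md5 of the '|'-joined components.
    return hashlib.md5("|".join(components).encode("utf-8")).hexdigest()

# The same core mapping, once in CURIE form and once expanded with one
# particular (made-up) curie map...
id_from_curies = derived_id("EXA:123", "skos:exactMatch", "EXB:456")
id_from_iris = derived_id("http://example.org/a/123",
                          SKOS + "exactMatch",
                          "http://example.org/b/456")

# ...yields two different "stable" identifiers for the same triple.
assert id_from_curies != id_from_iris
```

A different curie map (say, mapping `EXA` to `https://example.org/a#`) would yield yet another identifier, which is exactly the interoperability problem described above.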
Users are always prone to make boo-boos.
Well-designed tools and formats try to minimize the potential for boo-boos.
As for identifiers for mapping records (which is what the author of the ticket asked for, remember):
But for full SSSOM mapping records, I worry about what happens when the data model changes. Can stable ids, retroactive or otherwise, be guaranteed in that case?
What do we actually expect from record identifiers? Do we want/need identifiers that depend entirely on the content of the record (like some kind of hash value calculated on the record), or do we want arbitrary identifiers (like serially or randomly generated identifiers)?
Let’s say I have this mapping record:
| subject_id | predicate_id | object_id | mapping_justification | semantic_similarity_score |
| --- | --- | --- | --- | --- |
| FBbt:1234 | skos:exactMatch | CL:5678 | semapv:LexicalMatching | 0.87 |
and let’s assume it has an identifier M1 (regardless of how that identifier has been obtained/generated/derived). I use that identifier to annotate a bridging axiom in an ontology as being justified by the existence of this mapping record.
Now let’s say that upon refreshing the mapping, we find that the similarity score is no longer 0.87 but 0.91 (maybe because the algorithm of the matching software has changed a bit). Thus we update the mapping record accordingly:
| subject_id | predicate_id | object_id | mapping_justification | semantic_similarity_score |
| --- | --- | --- | --- | --- |
| FBbt:1234 | skos:exactMatch | CL:5678 | semapv:LexicalMatching | 0.91 |
This is no longer the same record, right (the core mapping is the same, but at least one metadata field is different)? Now, question: Should M1 point to that updated record?
If yes, it means we want identifiers that are not (or at least, not completely) dependent on the contents of a record.
If no, it means we can have identifiers that are entirely derived from the contents of the record, but it raises the question of the usefulness of such identifiers if a single change in a record can make them invalid.
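To make the dilemma concrete, here is a toy content-derived identifier; the canonicalisation and hash are invented for illustration, SSSOM defines no such algorithm:

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    # Toy content-derived record "identifier": hash of the canonicalised
    # record. Purely illustrative; not anything defined by the spec.
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

before = {"subject_id": "FBbt:1234", "predicate_id": "skos:exactMatch",
          "object_id": "CL:5678",
          "mapping_justification": "semapv:LexicalMatching",
          "semantic_similarity_score": 0.87}
# Same record after refreshing the mapping: only the score changed.
after = dict(before, semantic_similarity_score=0.91)

# A single metadata change invalidates the old identifier M1.
assert record_hash(before) != record_hash(after)
```

Any external reference that used the old hash (like the bridging-axiom annotation in the example) now dangles, which is the crux of the "if no" branch above.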
OK. Simple question then: do we concatenate the IDs in their CURIE form or in their expanded form?
Good point. I agree it should be expanded form.
A core mapping is already stably and uniquely identified by its subject ID, its predicate ID, and its object ID. What would we gain by having yet another identifier that is merely a derivation of those three?
I was thinking of that too. The main reason I can think of is that it could be less messy / confusing. Consider:
- Concatenated quad: `http://purl.obolibrary.org/obo/MONDO_123456789|skos:exactMatch|http://purl.obolibrary.org/obo/DOID_123456789|someMod`
- base64-encoded: `aHR0cDovL3B1cmwub2JvbGlicmFyeS5vcmcvb2JvL01PTkRPXzEyMzQ1Njc4OXxza29zOmV4YWN0TWF0Y2h8aHR0cDovL3B1cmwub2JvbGlicmFyeS5vcmcvb2JvL0RPSURfMTIzNDU2Nzg5fHNvbWVNb2Q=`
- md5 hash: `e401ea1a76b7e3e9d82b3969a9256f2b`
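These forms can be derived with a few lines of Python; the `|` separator and the choice of digests simply follow the example above:

```python
import base64
import hashlib

quad = ("http://purl.obolibrary.org/obo/MONDO_123456789"
        "|skos:exactMatch"
        "|http://purl.obolibrary.org/obo/DOID_123456789"
        "|someMod")

# base64 form: longer, but reversible.
b64 = base64.b64encode(quad.encode("utf-8")).decode("ascii")
# md5 form: fixed-length (32 hex characters), but one-way.
md5 = hashlib.md5(quad.encode("utf-8")).hexdigest()

# The original quad can be recovered from the base64 id, not from the hash.
assert base64.b64decode(b64).decode("utf-8") == quad
assert len(md5) == 32
```

The trade-off is visible directly: base64 keeps the quad inspectable (and recoverable) at the cost of length, while the hash is compact but opaque.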
mapping_id
For the option where we are going with a hashed or concatenated ID and are considering more fields than the full quad, @gouttegd you bring up good points: Does it make sense to incorporate all fields, even ones where the value might change often, like `semantic_similarity_score`? It looks like you'd leave such fields out, and I'm heavily leaning towards that as well.
Consider this quad: `http://purl.obolibrary.org/obo/MONDO_123456789|skos:exactMatch|http://purl.obolibrary.org/obo/DOID_123456789|someMod`
What is the use case of an IRI pointing to a core mapping? I can understand wanting an IRI pointing to a mapping record, but to a core mapping, I can’t.
It looks like you'd leave such fields out
No. For now I am actually of the opinion that for record identifiers (leaving aside the question of core mapping identifiers), the identifier (if we do decide to add an identifier field to the spec) should be an opaque string about which the spec makes no assumption at all.
Up to the users and/or the implementations to decide how the identifiers should be generated (serially, randomly, or by deriving a hash value from all or some of the records’ fields). The spec should not say anything about that.
If we were to specify a way to derive identifiers predictably from the contents of the records, I doubt we could all agree on what the expected behaviour should be, in particular because there may not be a behaviour that is more correct than the others. There will be cases where we will want a record ID to still be valid even if 5 metadata fields have been changed, and cases where we will want a new record ID because a single metadata field has been changed. The spec should not force its view here.
What is the use case of an IRI pointing to a core mapping?
Never mind, whether I can see a use for it or not does not matter.
The point is, such an identifier can always be derived in a completely predictable manner, so there’s no reason to have a field for it in the data model and the serialization formats. There would be no sense in doing this:
| core_mapping_id | subject_id | predicate_id | object_id | mapping_justification |
| --- | --- | --- | --- | --- |
| XXXCOREMAPIDXXX | MONDO:123456789 | skos:exactMatch | DOID:123456789 | semapv:ManualMappingCuration |
where XXXCOREMAPIDXXX would be some kind of hash computed on MONDO:123456789, skos:exactMatch, and DOID:123456789. This would merely duplicate (in a less readable way!) information that is already present in the rest of the row. Applications can compute the identifier on the fly if they need it.
(Aside: This is also why I dislike the `mapping_cardinality` field; I believe it should not exist, and SSSOM-Java actually ignores it completely when it is present – SSSOM-Java always computes the cardinality from the contents of the set instead of trusting the `mapping_cardinality` field. But at least there is some rationale for having that field: inferring the actual cardinality involves examining the entire mapping set and is therefore a relatively time-consuming operation, so it makes some sense to write the result down. By contrast, computing the core mapping identifier depends on nothing but the current row.)
I would have no (strong) objection if the spec defined a standard algorithm to derive such an identifier (instead of letting each application implement its own algorithm as I initially stated), so that all applications automatically derive the same identifiers for the same core mappings. But this has nothing to do in the data model and the data formats.
Nice discussion! This will take a while to hash out, so I hope we are all ready to be patient. I can't see a single thing being said in this thread that is completely wrong so far, so we need to accept that our judgements all come a bit from the perspective of our own use cases. Two small questions before we move on (please add the appropriate emoticon reaction):
This will take a while to hash out
Only because you want to do something that the author actually didn’t request. What is requested is a new slot to uniquely identify mapping records (“slot to identify individual rows” – they’re clearly talking about records here, not core mappings). This doesn’t need much elaboration.
We can add a new `mapping_id` slot with a type of `EntityReference`, make it optional or mandatory, with the simple constraints that:
Apart from enforcing those constraints, the spec would say nothing more about this slot. It will be treated as any other identifier in a set, that is, basically as an opaque string to which no particular meaning should be attached (apart from the fact that it’s a string that can exist in two forms, full-length or CURIEfied – again, as any other identifier).
It’s up to the users and the implementations to decide:
About the uniqueness constraint
A set with two or more records with the same `mapping_id` is invalid. When writing, an implementation MUST NOT create a set with duplicated `mapping_id` values. When reading a single set, a parser SHOULD reject the set if it happens to contain duplicated `mapping_id` values. When performing a merge operation, it is left to the implementations/users to decide how to handle the situation where the same `mapping_id` is present in several of the sets to be merged (aborting the operation, trying to merge the records, only keeping the last encountered record, etc.).
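A reader-side check for the uniqueness constraint could be sketched like this; the plain-dict record representation and function name are just illustrative choices, not part of any implementation:

```python
from collections import Counter

def duplicated_mapping_ids(records: list) -> list:
    # Return the mapping_id values that occur more than once. Per the
    # proposal above, a parser SHOULD reject the set if this is non-empty.
    ids = [r["mapping_id"] for r in records if "mapping_id" in r]
    return [i for i, n in Counter(ids).items() if n > 1]

records = [
    {"mapping_id": "EX:m1", "subject_id": "EXA:1", "object_id": "EXB:1"},
    {"mapping_id": "EX:m2", "subject_id": "EXA:2", "object_id": "EXB:2"},
    {"mapping_id": "EX:m1", "subject_id": "EXA:3", "object_id": "EXB:3"},
]
assert duplicated_mapping_ids(records) == ["EX:m1"]
```

Records without a `mapping_id` are simply skipped here, which matches the "optional slot" reading; a stricter mode could flag them instead.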
Mandatory or optional?
I am mildly against making such a `mapping_id` slot mandatory, mostly on the grounds that I believe it is too late to add new mandatory slots. I can be convinced otherwise, though, provided both major implementations (SSSOM-Py and SSSOM-Java) promise to accept identifier-less mapping sets for the time being (that is, if they encounter a set without `mapping_id`, they silently auto-generate some IDs and output a conformant set), so as to avoid breaking existing sets.
So, again, what needs to have a unique identifier here: the core mapping, or the mapping record?
My use case requires identification of mapping records. I want to convert mappings between JSKOS and SSSOM. Each mapping in JSKOS has a URI (aka `mapping_id`) to link additional statements, e.g. a review of the mapping. The `mapping_id` does not need to be mandatory, but without it the mapping is either filtered out or duplicated when the same set of mappings is imported twice. It's up to the specific application to decide. The `mapping_id` should be unique per record set; otherwise an error is thrown or the first mapping record is overwritten by the second having the same URI.
> Quad: `http://purl.obolibrary.org/obo/MONDO_123456789|skos:exactMatch|http://purl.obolibrary.org/obo/DOID_123456789|someMod`
>
> What is the use case of an IRI pointing to a core mapping? I can understand wanting an IRI pointing to a mapping record, but to a core mapping, I can’t.

My example "quad" is showing what an identifier might look like when concatenating the composite key of the "core mapping", or "quad", of subject, predicate, object, and modifier. I agree that the question of the value of an ID for a core mapping, in addition to or instead, is valid; good discussion so far on the merits of the various options.
This would merely duplicate (in a less readable way!) an information that is already present
My example was about a core mapping, but the same applies (much more so, even) to mapping records. I think that field concatenation, e.g. `http://purl.obolibrary.org/obo/MONDO_123456789|skos:exactMatch|http://purl.obolibrary.org/obo/DOID_123456789|someMod`, is less readable and, more importantly, introduces potential parsing issues and contains characters that may not be valid for identifiers in a lot of situations, compared to a hash, e.g. `e401ea1a76b7e3e9d82b3969a9256f2b`.
This issue was not meant to be about core mappings, but you might be interested that we create identifiers for core mappings as well in our mapping infrastructure:
- `urn:jskos:mapping:members:`
- `urn:jskos:mapping:content:`
GitHub is really bad for these discussions. The issue has become too long already for any other person to enter the discussion, and it's my fault. Let's forget about identifiers for core mappings for the moment and return to the ask in the OC.
Am I seeing it correctly that the primary use case for `mapping_id` is RDF/FAIR, i.e. a way to talk about a specific mapping record outside of the SSSOM context? Or is there another use case to consider?
Am I seeing it correctly that the primary use case for `mapping_id` is RDF/FAIR, i.e. a way to talk about a specific mapping record outside of the SSSOM context?
Exactly!
@joeflack4
Readability: hashing vs concatenation
I think you missed my point. My issue has never been about whether the identifier was made solely by concatenating or by concatenating-and-hashing.
My issue is with the basic idea of deriving the identifier from other fields, regardless of how that derivation is made.
Whether you merely concatenate the subject-predicate-object triple or you concatenate it and then hash it, it boils down to the same: you end up duplicating the information contained in the triple itself.
I don’t mind defining a standard algorithm to derive such a content-dependent value, but I object to having it stored anywhere in the model. It’s useless, since it can always be re-generated on demand, and its presence would cause more harm than good (notably by creating the possibility of discrepancies, and making a set harder to modify by hand since the identifier will need to be updated after any change).
the same applies (much more so, even) to mapping records
I object even more strongly to using a content-dependent “identifier“ for mapping records.
In fact I don’t understand where this idea, that the identifier should be derived from the contents of the record, comes from. Imagine if we were doing the same for OWL classes in an ontology, if the identifier of a class was derived from the definition of the class. You do any change to the class (adding a synonym, a new relationship, amending the textual definition, whatever), and boom, the identifier of the class changes! All references to the old identifier are now invalid! I hope anyone would find that idea ridiculous, so why are we entertaining it when it comes to mapping records?
@nichtich
The mapping members identifier is calculated as the SHA1 of the sorted list of subject and predicate (n-to-m mappings can hold multiple), prepended by `urn:jskos:mapping:members:`
Again, no objection to defining a similar algorithm in SSSOM, but strong objection to adding a field to store it.
The mapping content identifier is calculated as the SHA1 of the normalized JSKOS mapping record in JSON, prepended by `urn:jskos:mapping:content:`
For mapping records, I really don’t see the point of doing that. But similarly, no strong objection to defining an algorithm to do the same for SSSOM records, provided again that we don’t reserve a field to store it. The `mapping_id` slot, if we do create one, should be for an opaque identifier that is not dependent on the contents of the record.
(Well, if some users want to store in that field an identifier that they generated based on the contents of the record, they would of course be free to do so. But the spec would not mandate that.)
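For reference, the two JSKOS schemes described above might be sketched roughly as follows. This is only an approximation: the actual JSKOS normalisation rules are more involved, and the function names, member handling, and JSON canonicalisation here are all assumptions:

```python
import hashlib
import json

def jskos_members_id(members: list) -> str:
    # Rough sketch: SHA1 over the sorted member URIs, prefixed with the
    # members URN namespace. Real JSKOS normalisation is more involved.
    digest = hashlib.sha1("".join(sorted(members)).encode("utf-8")).hexdigest()
    return "urn:jskos:mapping:members:" + digest

def jskos_content_id(record: dict) -> str:
    # Rough sketch: SHA1 over a canonicalised JSON form of the whole
    # record, prefixed with the content URN namespace.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return "urn:jskos:mapping:content:" + digest
```

Note how the two schemes diverge exactly along the core-mapping/record split discussed in this thread: the members id is stable under metadata changes, the content id is not.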
@matentzn
GitHub is really bad for these discussions. The issue has become too long already for any other person to enter the discussion, and it's my fault.
Unpopular opinion: Good old mailing lists are much better, if only because they naturally allow threaded discussions (provided you use a decent email client). But cool kids nowadays don’t want to use email anymore.
That being said, if you think this discussion is a long and protracted one, you’ve probably never seen what’s happening on IETF mailing lists! :D
Hello everyone. I am following the discussion here. I prefer to make a quick comment before getting into more details as it might take more time.
The semantic web tells us that as soon as you start to say something about a resource you identify this resource with a URI and it becomes the subject of RDF statements. (Or you deal with blank nodes!!)
SSSOM is all about saying something about one specific mapping. This is the role of all the slots/properties defined by SSSOM. Thus it is obvious to me that a mapping must have an ID (a URI when moving to RDF).
SSSOM is all about saying something about one specific mapping. This is the role of all the slots/properties defined by SSSOM. Thus it is obvious to me that a mapping must have an ID (a URI when moving to RDF).
It looks like you’re talking about a core mapping here: the SSSOM properties that make up a record are to say something about a (core) mapping, so that (core) mapping would need an identifier so that we can talk about it. Is that indeed what you mean?
If so, I (again) disagree. If I annotate an axiom in an OWL ontology, I am making a statement about the axiom. But the axiom itself doesn’t have its own identifier. It doesn’t need one, since it is already uniquely identified by its triple {source, property, target} – something that the RDF/XML serialisation makes quite obvious:
<owl:Axiom>
<owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/CL_0000014"/>
<owl:annotatedProperty rdf:resource="http://www.w3.org/2000/01/rdf-schema#subClassOf"/>
<owl:annotatedTarget rdf:resource="http://purl.obolibrary.org/obo/CL_0000034"/>
<oboInOwl:is_inferred>true</oboInOwl:is_inferred>
</owl:Axiom>
So, when the thing you want to make a statement about is nothing more than a combination of identifiers (such as a triple), it’s not obvious that you need yet another identifier to identify that particular combination.
So I still don’t see the case for core mapping identifiers.
As for mapping record identifiers, I do agree that they are probably a good idea, they have been explicitly requested, and I’ve made a proposal to add a field for it. Can we discuss that?
What we would need to agree on here is:
Would a `mapping_id` be optional, recommended, or mandatory? I’m inclined towards making it optional, maybe recommended. As I’ve said before, I am mildly against making it mandatory. At least in the spec – this would not prevent some users, for their own applications, from requiring that a `mapping_id` always be present.
Arbitrary identifiers or content-dependent identifiers? By arbitrary identifier, I basically mean what I said in my proposal above: from the point of view of the spec, the identifier is merely an opaque string that is completely independent of the contents of the record. By contrast, a content-dependent identifier is an identifier that is somehow derived from the record, possibly by concatenating and hashing some or all of the fields that make up the record. I’ve already stated what I think about such content-derived identifiers.
Moved the core mapping id discussion here:
and the discussion about motivation here:
https://github.com/mapping-commons/sssom/discussions/360
@jonquet please add your ideas/motivation there!
After reflection, I think that if we do end up creating a slot to hold a mapping record identifier, it should probably not be called `mapping_id`.
We already have a `mapping_justification` slot, which is used to provide the justification for the (core) mapping. So `mapping_justification` is understood as “the justification for (or of) this core mapping”.
It is likely that many people would likewise interpret `mapping_id` as meaning “the ID for this (core) mapping”, thereby mistaking the slot as being intended for a core mapping identifier rather than a record identifier.
I think it would be best to minimise the potential for such needless confusion. I’d suggest something like `record_id` instead to make things clearer.
...a content-dependent value, ...making a set harder to modify by hand since the identifier will need to be updated after any change).
Just want to highlight; good point. I think this is one of the stronger cons of content-dependent GUIDs.
I think this is one of the stronger cons of content-dependent GUIDs
But this is not a problem at all if we agree that such GUIDs, if we do define them, must always be computed on the fly and never stored.
That is, the spec defines the GUID generation algorithm, and implementations provide a helper method that implements said algorithm (a method that takes a mapping and returns the derived GUID). Applications that for some reasons need such a GUID can then easily obtain it, without the GUID having to ever appear in a SSSOM file.
And defining such a GUID derivation algorithm does not preclude also having a `record_id` field intended to store an arbitrary identifier. We can completely have both.
That is, the spec defines the GUID generation algorithm, and implementations provide a helper method that implements said algorithm (a method that takes a mapping and returns the derived GUID). Applications that for some reasons need such a GUID can then easily obtain it, without the GUID having to ever appear in a SSSOM file.
I think this is the best way forward. Note however that:
must always be computed on the fly and never stored.
This is only true for the TSV serialisation – the whole point of this exercise is so that the RDF serialisation does have a resource identifier to satisfy the FAIR principles.
I also wonder how to enforce the logical inverse: that the field is never manually set. This may be awkward without custom validation code, and maybe some people will ignore this. Maybe a slightly less brutal approach would be that the identifier must be valid in accordance with the GUID generation algorithm. This can be easily validated, and we don't need to police as much if people for some reason wish to share their mappings with that id in it.
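That "less brutal" validation could look like this sketch, where `expected_guid` stands in for a hypothetical spec-defined algorithm and `record_guid` is a made-up field name, not an actual SSSOM slot:

```python
import hashlib

def expected_guid(record: dict) -> str:
    # Hypothetical spec-defined algorithm: sha256 over selected fields.
    raw = "|".join(record[k] for k in ("subject_id", "predicate_id", "object_id"))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def guid_is_valid(record: dict) -> bool:
    # If a GUID was written into the file anyway, at least check that it
    # matches the value the algorithm would produce for this record.
    stored = record.get("record_guid")
    return stored is None or stored == expected_guid(record)

rec = {"subject_id": "EXA:123", "predicate_id": "skos:exactMatch",
       "object_id": "EXB:456"}
rec["record_guid"] = expected_guid(rec)
assert guid_is_valid(rec)

rec["record_guid"] = "bogus"     # a hand-edited or stale value...
assert not guid_is_valid(rec)    # ...is caught by validation
```

A `sssom validate`-style tool could run exactly this check, accepting files that omit the field or store the correct value, and flagging anything else.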
must always be computed on the fly and never stored. Is only true for the TSV serialisation - the whole point of this exercise is so that the RDF serialisation does have a resource identifier to satisfy the FAIR principles.
This would only be needed if the content-derived “identifier“ was the only identifier – something that I strongly oppose.
If we do create a field to store a record identifier, it should be for an arbitrary identifier that is not automatically generated from the contents – an opaque string, as for (as far as I know) any other identifier in the RDF world.
An automatically derived content-dependent “identifier“ (what we have called in the previous messages a “GUID”, though this is misleading as there are other ways to get “globally unique identifiers“ without deriving them from the contents of what they identify) would be an additional feature of the spec, not the only identifier. And we should only define such a GUID if we have a good reason to do so, which for now we don’t – what is the use case for an “identifier“ that changes every time the contents of the record changes?
I bet this has been mentioned before, but I did not find the corresponding issue: SSSOM lacks a slot to identify individual rows. This is required to fully map from/to the JSKOS Mapping format (#249), but I'm sure there are other use cases as well.