Update all labels according to a standard convention (replaces #20)

DanCarey404 commented 4 years ago

Add/replace rdfs:label values according to the agreed-to standard.

rjyounes commented 4 years ago

@DanCarey404 Can you please add a summary of the agreed-upon standard so that it is documented here? Then can #20 be closed?

rjyounes commented 4 years ago

Replaces #20

DanCarey404 commented 4 years ago

Per Rebecca's request, these are the labeling standards being implemented.

Classes

Sentence case
Normalized to natural language standards. E.g., hyphens inserted, acronyms in all caps, etc.
- Examples: AMA guideline, ISBN-10

Properties

Same as classes, but initial lowercase
Examples: has unit of measure, has SSN.

rjyounes commented 4 years ago

Just to clarify: this was not my request: it was my proposal, which the group discussed and agreed on. :)

rjyounes commented 3 years ago

This task needs an assignee.

uscholdm commented 3 years ago

@sa-bpelakh Boris has a query for this. Can you pop it into this issue, for convenience?

sa-bpelakh commented 3 years ago

What I have are SHACL rules that validate that labels are conformant: https://github.com/semanticarts/platts-ontology/blob/develop/shapes/ontologyShapes.ttl. They enforce the policy described above, and even detect acronyms in all caps (minimum of 2 letters, I believe) and ignore their casing. The current version does not allow for numbers in class or property names, so if that's a requirement, we'll have to make some changes.
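
For readers without access to the shapes file, the policy described above can be approximated with two regular expressions. This is only a Python sketch, not the SHACL itself: the pattern names and the `is_conformant` helper are hypothetical, and it assumes that an acronym is any run of two or more capital letters and that digits are rejected, as in the current version of the rules.

```python
import re

# Assumption: an acronym is a run of two or more capital letters.
ACRONYM = r"[A-Z]{2,}"
# A lowercase word, optionally hyphenated (e.g. "data-centric"). No digits.
LOWER_WORD = r"[a-z]+(?:-[a-z]+)*"
WORD = rf"(?:{ACRONYM}|{LOWER_WORD})"
# First word of a class label: an acronym or a capitalized word.
FIRST_CLASS_WORD = rf"(?:{ACRONYM}|[A-Z][a-z]*(?:-[a-z]+)*)"

# Sentence case for classes; same for properties but with initial lowercase.
CLASS_LABEL = re.compile(rf"{FIRST_CLASS_WORD}(?: {WORD})*")
PROPERTY_LABEL = re.compile(rf"{WORD}(?: {WORD})*")


def is_conformant(label: str, is_property: bool) -> bool:
    """Return True if the label follows the sentence-case policy."""
    pattern = PROPERTY_LABEL if is_property else CLASS_LABEL
    return pattern.fullmatch(label) is not None


print(is_conformant("AMA guideline", is_property=False))      # True
print(is_conformant("has SSN", is_property=True))             # True
print(is_conformant("Temporal Relation", is_property=False))  # False
print(is_conformant("ISBN-10", is_property=False))            # False: digits rejected
```

Allowing numbers would mean widening `LOWER_WORD` and `ACRONYM` to accept digits.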

rjyounes commented 3 years ago

I think we should allow numbers; e.g., hypothetically we could define classes like Shimano105Components, Iso639 (subclass of Category), TourDeFrance2020Racers, CharactersIn1984, ...

sa-bpelakh commented 3 years ago

I can see that for a specific domain, but for the base gist? We can set up a fun game of regex golf for labels.

rjyounes commented 3 years ago

Maybe it's less likely in an upper ontology, but why exclude it in principle?

Re regex golf - that looks fun! Maybe at our next happy hour?

uscholdm commented 3 years ago

> Maybe it's less likely in an upper ontology, but why exclude it in principle?

We get to choose our own stylistic conventions, as does each client project. I don't think we want gist to have numbers in IRIs, and a rule for this would find a one that looks exactly like a lowercase el. So I vote to put it in our gist checks as a warning.

rjyounes commented 3 years ago

If we disallow numbers in local names, and we happen to come across the need for one, we are then forced to spell the number out, which I think is worse. What's wrong with numbers in IRIs?

A reminder that this issue is the implementation of a set of conventions that had already been decided on and documented in the gist style guide. The point is not to revisit the decisions here. Quoting from the style guide that we had agreed on:

Alphanumeric characters only.
- Example: Isbn10, not Isbn-10 or ISBN-10.

This issue surfaced because I want to find a new assignee, since we agreed on the implementation back in April and have been postponing it since then.

rjyounes commented 3 years ago

@sa-bpelakh Platts and gist should be able to have different naming conventions. Is it onto_tool that applies the SHACL rules? If so, the SHACL shapes or files to invoke (or a folder containing them) should be configured in the YAML file, or stored in a particular directory, or something.

uscholdm commented 3 years ago

It's true that this issue should not get into what the style conventions are. That can be debated in a separate issue if anyone cares enough to raise it.

sa-bpelakh commented 3 years ago

@rjyounes Yes, the bundle file configures which shapes to apply. So we can configure whatever we consider appropriate for gist, and customers can, um, customize 😄 whichever way they want.

marksem commented 3 years ago

Team has disagreement on the naming convention as of 12/10/2020 issues meeting. @DanCarey404 will poll SA ontologists. While there IS consensus to follow the standard, there is not consensus ON the standard.

(Detail: Some want Title Case for classes, not sentence case. Some want Title Case for all concepts. Rationale: all are concepts, and labeling for particular use cases (like sentence generation vs. column headings) won't always work.)

PS @sa-bpelakh will modularize the SHACL checking to make it easy to apply different conventions depending on where the standard lands.

rjyounes commented 3 years ago

@marksem @DanCarey404 Can we please move discussion of this issue to a gist review meeting and record the notes here? Our goal is to be transparent, and decisions made by internal polling are not. In addition, there needs to be a rationale for reopening a decision that was made months ago. We cannot rethink every issue for those who did not attend the discussion. If someone who is unable to attend wants to provide input, that can be indicated here and we can accommodate them by scheduling a special meeting if needed.

rjyounes commented 3 years ago

My input is based on earlier decisions now recorded in the gist style guide:

Classes

Sentence case
Normalized to natural language standards. E.g., hyphens inserted, acronyms in all caps, etc.
- Examples: AMA guideline, ISBN-10

Properties

Same as classes, but initial lowercase
Examples: has unit of measure, has SSN.

Rationale

We adopt sentence over title case because the latter, while technically well-defined, has more complex rules and can introduce inconsistencies when implemented by different users.

Additional notes:

Sentence case vs title case: I hold by the decision made earlier: We adopt sentence over title case because the latter, while technically well-defined, has more complex rules and can introduce inconsistencies when implemented by different users.
Lower case for all properties, object and datatype
Acronyms in labels: since I believe that labels (as opposed to local names) should be in natural language form, acronyms should be spelled as they normally are. I will note that UoM is not an actual English-language acronym and therefore is not a good test case. We should also be careful about when an acronym is a prefLabel and when an altLabel: there are cases where the acronym is the most common term (e.g., "CIA", "FBI") and therefore it should be the prefLabel and the fully-spelled out version should be the altLabel, but there are also cases of the reverse (e.g., "Electronic Arts" not "EA").
Labels are meant to be in natural language, not camelcase etc. Therefore, hyphens are appropriate where they are used in natural language (in this case English) but not otherwise.

uscholdm commented 3 years ago

I find @rjyounes's arguments and rationale compelling. If anyone wants to use labels for column headers, then they can introduce a subproperty of altLabel called, say, titleCaseLabel.

rjyounes commented 3 years ago

I didn't realize that one of the issues at stake in the renewed discussion was the use of labels as column headers. IMO that makes the case even stronger: it's hard to justify considering the preferred label as one designed for column headings or any other implementation-specific use. We have actually had this discussion during review of #20, where we reached the same conclusion as in @uscholdm's suggestion above, to define additional annotations for application-specific needs. In the case of column headers, they are (or could be) the same as the local names, so one could parse the IRIs to derive the local names for use as column headers and not maintain the values in an annotation.

DanCarey404 commented 3 years ago

I suggest that all words in a label have a leading capital. One reason for this suggestion is that Notepad++ has a convert case option (Proper Case) which does that, as does MS Word (Capitalize Each Word). This removes ambiguity from the rule and ensures the consistency that some are looking for.

rjyounes commented 3 years ago

@DanCarey404 Are you suggesting that even function words (prepositions, articles, etc.) would be capitalized? That's not a type of casing I've ever heard of, other than in the applications you mention.

rjyounes commented 3 years ago

One reason for using initial lower for properties: we use labels that are tied to the local names, and should preferably be derivable from them by some simple rules, such as adding whitespace at word boundaries indicated by camel-casing. Since our properties have local names with initial lowercase, this suggests the labels should follow suit.
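
As an illustration of such a simple rule, here is a minimal Python sketch that splits a local name at camel-case word boundaries. The function name `local_name_to_label` is hypothetical, and the sketch assumes the first word keeps the casing it has in the local name while all later words are lowercased, so acronyms such as SSN would still need manual correction.

```python
import re


def local_name_to_label(local_name: str) -> str:
    """Derive a label by inserting spaces at camel-case word boundaries."""
    # A boundary is any position between a lowercase letter or digit
    # and an uppercase letter, e.g. between "s" and "G" in "hasGiver".
    words = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", local_name).split(" ")
    # Keep the first word as written; lowercase the rest.
    return " ".join([words[0]] + [w.lower() for w in words[1:]])


print(local_name_to_label("hasUnitOfMeasure"))  # has unit of measure
print(local_name_to_label("TemporalRelation"))  # Temporal relation
```

Because property local names start with a lowercase letter, the derived property labels come out in initial lowercase with no extra rule.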

rjyounes commented 3 years ago

These are the logical options for classes and properties:

1. Title case for all: Temporal Relation, Has Giver, Identified By (in title case, prepositions at the end of a phrase receive stress and are in upper case)
2. Title case for classes, lower case for properties: Temporal Relation, has giver
3. Sentence case for all: Temporal relation, Has giver
4. Sentence case for classes, lower case for properties: Temporal relation, has giver
5. Same as local name: TemporalRelation, hasGiver
6. Lower case for all: This has not been mentioned and I doubt if anyone wants it; we can probably rule it out.
7. Every word upper case: Has Unit Of Measure

Note: 2-4 make exceptions for acronyms and terms that are generally capitalized: Social Security Number, has SSN, has Social Security Number.

I would reject 5 because a label is meant for humans and thus should be in natural language.

We haven't mentioned taxonomy terms. Logical options for taxonomy terms:

Title case
Sentence case
Lower case

Review of conventions used by well-known ontologies:

SKOS: Concept Scheme, exact match (2)
PROV: SoftwareAgent, atLocation (5)
FOAF: Online Account, based near (2)
OAI-ORE: Aggregated Resource, Is Aggregated By (1)
OWL Time: Duration description, has beginning (4)
BIBFRAME (Library of Congress): Key title, Has event content (3)
dcterms: Method of Accrual, Date Modified (1)
Schema: Ignore Action, Accepted Offer (1)
Lingvo: Language resource, resource type (4)
Open Annotation: TextPositionSelector, hasBody (5)
Ordered List Ontology: Ordered List, has ordered list (2)

Conclusion: There are no generally accepted conventions; we should choose whichever one we like best.

Note on title case: There is no one standard for title case: see https://en.wikipedia.org/wiki/Title_case. Chicago Manual of Style, Associated Press, etc. each define their own, though of course the broad convention is common to all. If we adopt title case, I propose that we choose one of these standard variants (or invent our own) and document it in the gist style guide as a reference for ontology developers and reviewers.

I also propose that labels conform to natural language standards by the insertion of, for example, hyphens, even if our standards for local names do not include such characters. E.g., ISBN-10 for class Isbn10.

rjyounes commented 3 years ago

Notes from 2021-01-14 triage meeting:

Dave: When do we see labels?

Graphics
Forms

Which would you rather see in these contexts?

Rebecca: we also see them in documentation (e.g., Widoco)

Peter: accuracy more important than typographic consistency

Will vote next meeting.

uscholdm commented 3 years ago

Thank you @rjyounes for the comprehensive summary.

> Conclusion: There are no generally accepted conventions; we should choose whichever one we like best.

Exactly.

> We haven't mentioned taxonomy terms.

Most taxonomy terms are instances of gist:Category, which is a lot like a class, semantically. The key technical difference is that we use gist:categorizedBy instead of rdf:type to indicate what kind of thing something is. So we may want to adopt the same convention for taxonomy terms as we do for classes.

rjyounes commented 3 years ago

These are the logical options for class and property labels:

1. Title case for all: Temporal Relation, Has Giver, Identified By (in title case, prepositions at the end of a phrase receive stress and are in upper case)
2. Title case for classes, lower case for properties: Temporal Relation, has giver
3. Sentence case for all: Temporal relation, Has giver
4. Sentence case for classes, lower case for properties: Temporal relation, has giver
5. ~Same as local name: TemporalRelation, hasGiver~
6. ~Lower case for all: This has not been mentioned and I doubt if anyone wants it; we can probably rule it out.~
7. ~Every word upper case: Has Unit Of Measure~

Offline voting yields #2 as the winner.

Rebecca will compile a short list of title case conventions for consideration at next meeting. The selected convention will be included in the gist style guide.

rjyounes commented 3 years ago

I've sorted through a number of style guides from reputable sources (AP, APA, Chicago Manual of Style, MLA, NYT, Wikipedia). The details are included in the attached document as I think they will not be of general interest. I've come up with an amalgam of various conventions that is also computable (e.g., a rule to capitalize nouns, verbs, adjectives, adverbs, and pronouns, or to lowercase prepositions unless stressed, is not computable), as follows:

1. Capitalize:
   a. First and last words
   b. Words of four or more letters
   c. Second part of hyphenated word (e.g., Data-Centric, not Data-centric)
2. Lowercase:
   a. Articles: a, an, the
   b. Conjunctions: and, but, if, for, or, nor, so, yet
   c. Prepositions: as, at, by, cum, ere, for, in, of, off, on, out, per, pre, pro, qua, re, sub, to, up, via
3. Capitalize everything else
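
To show that these rules are computable, here is a minimal Python sketch of them. The names `LOWERCASE_WORDS`, `capitalize_word`, and `title_case` are hypothetical, and the sketch makes two simplifying assumptions: words are separated by whitespace, and every part of a hyphenated word is capitalized. Letters after the first are left untouched so that acronyms survive.

```python
# Articles, conjunctions, and prepositions that stay lowercase
# unless they are the first or last word of the label.
LOWERCASE_WORDS = {
    "a", "an", "the",
    "and", "but", "if", "for", "or", "nor", "so", "yet",
    "as", "at", "by", "cum", "ere", "in", "of", "off", "on", "out",
    "per", "pre", "pro", "qua", "re", "sub", "to", "up", "via",
}


def capitalize_word(word: str) -> str:
    """Capitalize each hyphen-separated part, leaving other letters as-is."""
    return "-".join(part[:1].upper() + part[1:] for part in word.split("-"))


def title_case(label: str) -> str:
    words = label.split()
    result = []
    for i, word in enumerate(words):
        is_first_or_last = i == 0 or i == len(words) - 1
        if is_first_or_last or len(word) >= 4 or word.lower() not in LOWERCASE_WORDS:
            result.append(capitalize_word(word))
        else:
            result.append(word.lower())
    return " ".join(result)


print(title_case("unit of measure"))            # Unit of Measure
print(title_case("identified by"))              # Identified By
print(title_case("data-centric architecture"))  # Data-Centric Architecture
```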

Attachment: Title Case Conventions.pdf

rjyounes commented 3 years ago

Regarding automated conversion of local names to labels: there's an issue in the conversion of acronyms and hyphenated words. There are two possible local name conventions:

Represent as in natural language - generally all uppercase - e.g., hasSSN
Represent in camel case - e.g., hasSsn. The argument is that word boundaries can be easily detected. isCiaAgent allows word boundary detection, while isCIAAgent does not. Even for human users, the word boundary is easier to see in the former.

However, labels should include natural language formats: is CIA agent, not is Cia agent. The correct version cannot be algorithmically computed from either local name.

The same may be true of hyphenated words, depending on the local name convention. ISBN-10 can be automatically computed from ISBN-10 but not from Isbn-10, ISBN10, or Isbn10.

In fact, in general it is easier to derive the local name from the label than vice versa.
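
A minimal Python sketch of that direction, assuming the local name convention of alphanumeric characters only, camel case, and initial lowercase for properties. The function name `label_to_local_name` is hypothetical.

```python
import re


def label_to_local_name(label: str, is_property: bool) -> str:
    """Derive a camel-case local name from a natural-language label."""
    # Split on anything that is not a letter or digit (spaces, hyphens, ...).
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", label) if p]
    # Capitalize the first letter of each part and lowercase the rest,
    # which flattens acronyms: "CIA" becomes "Cia".
    camel = "".join(p[:1].upper() + p[1:].lower() for p in parts)
    if is_property:
        camel = camel[:1].lower() + camel[1:]
    return camel


print(label_to_local_name("ISBN-10", is_property=False))      # Isbn10
print(label_to_local_name("is CIA agent", is_property=True))  # isCiaAgent
print(label_to_local_name("has SSN", is_property=True))       # hasSsn
```

The conversion discards the casing and hyphens that the reverse direction would need, which is why the label cannot be recovered from the local name.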

If we want to stick to our proposed local name conventions, we will use the forms hasSsn, isCiaAgent, and Isbn10. These require human correction after the automated label generator has been applied. If the generator runs before every release, we would need human intervention each time. Another option: add a skos:editorialNote indicating to the generator that the label should not be touched.

uscholdm commented 3 years ago

> In fact, in general it is easier to derive the local name from the label than vice versa.

Interesting observation, it usually goes the other way, but this sounds correct.

> The argument is that word boundaries can be easily detected. isCiaAgent allows word boundary detection, while isCIAAgent does not. Even for human users, the word boundary is easier to see in the latter.

I think it is easier to see the boundary in the former: isCiaAgent . Was that a typo?

rjyounes commented 3 years ago

Yes, that's an error. I've fixed it above.

rjyounes commented 3 years ago

Title case proposal above accepted for implementation.

rjyounes commented 3 years ago

Boris will fix all labels, first by automation and then manual adjustment for exceptions.

rjyounes commented 3 years ago

In writing the label validation script (see PR #428), Boris noted that proper nouns in labels must also retain capitalization. An emended version of the label conventions follows:

Title Case Convention

1. Capitalize:
   a. First and last words
   b. Words of four or more letters
   c. Second part of hyphenated word (e.g., Data-Centric, not Data-centric)
2. Lowercase:
   a. Articles: a, an, the
   b. Conjunctions: and, but, if, for, or, nor, so, yet
   c. Prepositions: as, at, by, cum, ere, for, in, of, off, on, out, per, pre, pro, qua, re, sub, to, up, via
3. Capitalize everything else

Label Conventions

Classes: title case (as above)
Properties: all lowercase

The following exceptions apply to both class and property labels:

Acronyms and proper nouns are kept intact (e.g., has SSN, unit symbol Unicode, ISBN-10)
Numbers are allowed (e.g., ISBN-10)
Hyphens are allowed (e.g., ISBN-10)

The exception for proper nouns makes the convention not fully automatable.

The implementation of these conventions in current labels will be done by Boris using a script with manual corrections (for the non-automatable exceptions). To support label validation as part of bundling the ontology for release, we will add an additional ontology file with an annotation signaling to the validation script that the label is not subject to the validation rules. We propose gist:nonConformingLabel for the annotation. See additional notes in PR #428.
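
The following Python sketch shows one way such an exemption could interact with validation. It is not the script from PR #428. It covers property labels only, the function names are hypothetical, the IRIs are illustrative, and it assumes the gist:nonConformingLabel annotations have already been read into a plain set of exempt IRIs.

```python
import re

# Assumption: an acronym is two or more capitals, optionally followed
# by a number with or without a hyphen (e.g. SSN, ISBN-10).
ACRONYM = re.compile(r"[A-Z]{2,}(?:-?[0-9]+)?")
# Lowercase words may contain digits and hyphens.
LOWER = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def property_label_ok(label: str) -> bool:
    """All words lowercase, except acronyms, which are kept intact."""
    words = label.split()
    return bool(words) and all(
        ACRONYM.fullmatch(w) or LOWER.fullmatch(w) for w in words
    )


def find_violations(property_labels: dict, exempt: set) -> list:
    """Return IRIs whose labels break the rule and are not exempt."""
    return sorted(
        iri
        for iri, label in property_labels.items()
        if iri not in exempt and not property_label_ok(label)
    )


labels = {
    "gist:hasSsn": "has SSN",
    "gist:unitSymbolUnicode": "unit symbol Unicode",  # proper noun
    "gist:hasGiver": "Has giver",                     # wrong casing
}
# Stand-in for the IRIs annotated with gist:nonConformingLabel.
exempt = {"gist:unitSymbolUnicode"}

print(find_violations(labels, exempt))  # ['gist:hasGiver']
```

Without the exemption, "unit symbol Unicode" would be reported as a violation, because a script cannot tell a proper noun from a casing error.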

Any objections to the annotation name should be voiced here.

semanticarts / gist

Update all labels according to a standard convention (replaces #20) #227
