tdwg / dwc-qa

Public question and answer site for discussions about Darwin Core
Apache License 2.0

Use of capital letters in controlled vocabularies #169

Open EstebanMH-SiB opened 3 years ago

EstebanMH-SiB commented 3 years ago

I would like to know why some vocabularies (basisOfRecord and type) have the first letter in upper case, while the vast majority only use lower case (sex, lifeStage, behavior, taxonRank, pathway, etc). Should we use lower case for all controlled vocabularies besides basisOfRecord and type?

Regarding this same topic, in our guidelines in Spanish, we have all controlled vocabularies with the first letter in upper case. Is this a bad practice? Should we use lower case?

Thanks in advance for your answer, Esteban SiB Colombia @RicardoOrtizG @camiplata @SiBColombia

dagendresen commented 3 years ago

I believe that Darwin Core normally uses lower case first letter for properties and upper case first letter for classes.

gkampmeier commented 3 years ago

You will note that all of the terms where the first letter is in upper case are deprecated and it is recommended that they no longer be used (at least I didn't see any where this was not true). There is probably additional history here @tucotuco

baskaufs commented 3 years ago

I'd like to distinguish here between labels, property/class term local names, and controlled value strings.

Property, class, and controlled value terms all have labels that can be in a variety of languages. These strings are not controlled but are to help people understand what the term is.

Local names form the last part of the term IRI, but commonly they are conflated with labels, particularly in Darwin Core. These are what @dagendresen was talking about. Examples: recordedBy is the local name for the property http://rs.tdwg.org/dwc/terms/recordedBy (abbreviated as dwc:recordedBy) and PreservedSpecimen is the local name for http://rs.tdwg.org/dwc/terms/PreservedSpecimen (dwc:PreservedSpecimen). In nearly all TDWG vocabularies, property local names are in "lower camelCase" and class local names are in "upper CamelCase".
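As an illustration of the local-name convention described above, here is a minimal Python sketch. The helper names `local_name` and `looks_like_class` are made up for this example, and the first-letter check is only a heuristic that assumes the term follows the convention.

```python
import re

def local_name(iri):
    """Return the last segment of a term IRI (after the final '/' or '#')."""
    return re.split(r"[/#]", iri.rstrip("/#"))[-1]

def looks_like_class(iri):
    """Heuristic only: by convention, class local names start with an upper-case letter."""
    return local_name(iri)[:1].isupper()

print(local_name("http://rs.tdwg.org/dwc/terms/recordedBy"))                # recordedBy
print(looks_like_class("http://rs.tdwg.org/dwc/terms/PreservedSpecimen"))   # True
```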

Controlled value strings are particular strings designated to be used in spreadsheets as values for a property. There are currently no "rules" for controlled value strings, although there are some best practices that we are trying to follow in the cases where controlled vocabularies have been adopted as TDWG standards. You can look at some of the examples at https://dwc.tdwg.org/em/ and https://dwc.tdwg.org/pw/ . In ratified controlled vocabularies, there are not different controlled value strings for different languages (that's conflating them with labels). In new controlled vocabularies, we are trying to avoid spaces and non-alphanumeric characters. That often takes the form of lower camelCase of an English phrase since it follows that rule and has consistent capitalization.
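To make the "no spaces, no non-alphanumeric characters, lower camelCase" practice concrete, here is a small Python check. The regular expression is an assumption about how one might encode that practice, not an official TDWG rule, and `follows_new_style` is a hypothetical name.

```python
import re

# Assumed pattern: a lower-case ASCII letter, then only ASCII letters and digits.
LOWER_CAMEL = re.compile(r"[a-z][A-Za-z0-9]*")

def follows_new_style(value):
    """True if the string has no spaces or punctuation and starts in lower case."""
    return LOWER_CAMEL.fullmatch(value) is not None

print(follows_new_style("releasedForUse"))    # True
print(follows_new_style("released for use"))  # False
```

Values from older vocabularies, such as MIME types with slashes, would fail this check by design.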

The case of values for dwc:basisOfRecord is atypical for controlled vocabularies. In the case of controlled vocabularies other than dwc:basisOfRecord, the controlled value will be a concept that can be represented by either the controlled value string (e.g. releasedForUse) or an opaque IRI identifier (http://rs.tdwg.org/dwcpw/values/p007). But dwc:basisOfRecord is essentially a description of the type of the subject resource. By convention, values for type properties are classes. By historic convention, the recommended values for dwc:basisOfRecord have been the local names of Darwin Core or Dublin Core class terms. Since, as I described previously, the local names of Darwin Core classes are in upper CamelCase, that's why they are in that form.
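The two representations of one concept can be pictured as a simple two-way lookup. This Python sketch uses only the single pair mentioned above; a real table would be loaded from the published vocabulary, and the names `STRING_TO_IRI`, `to_iri`, and `to_string` are hypothetical.

```python
# Only the pair quoted in this thread; a complete table would come from the vocabulary itself.
STRING_TO_IRI = {
    "releasedForUse": "http://rs.tdwg.org/dwcpw/values/p007",
}
IRI_TO_STRING = {iri: value for value, iri in STRING_TO_IRI.items()}

def to_iri(value):
    """Controlled value string -> IRI, or None if the string is not in the table."""
    return STRING_TO_IRI.get(value)

def to_string(iri):
    """IRI -> controlled value string, or None if the IRI is not in the table."""
    return IRI_TO_STRING.get(iri)

print(to_iri("releasedForUse"))
```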

There are some cases where the controlled vocabularies have historical reasons for not following the conventions that I described. For example, historically the controlled value string for dc:format in Audubon Core is supposed to be a MIME type (Internet Media type). Since MIME types contain slashes, which are non-alphanumeric, the controlled values have slashes in them. In another example, the controlled values for ac:variantLiteral were specified in upper CamelCase in the original specification, so we kept them to avoid breaking past implementations (even though they don't follow the pattern of lower camelCase used in new controlled vocabularies).

jocelynpender commented 2 years ago

@baskaufs Thank you for your explanation above. I may need to re-read it a few times :)

I have a related question and I'm hoping for your thoughts. When massaging and curating text data, or even developing a vocabulary list of terms (controlled value strings, like https://dwc.tdwg.org/em/), where can I find a description of best practices? My intuition tells me that lowercase is better than sentence case (less text preprocessing required when using the data), and it seems that camelCase is what the community prefers for controlled value strings. Is there someplace to see these best practices explained? Why is camelCase preferred? How do I justify this when working with a team to develop standards?

Thank you!

baskaufs commented 2 years ago

@jocelynpender Just to be completely honest, I don't think there are any "official" rules about controlled vocabularies, at least in TDWG. If you look at SKOS, which I would consider an "official" standard about controlled vocabularies (technically thesauri), they don't even use controlled value strings. They only have multilingual labels and IRIs for the concepts. However, most lay persons who are talking about "controlled vocabularies" are expecting someone to tell them a single string that they should always use in a database or spreadsheet. So in official TDWG controlled vocabularies (the 6 that exist so far), we mint an IRI, but also specify a "controlled value string" that meets people's expectations about there being a single string to use.

I was involved in creating all 6 of those first controlled vocabularies and since we didn't have any precedent to follow, I just asked myself where the problems occur with variations in strings used as values. They seemed to be:

The solution to some of these seemed obvious. Don't use punctuation, use only ASCII characters (preferably only Latin characters), and don't use spaces. In general, the solution for capitalization would seem to be to only use lower case, and that was what we did when the controlled values were based on a single word.

However, it made sense in some cases for the controlled string to be formed from a short phrase. In that case, if we wanted to avoid spaces there were only a few options: dashes (e.g. released-for-use), snake case (e.g. released_for_use), or camelCase (releasedForUse). Of these three, camelCase seemed the most "safe" since dashes and underscores have special meanings or uses in some programming languages. Most of the time the camelCase construction is completely obvious, which makes it a good choice. We had to decide on something, so we just did it.
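The three space-free options can be compared side by side with a short Python sketch. The function names are hypothetical, and splitting on runs of ASCII letters and digits is an assumption about how a phrase would be tokenized.

```python
import re

def words(phrase):
    """Lower-case the phrase and split it into runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", phrase.lower())

def lower_camel(phrase):
    parts = words(phrase)
    return parts[0] + "".join(p.capitalize() for p in parts[1:]) if parts else ""

def dashed(phrase):
    return "-".join(words(phrase))

def snake(phrase):
    return "_".join(words(phrase))

print(lower_camel("released for use"))  # releasedForUse
print(dashed("released for use"))       # released-for-use
print(snake("released for use"))        # released_for_use
```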

So I would say that we now have a precedent (camelCase) for controlled value strings within TDWG for "official" controlled vocabularies. So I think it would be great to stick with it because if people get used to camelCase, they can remember or even anticipate what the correct string should be.

I should note that the officially ratified controlled vocabularies are only part of the picture. There are many Darwin Core terms that say "Best practice is to use a controlled vocabulary", but no official TDWG controlled vocabulary exists. There is an effort that @pzermoglio and @timrobertson100 and probably others are involved in to create vocabularies of values based on actual string usage in GBIF. At some point, those vocabularies could become ratified controlled vocabularies if they were stable enough. But some may never reach that point. I'm not sure what patterns they have been following on the strings they recommend (for example, if most people are using strings with spaces in them, do they just go with it or try to enforce something like camelCase?). They may want to comment here on their criteria.

jocelynpender commented 2 years ago

Thanks, @baskaufs for your thoughtful response! Has TDWG ever run into pushback on its adoption of camelCase for value strings or property/class term local names? Perhaps folks are using SQL databases that are case insensitive, which makes the adoption of Darwin Core for database field names, value strings etc. less comfortable (as opposed to snake_case).

baskaufs commented 2 years ago

@jocelynpender As far as I know, there hasn't been pushback about this. I hadn't been aware of this potential problem, otherwise we might have gone for snake_case instead. However, at this point the genie is somewhat out of the bottle since six of the controlled vocabularies went all the way through public comment and ratification without anyone raising this issue. So it might be difficult to change the ones that already exist.

I'm trying to think through how situations would play out where case insensitivity might be a problem. At least within a particular implementation it doesn't seem like it would be a problem with the controlled strings, as long as everywhere was case-insensitive. There aren't any cases in the controlled vocabularies where there is one controlled value string that has another one that differs only in capitalization. However, if there were an export from a database which lost the capitalization, that would be a problem.
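Both points can be sketched in Python: checking that no two strings in a vocabulary differ only in capitalization, and, given that, restoring the capitalization after a lossy export. The function names and the sample vocabulary are hypothetical; only `releasedForUse` comes from this thread.

```python
from collections import defaultdict

def case_collisions(values):
    """Group values that are identical once case is ignored; return only the clashing groups."""
    groups = defaultdict(set)
    for value in values:
        groups[value.casefold()].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}

def restore_case(value, vocabulary):
    """Map a case-mangled value back to its canonical spelling, or None if unknown."""
    canonical = {v.casefold(): v for v in vocabulary}
    return canonical.get(value.casefold())

vocabulary = ["releasedForUse", "someOtherValue"]  # second entry is a placeholder
print(case_collisions(vocabulary))                 # {}
print(restore_case("releaseduse", vocabulary))     # None
print(restore_case("releasedforuse", vocabulary))  # releasedForUse
```

The restore step is only safe when `case_collisions` returns an empty result for the vocabulary.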

A rather straightforward solution to the problem would be to exclusively use the IRI-valued analogs. They exist for all of the ratified vocabularies. We went with opaque IRI local names that consisted only of lower-case letters and numerals. None of the namespaces have uppercase either. So the problem you mentioned would not exist for the IRI values. At least in Audubon Core, the pre-existing recommendation was to use the IRI-valued terms in preference to the literal value terms. But I have no idea what's actually happening in the wild.

albenson-usgs commented 2 years ago

It's interesting that this hasn't come up before. When OBIS-USA had a PostgreSQL database we had this issue so we used snake_case instead. We also couldn't use class because that's a reserved term in PostgreSQL so it became taxa_class. As far as I remember the values in the tables are not affected by this, just the column headers. Since we used IPT to share the data and IPT can detect basis_of_record = basisOfRecord I think it wasn't too problematic to use snake case.
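For anyone wanting to do this kind of header conversion by hand, here is a rough Python sketch. It is not how IPT matches headers; the function names are hypothetical and the regular expression is just one way to find the word boundaries.

```python
import re

def camel_to_snake(name):
    """basisOfRecord -> basis_of_record (insert '_' before each upper-case run, then lower-case)."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

def snake_to_camel(name):
    """basis_of_record -> basisOfRecord."""
    first, *rest = name.split("_")
    return first + "".join(part.capitalize() for part in rest)

print(camel_to_snake("basisOfRecord"))    # basis_of_record
print(snake_to_camel("basis_of_record"))  # basisOfRecord
```

A renamed column such as taxa_class would not convert back to class this way, so it needs an explicit exception.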

baskaufs commented 2 years ago

Interesting, @albenson-usgs . If the table values aren't affected, then the controlled value strings shouldn't be too much of a problem because they would just be table values. The issue of camelCase in column headers is unavoidable because there are tons of property local names that use camelCase and that's been a fait accompli for many years, predating the controlled vocabularies by a long time.

It seems like a good practice to make a mapping from column headers to properties anyway rather than depending on the header strings to determine the property. That's the case with meta.xml in IPT and in other systems. It's primarily a problem in Simple Darwin Core spreadsheets, and they by definition won't be SQL databases.
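The idea of an explicit mapping can be sketched in Python as a plain lookup from local column headers to term IRIs. This is not the meta.xml format, only the same idea in miniature; `HEADER_TO_TERM` and `map_row` are hypothetical names and the column headers are invented examples.

```python
# Hypothetical local headers mapped to Darwin Core term IRIs.
HEADER_TO_TERM = {
    "basis_of_record": "http://rs.tdwg.org/dwc/terms/basisOfRecord",
    "taxa_class": "http://rs.tdwg.org/dwc/terms/class",
    "recorded_by": "http://rs.tdwg.org/dwc/terms/recordedBy",
}

def map_row(row, header_to_term=HEADER_TO_TERM):
    """Re-key a row by term IRI; columns without a mapping are dropped."""
    return {header_to_term[h]: v for h, v in row.items() if h in header_to_term}

row = {"basis_of_record": "PreservedSpecimen", "internal_id": "42"}
print(map_row(row))
```

With a mapping like this, the spelling or case of the local header no longer matters, since the property is identified by the IRI.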