Update to terminology around 'taxonomy', 'taxonomy terms' and 'vocabulary'

rasmus-storjohann-PG commented 6 years ago

As far as I can tell, in common usage, the AIRS is one taxonomy (much less frequently called a vocabulary) that contains a large number of entities that I believe are usually called taxonomy terms. In AIRS, each such term has an id (sometimes called code?) and a name. I'm new to this field, so I may not have gotten all of that right.

This naming doesn't align well with the naming used in the standard. The name of the taxonomy table implies that it contains one entry for each taxonomy, when it actually contains one entry for each taxonomy term. The vocabulary column identifies the taxonomy, so it would be much more consistent with common usage if this was called taxonomy_id, anticipating that this will become a foreign key once there is a table of data about the different taxonomies in use.

The description of the id column is unclear: what use cases are satisfied by prefixing the id that cannot be handled by the taxonomy_id (i.e. vocabulary) field?

The parent_name field seems unnecessary: the standard states that the id field is unique, so parent_id should be sufficient. If the id prefixing is removed, the id might no longer be unique on its own, but in that case the pair (id, taxonomy_id) would be unique. It seems to me that we can reasonably assume that no taxonomy term is the child of a term from a different taxonomy, so the parent of any term can be looked up using the (parent_id, taxonomy_id) of the child.
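To make the composite-key idea concrete, here is a minimal Python sketch. It assumes each term is a record with `id`, `taxonomy_id`, `name` and `parent_id` keys, where `taxonomy_id` is the proposed rename of `vocabulary`. The sample rows and the `find_parent` helper are made up for illustration and are not part of the standard.

```python
from typing import Optional

# Hypothetical sample rows: the id "1" appears in two taxonomies,
# so an id is only unique together with its taxonomy_id.
TERMS = [
    {"id": "1", "taxonomy_id": "taxonomy-a", "name": "Health", "parent_id": None},
    {"id": "2", "taxonomy_id": "taxonomy-a", "name": "Dentistry", "parent_id": "1"},
    {"id": "1", "taxonomy_id": "taxonomy-b", "name": "Housing", "parent_id": None},
]

# Index every term by the composite key (id, taxonomy_id).
INDEX = {(t["id"], t["taxonomy_id"]): t for t in TERMS}


def find_parent(term: dict) -> Optional[dict]:
    """Return the parent term, assuming parent and child share a taxonomy."""
    if term["parent_id"] is None:
        return None  # top term: no parent to look up
    return INDEX.get((term["parent_id"], term["taxonomy_id"]))
```

Under these assumptions, `find_parent(TERMS[1])` resolves to the "Health" term rather than "Housing", even though both have the id "1".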

timgdavies commented 6 years ago

Hello @rasmus-storjohann-PG - thanks for picking up on this issue.

You are right: the terminology in HSDS is not terribly consistent with wider usage. This is something that was inherited from earlier versions, but I'm marking this issue for consideration as part of the next 2.0 cycle of updates (although we might be able to clean up some of the descriptions before that).

For the time being, hopefully the mapping below helps make clear the intended semantics of the current fields.

Changes we might consider for the next upgrade would be:

Table: service_taxonomy -> service_taxonomy_terms

Field: service_taxonomy.taxonomy_id -> service_taxonomy_term.term_id

Field: service_taxonomy.taxonomy_detail -> service_taxonomy_term.term_detail

Table: taxonomy -> taxonomy_terms

Field: taxonomy.vocabulary -> taxonomy_terms.taxonomy

It's not easy to make those changes in a backwards compatible way, hence tagging this for 2.0.

Other issues

I've suggested above that we don't have a separate taxonomy table and foreign key, but would instead aim for publishers to converge on a codelist of 'taxonomy' names to aid interoperability.

The number of top terms (terms without parents) could be used to identify the different taxonomies present in a system.
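As a rough illustration of the top-terms idea, the Python sketch below filters a list of term records down to those without a parent. The record shape (`id`, `name`, `parent_id`) and the sample rows are assumptions made for this example, not taken from the specification.

```python
# Hypothetical rows; a missing or empty parent_id marks a top term.
TERMS = [
    {"id": "a1", "name": "Health", "parent_id": None},
    {"id": "a2", "name": "Dentistry", "parent_id": "a1"},
    {"id": "b1", "name": "Housing", "parent_id": None},
    {"id": "b2", "name": "Shelters", "parent_id": "b1"},
]


def top_terms(terms):
    """Return the terms that have no parent."""
    return [t for t in terms if not t.get("parent_id")]


if __name__ == "__main__":
    for term in top_terms(TERMS):
        print(term["id"], term["name"])
```

Whether each top term maps to exactly one taxonomy depends on how the source taxonomies are structured, so this is only a starting point.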

The redundancy of parent_name is a good question. Often data exchange standards will have some redundancy present, to recognise that users are often working with incomplete sets (e.g. if you get back the classification of a term from an API you might not get the full taxonomy tree, and having easy access to the parent name can be useful). But I'm not sure that justifies the inclusion here, so I would welcome views on whether this can be safely dropped from the next version.

There are some wider issues open about taxonomy terms and linked terms that might need to also be addressed here.

MikeThacker1 commented 4 years ago

First I'd like to concur that:

vocabulary and taxonomy are normally seen as synonymous, and "taxonomy" might be the better term to use. I also see the term "list" used, and SKOS uses the term "concept scheme".
"taxonomy term" is better than "taxonomy" for a single term.

Hence the name changes proposed by @timgdavies make sense.

Regarding using terms without parents as the identifiers for vocabularies/taxonomies: this implies, I think, that what SKOS defines as top-level terms would be given parent ids pointing to what SKOS defines as concept schemes. It would work, but it combines two types of entity (a concept scheme and a concept) in one field.

At present in our early implementation, we're just selecting distinct vocabulary references to get a unique list of vocabularies, but we have no links for those vocabularies. Hence a resolution using @timgdavies's approach or a separate table of vocabularies would really help.
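For readers unfamiliar with that workaround, a minimal sketch of the "select distinct vocabulary" approach is shown here using Python's built-in `sqlite3` module. The simplified `taxonomy` table, the sample rows and the `distinct_vocabularies` helper are assumptions for illustration; a real HSDS database will have more columns.

```python
import sqlite3


def distinct_vocabularies(conn: sqlite3.Connection) -> list:
    """List each vocabulary referenced by at least one taxonomy term."""
    rows = conn.execute(
        "SELECT DISTINCT vocabulary FROM taxonomy "
        "WHERE vocabulary IS NOT NULL ORDER BY vocabulary"
    )
    return [row[0] for row in rows]


# Hypothetical in-memory data, one row per taxonomy term.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxonomy (id TEXT PRIMARY KEY, name TEXT, vocabulary TEXT)")
conn.executemany(
    "INSERT INTO taxonomy VALUES (?, ?, ?)",
    [
        ("a1", "Health", "taxonomy-a"),
        ("a2", "Dentistry", "taxonomy-a"),
        ("b1", "Housing", "taxonomy-b"),
    ],
)
```

This yields only the vocabulary names; as noted above, nothing in the data links those names to a description or URL for each vocabulary.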

NeilMcKLogic commented 4 years ago

I like the precision these changes introduce, in general.

I do agree that retaining the parent_name and maybe adding parent_id is useful. The current spec calls for including the entire taxonomy system in any export so that references to its terms can be "looked up" by the receiving system. But this introduces two problems: 1) not all taxonomy systems are freely licensed to be distributed, and 2) if you are only sending a small dataset (say, a single record), the size of the reference taxonomy could be far larger than the actual payload of the record itself.
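The second problem can be sketched in Python as follows. The payload shape, the `export_service` helper and the sample data are all hypothetical and are not defined by the spec; the point is only that carrying `parent_name` alongside each referenced term lets a single-record export stay small.

```python
import json


def export_service(service: dict, term_ids: list, terms_by_id: dict) -> dict:
    """Build a payload carrying only the referenced terms, not the whole taxonomy."""
    return {
        "service": service,
        "taxonomy_terms": [
            {
                "id": tid,
                "name": terms_by_id[tid]["name"],
                # Redundant by design: the receiver may not hold the full tree.
                "parent_name": terms_by_id[tid].get("parent_name"),
            }
            for tid in term_ids
        ],
    }


# Hypothetical data: a two-term taxonomy and one service that uses one term.
TERMS_BY_ID = {
    "a1": {"name": "Health", "parent_name": None},
    "a2": {"name": "Dentistry", "parent_name": "Health"},
}
payload = export_service({"id": "svc-1", "name": "Dental clinic"}, ["a2"], TERMS_BY_ID)

if __name__ == "__main__":
    print(json.dumps(payload, indent=2))
```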

mrshll1001 commented 11 months ago

I am closing this as I believe we have introduced these changes already. In 3.0 at least, there are separate taxonomy and taxonomy_term schemas with an appropriate relationship. vocabulary is no longer present, and is replaced by other fields which provide either a relationship to the taxonomy object, or a free-text description of the existing taxonomy.

openreferral / specification

Update to terminology around 'taxonomy', 'taxonomy terms' and 'vocabulary' #181