hdmf-dev / hdmf-common-schema

Specifications for pre-defined data structures provided by HDMF.

Rename `VocabData` to `EnumText` #29

Closed: rly closed this issue 4 years ago

rly commented 4 years ago

From discussion with @oruebel and @ajtritt

VocabData is restricted to string mappings and the name may be confusing in the ontologies world.

ajtritt commented 4 years ago

@bendichter Do you have any thoughts on this?

ajtritt commented 4 years ago

@rly @oruebel Can you remind me why "Vocabulary Data" is a confusing term for an array of text elements that comes from a set of unique text elements?

rly commented 4 years ago

@oruebel brought up the issue, so he would be able to shed more light. "Controlled vocabulary" has a clear definition as a selected list of terms, but "vocabulary" seems to be used colloquially to mean more than that, including relationships. The W3C says:

> On the Semantic Web, vocabularies define the concepts and relationships (also referred to as “terms”) used to describe and represent an area of concern.
>
> There is no clear division between what is referred to as “vocabularies” and “ontologies”. The trend is to use the word “ontology” for more complex, and possibly quite formal collection of terms, whereas “vocabulary” is used when such strict formalism is not necessarily used or only in a very loose sense.

https://www.w3.org/standards/semanticweb/ontology

oruebel commented 4 years ago

The main reason was to avoid confusing VocabData with ontologies. EnumText is specifically for the situation where we have a fixed list of terms (e.g., tags), and the name EnumText makes this clear.

ajtritt commented 4 years ago

I don't like EnumText since enum types are typically used to define a set of constants. I think the way we are using the data is more like this:

https://torchtext.readthedocs.io/en/latest/vocab.html

https://towardsdatascience.com/machine-learning-text-processing-1d5a2d638958

oruebel commented 4 years ago

One issue is that for vocabularies, folks will expect support for hierarchical organization of terms. Also, as we work to support more complex ontologies, the term "vocabulary" will likely lead to confusion. I agree that EnumText may not be the most intuitive name for users. What this type does is index a list of terms, so how about something like IndexedTerms?
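To make "index a list of terms" concrete, here is a minimal Python sketch of the idea. It is an illustration only and not HDMF's actual API or schema: the names `encode`, `decode`, `terms`, and `codes` are hypothetical.

```python
# Hypothetical sketch: store each text value as an integer code that
# indexes into a list of unique terms. Not HDMF's real implementation.

def encode(values):
    """Return (terms, codes): unique terms in first-seen order,
    plus one integer code per input value pointing into `terms`."""
    terms = []   # unique terms, in order of first appearance
    index = {}   # term -> position in `terms`
    codes = []   # one integer per input value
    for value in values:
        if value not in index:
            index[value] = len(terms)
            terms.append(value)
        codes.append(index[value])
    return terms, codes


def decode(terms, codes):
    """Recover the original text values from the terms and codes."""
    return [terms[code] for code in codes]


values = ["cat", "dog", "cat", "bird", "dog"]
terms, codes = encode(values)
print(terms)                 # unique terms
print(codes)                 # integer codes into `terms`
print(decode(terms, codes))  # round-trips back to `values`
```

Note that nothing here encodes relationships or hierarchy between terms; the structure is just a flat lookup list plus integer references into it.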

ajtritt commented 4 years ago

> One issue is that for Vocabularies folks will expect support for hierarchical organization of terms.

But this is not actually a controlled vocabulary (that term should be removed from the documentation) or a formalized vocabulary. The use of "vocabulary" here is based on the broadly understood meaning of the word, namely values that come from a set of unique words.

Here are example uses of this in three popular Python data science packages:

https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html?highlight=vocabulary

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_list
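For readers unfamiliar with that usage, the sense of "vocabulary" in those packages can be sketched in plain Python roughly as follows. This is an assumption-laden illustration and does not call any of the linked libraries; `build_vocabulary` and `count_vector` are hypothetical helper names.

```python
# Hypothetical sketch of "vocabulary" in the ML text-processing sense:
# a flat mapping from each unique token to an integer index.

def build_vocabulary(documents):
    """Map each unique lowercase token to an index, in first-seen order."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def count_vector(doc, vocab):
    """Count occurrences of each vocabulary token in `doc`.
    Tokens that are not in the vocabulary are ignored."""
    counts = [0] * len(vocab)
    for token in doc.lower().split():
        position = vocab.get(token)
        if position is not None:
            counts[position] += 1
    return counts


docs = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(docs)
vec = count_vector("the cat and the dog", vocab)
print(vocab)
print(vec)
```

In this sense a vocabulary is only a set of unique words with positions; it carries no relationships between the words.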

> Also, as we work to support more complex ontologies the term vocabulary will likely lead to confusion.

HDMF provides a tool for standardizing how data get stored. Storing the structure of ontologies and the relationships between controlled vocabularies is outside that scope. Any support for storing such relationships should be handled by domain- or ontology-specific schemas.

rly commented 4 years ago

Good points. VocabData is fine with me.

Here is another definition of "vocabulary" for ML: https://developers.google.com/machine-learning/data-prep/transform/transform-categorical