Closed by rly 4 years ago
@bendichter Do you have any thoughts on this?
@rly @oruebel Can you remind me why "Vocabulary Data" is a confusing term for an array of text elements that comes from a set of unique text elements?
@oruebel brought up the issue, so he would be able to shed more light. "Controlled vocabulary" has a clear definition as a selected list of terms, but "vocabulary" seems to be used colloquially to mean more than that, including relationships. The W3C says:
On the Semantic Web, vocabularies define the concepts and relationships (also referred to as “terms”) used to describe and represent an area of concern.
There is no clear division between what is referred to as “vocabularies” and “ontologies”. The trend is to use the word “ontology” for more complex, and possibly quite formal collection of terms, whereas “vocabulary” is used when such strict formalism is not necessarily used or only in a very loose sense.
The main reason was to avoid confusion of VocabularyData with Ontologies. EnumText is specifically for the situation where we have a set list of terms (e.g., tags), and the term EnumText makes this clear.
I don't like EnumText since enum types are typically used to define a set of constants. I think the way we are using the data is more like this:
https://torchtext.readthedocs.io/en/latest/vocab.html https://towardsdatascience.com/machine-learning-text-processing-1d5a2d638958
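To make the distinction concrete, here is a minimal sketch in plain Python. The names (`Color`, `tokens`, `vocab`, `encoded`) are made up for illustration and are not part of HDMF or of the linked libraries. An enum fixes its values as constants in code, while a vocabulary in the ML sense is derived from whatever unique terms appear in the data.

```python
from enum import Enum

# An enum: the allowed values are constants declared up front in code.
class Color(Enum):
    RED = 0
    GREEN = 1

# A vocabulary in the ML sense: the unique terms are derived from the data.
tokens = ["cat", "dog", "cat", "bird"]  # hypothetical example data
vocab = {term: i for i, term in enumerate(sorted(set(tokens)))}

# Each element is then represented by the index of its term in the vocabulary.
encoded = [vocab[t] for t in tokens]
```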
One issue is that for Vocabularies folks will expect support for hierarchical organization of terms. Also, as we work to support more complex ontologies the term vocabulary will likely lead to confusion. I agree that EnumText may not be the most intuitive name for users. What this type does is index a list of terms, so how about something like IndexedTerms?
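As a rough illustration of what "indexing a list of terms" could mean, here is a sketch under stated assumptions. The `IndexedTerms` class below is hypothetical and only borrows the proposed name; it is not HDMF's implementation or API. It stores each unique term once and keeps one integer index per element.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IndexedTerms:
    """Hypothetical illustration only; not an HDMF class."""
    terms: List[str] = field(default_factory=list)    # unique terms, in first-seen order
    indices: List[int] = field(default_factory=list)  # one index into `terms` per element

    def append(self, term: str) -> None:
        # Register the term the first time it is seen, then store only its index.
        if term not in self.terms:
            self.terms.append(term)
        self.indices.append(self.terms.index(term))

    def __getitem__(self, i: int) -> str:
        # Resolve the stored index back to its term.
        return self.terms[self.indices[i]]

    def __len__(self) -> int:
        return len(self.indices)


tags = IndexedTerms()
for t in ["good", "bad", "good", "noisy"]:
    tags.append(t)
```

Note that nothing here models hierarchy or relationships between terms; the structure is just a flat list of unique strings plus indices into it.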
One issue is that for Vocabularies folks will expect support for hierarchical organization of terms.
But this is not actually a controlled vocabulary (that wording should be removed from the documentation) or a formalized vocabulary. The use of "vocabulary" here is based on the broadly understood meaning of the word, namely values that come from a set of unique words.
Here are example uses of this in three popular Python data science packages:
https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html?highlight=vocabulary https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_list
Also, as we work to support more complex ontologies the term vocabulary will likely lead to confusion.
HDMF provides a tool for standardizing how data get stored. Storing the structure of ontologies and relationships between controlled vocabularies is out of scope. Any support for storing such relationships should be handled by domain- or ontology-specific schemas.
Good points. VocabData is fine with me.
Here is another definition of "vocabulary" for ML: https://developers.google.com/machine-learning/data-prep/transform/transform-categorical
From discussion with @oruebel and @ajtritt: VocabData is restricted to string mappings and the name may be confusing in the ontologies world.