lingualab / lingualabpy

Tools and utilities from the LINGUA laboratory
MIT License

Addition of more semantic features #9

Open clarkenj opened 5 months ago

clarkenj commented 5 months ago
  1. Semantic diversity (semD)

    • A measure of the distance between various contexts in which a word appears. Low diversity values indicate that a word is used in very narrow contexts (e.g. spinach), while high values indicate usage in more diverse contexts (e.g. predicament, and function words). (Hoffman et al., 2014)
    • Implementation based on latent semantic analysis reported in Hoffman et al. (2013). SemD values for 31,741 English words are provided.
    • Correlates with other psycholinguistic measures (frequency, imageability), but contributes independent variance.
    • Also reflects ambiguity: words with lower values are less ambiguous.
    • Few studies in AD, but semD may increase as MMSE score decreases (poster: Nevler et al., 2020)
  2. Contextual diversity

    • Similar to the above, but without the cosine calculation over the contexts in which the word appears, i.e. just the range of contexts. James et al. (2006)
    • Values are available from the SUBTLEX-US corpus (the percentage of films in which the word appears), but they do not appear to be split by content and function words.
  3. Word Mover's Distance

    • A measure of text distance (i.e. similarity) based on word embeddings, underpinned by the transportation optimization problem. Kusner et al. (2015)
    • I used it in my previous study between sentences and windows, and it was an important feature for classifying AD. I think it mostly reflects semantic coherence. Clarke et al. (2021)
    • Can be implemented for word2vec with Gensim, and there is also a library to use other types of embeddings.
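
To make item 3 concrete, here is a rough sketch of the idea in plain Python. It does not solve the full transportation problem. It computes the relaxed lower bound described by Kusner et al. (2015), in which each word sends all of its weight to its nearest word in the other text, and the larger of the two directions is kept. The embeddings in `TOY_VECTORS` and the function names are invented for illustration and are not part of lingualabpy.

```python
import math
from collections import Counter

# Toy 2-d embeddings, invented for illustration only (not real word2vec values).
TOY_VECTORS = {
    "cat": (1.0, 0.0),
    "dog": (0.9, 0.1),
    "car": (0.0, 1.0),
    "truck": (0.1, 0.9),
}


def nbow(tokens):
    """Normalised bag-of-words weights over in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in TOY_VECTORS)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(src, dst):
    """Each source word moves all of its weight to its nearest target word."""
    return sum(
        weight * min(math.dist(TOY_VECTORS[w], TOY_VECTORS[v]) for v in dst)
        for w, weight in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed Word Mover's Distance: a lower bound on the exact distance."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    if not a or not b:
        raise ValueError("each text needs at least one in-vocabulary token")
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


near = relaxed_wmd(["cat", "dog"], ["dog", "cat"])    # same words, reordered
far = relaxed_wmd(["cat", "dog"], ["car", "truck"])   # different topic
```

With real embeddings, Gensim's `KeyedVectors.wmdistance` computes the exact distance between two token lists, so the sketch above is only meant to show what the feature measures when applied to adjacent sentences or windows.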

Note: these are still lexico-semantic features.
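
A minimal sketch of how items 1 and 2 relate, under some loud assumptions: each context already has a vector (in Hoffman et al. these come from latent semantic analysis; here they are made-up 2-d values), semD is taken as the negative log of the mean pairwise cosine between the contexts containing the word, and contextual diversity is taken as the share of contexts containing the word. The names `TOY_CONTEXTS`, `contextual_diversity` and `semantic_diversity` are hypothetical, and the exact formula should be checked against the paper before relying on it.

```python
import math
from itertools import combinations

# (tokens in the context, context vector); all values invented for illustration.
TOY_CONTEXTS = [
    ({"the", "spinach", "salad"}, (1.0, 0.1)),
    ({"the", "spinach", "soup"}, (0.9, 0.2)),
    ({"the", "court", "ruling"}, (0.1, 1.0)),
    ({"the", "engine", "noise"}, (0.5, 0.5)),
]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def contextual_diversity(word, contexts):
    """Share of contexts in which the word appears (no similarity involved)."""
    return sum(word in tokens for tokens, _ in contexts) / len(contexts)


def semantic_diversity(word, contexts):
    """Negative log of the mean pairwise cosine between the word's contexts.

    Assumes the mean cosine is positive, which holds for the toy vectors here.
    """
    vectors = [vec for tokens, vec in contexts if word in tokens]
    if len(vectors) < 2:
        raise ValueError("need at least two contexts containing the word")
    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return -math.log(sum(sims) / len(sims))


for word in ("spinach", "the"):
    print(word,
          contextual_diversity(word, TOY_CONTEXTS),
          semantic_diversity(word, TOY_CONTEXTS))
```

In this toy data the function word appears in every context and those contexts point in different directions, so it scores higher on both measures than "spinach". That pattern shows why the two features can overlap in practice.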

clarkenj commented 4 months ago

@SophiePellerin what do you think of these? Including both 1 and 2 might be redundant.

SophiePellerin commented 4 months ago

I agree that 1 and 2 essentially get at the same thing, so keeping just 1 may be enough. 3 is fine to keep because it gets at something different: semantic similarity regardless of whether the words are used in the same contexts. It's very possible, even likely, that semantically similar words are used in similar contexts, but that's probably not always the case.