ulf1 / augtxt

yet another text augmentation python package
Apache License 2.0

fasttext based synonyms #31

Closed ulf1 closed 3 years ago

ulf1 commented 3 years ago

Approach A

  1. pick one word in the string (lower-cased)
  2. compute the word's vector
  3. add noise to the word vector
  4. look up the nearest word for the noisy vector and use it as a synonym
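
Approach A could be sketched roughly as follows. The embedding table, vectors, and the `noisy_synonym` helper are all hypothetical stand-ins; the real implementation would query a fastText model for word vectors and nearest neighbors.

```python
import math
import random

# Toy embedding table standing in for a fastText model
# (hypothetical vectors, for illustration only).
EMB = {
    "fast": [1.0, 0.0],
    "quick": [0.9, 0.1],
    "slow": [-1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def noisy_synonym(word, scale=0.05, seed=42):
    """Approach A: add Gaussian noise to the word's vector, then
    return the nearest *other* vocabulary word as a synonym candidate."""
    rng = random.Random(seed)
    noisy = [x + rng.gauss(0.0, scale) for x in EMB[word]]
    return max((w for w in EMB if w != word),
               key=lambda w: cosine(noisy, EMB[w]))

print(noisy_synonym("fast"))  # "quick" on this toy table
```

With a small noise scale the noisy vector stays close to the original, so the lookup tends to return a close semantic neighbor rather than the word itself (which is excluded explicitly here).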

Approach B

  1. pick one word in the string (lower-cased)
  2. look up the max_neighbors nearest neighbors
  3. filter by min_vector_score (Jaccard similarity between the words' vectors)
  4. filter by max_shingle_score (Jaccard similarity between the words' k-shingle sets)

The two similarity scores measure two different properties: the fastText-based score captures word semantics (lexis), while the character-based similarity identifies near duplicates so they can be excluded. In other words, we want to detect alternative words with similar semantics but different spelling.
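
The filtering step of Approach B might look like the sketch below. The neighbor list with its vector scores is assumed to come from a fastText nearest-neighbor query; `kshingles`, `jaccard`, and `filter_neighbors` are hypothetical helpers, and the thresholds are illustrative defaults.

```python
def kshingles(word, k=3):
    """Set of character k-shingles; words shorter than k yield themselves."""
    return {word[i:i + k] for i in range(max(1, len(word) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_neighbors(word, neighbors, min_vector_score=0.65,
                     max_shingle_score=0.35):
    """Approach B, steps 3-4: 'neighbors' is a list of
    (candidate, vector_score) pairs, e.g. from a fastText
    nearest-neighbor lookup (scores here are made up)."""
    out = []
    for cand, vscore in neighbors:
        if vscore < min_vector_score:
            continue  # semantically too distant
        if jaccard(kshingles(word), kshingles(cand)) > max_shingle_score:
            continue  # near-duplicate spelling, e.g. an inflected form
        out.append(cand)
    return out

# "laufens" is dropped as a spelling near-duplicate of "laufen",
# "haus" is dropped as semantically too far away.
print(filter_neighbors("laufen",
                       [("rennen", 0.8), ("laufens", 0.9), ("haus", 0.3)]))
```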

Approach C

  1. pick one word in the string (lower-cased)
  2. look up the word's synonyms in a dict [{wordaskey: [syn1, syn2, ...], ...}]
  3. if the word does not exist in the dict (see Approach B):
    • look up the max_neighbors nearest neighbors
    • filter by min_vector_score (Jaccard similarity between the words' vectors)
    • filter by max_shingle_score (Jaccard similarity between the words' k-shingle sets)
    • add the results to the dictionary

Approach C is just a technical feature to ensure that synonyms of frequent words are computed only once. It's a buffering mechanism.
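
The buffering mechanism reduces to a dict lookup with a compute-on-miss fallback; a minimal sketch, where `compute_synonyms` is a hypothetical stand-in for the Approach B lookup-and-filter step:

```python
calls = {"n": 0}

def compute_synonyms(word):
    """Stand-in for the Approach B neighbor lookup + filtering."""
    calls["n"] += 1  # count how often the expensive path runs
    return [word + "_syn"]

def get_synonyms(word, buf):
    """Approach C: return buffered synonyms, computing them only once."""
    if word not in buf:
        buf[word] = compute_synonyms(word)
    return buf[word]

buffer = {}
get_synonyms("haus", buffer)
get_synonyms("haus", buffer)   # second call is served from the buffer
print(calls["n"])  # 1 -- the expensive computation ran only once
```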

The dict object (in RAM) is stored in $HOME/augtxt_data/buffers/<some name>.json (on disk). In order to avoid frequent I/O operations, we load the dict once at startup and upsert it also just once at the end of the program. There must also be some sort of reset function to delete a JSON file.
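
The load-once / save-once / reset lifecycle could be sketched like this. The function names are hypothetical, and a temporary directory stands in for the actual $HOME/augtxt_data/buffers/ path:

```python
import json
import pathlib
import tempfile

def load_buffer(path):
    """Read the buffer dict from disk once at program start."""
    return json.loads(path.read_text()) if path.exists() else {}

def save_buffer(path, buf):
    """Upsert the whole buffer to disk once at program end."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(buf))

def reset_buffer(path):
    """The proposed reset function: delete the JSON buffer file."""
    path.unlink(missing_ok=True)  # requires Python 3.8+

with tempfile.TemporaryDirectory() as tmp:
    p = pathlib.Path(tmp) / "buffers" / "synonyms.json"
    buf = load_buffer(p)              # load once
    buf["haus"] = ["gebaeude"]        # buffer filled during the run
    save_buffer(p, buf)               # upsert once
    reloaded = load_buffer(p)
    reset_buffer(p)
    after_reset = load_buffer(p)

print(reloaded, after_reset)  # {'haus': ['gebaeude']} {}
```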