ulf1 / augtxt

yet another text augmentation python package
Apache License 2.0

fasttext based synonyms #31

Closed ulf1 closed 3 years ago

ulf1 commented 3 years ago

Approach A

  1. pick one word in the string (lower-cased)
  2. compute the word's vector
  3. add noise to the word vector
  4. look up the nearest word for the noisy vector and use it as a synonym
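
Approach A could be sketched roughly as follows. The embedding table, vectors, and the `noisy_synonym` helper are all hypothetical stand-ins; the real implementation would query a fastText model for word vectors and nearest neighbors.

```python
import math
import random

# Toy embedding table standing in for a fastText model
# (hypothetical vectors, for illustration only).
EMB = {
    "fast": [1.0, 0.0],
    "quick": [0.9, 0.1],
    "slow": [-1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def noisy_synonym(word, scale=0.05, seed=42):
    """Approach A: add Gaussian noise to the word's vector, then
    return the nearest *other* vocabulary word as a synonym candidate."""
    rng = random.Random(seed)
    noisy = [x + rng.gauss(0.0, scale) for x in EMB[word]]
    return max((w for w in EMB if w != word),
               key=lambda w: cosine(noisy, EMB[w]))

print(noisy_synonym("fast"))  # "quick" on this toy table
```

With a small noise scale the noisy vector stays close to the original, so the lookup tends to return a close semantic neighbor rather than the word itself (which is excluded explicitly here).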

Approach B

  1. pick one word in the string (lower-cased)
  2. look up the max_neighbors nearest neighbors
  3. filter by min_vector_score (Jaccard similarity between the words' vectors)
  4. filter by max_shingle_score (Jaccard similarity between the words' k-shingle sets)

The two similarity scores measure two different properties: the fastText-based score captures word semantics (lexis), while the character-based similarity identifies near duplicates so they can be excluded. In other words, we want to detect alternative words with similar semantics but different spelling.
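
The filtering step of Approach B might look like the sketch below. The neighbor list with its vector scores is assumed to come from a fastText nearest-neighbor query; `kshingles`, `jaccard`, and `filter_neighbors` are hypothetical helpers, and the thresholds are illustrative defaults.

```python
def kshingles(word, k=3):
    """Set of character k-shingles; words shorter than k yield themselves."""
    return {word[i:i + k] for i in range(max(1, len(word) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_neighbors(word, neighbors, min_vector_score=0.65,
                     max_shingle_score=0.35):
    """Approach B, steps 3-4: 'neighbors' is a list of
    (candidate, vector_score) pairs, e.g. from a fastText
    nearest-neighbor lookup (scores here are made up)."""
    out = []
    for cand, vscore in neighbors:
        if vscore < min_vector_score:
            continue  # semantically too distant
        if jaccard(kshingles(word), kshingles(cand)) > max_shingle_score:
            continue  # near-duplicate spelling, e.g. an inflected form
        out.append(cand)
    return out

# "laufens" is dropped as a spelling near-duplicate of "laufen",
# "haus" is dropped as semantically too far away.
print(filter_neighbors("laufen",
                       [("rennen", 0.8), ("laufens", 0.9), ("haus", 0.3)]))
```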

Approach C

  1. pick one word in the string (lower-cased)
  2. look up the word's synonyms in a dict [{wordaskey: [syn1, syn2, ...], ...}]
  3. if the word does not exist in the dict (see Approach B):
    • look up the max_neighbors nearest neighbors
    • filter by min_vector_score (Jaccard similarity between the words' vectors)
    • filter by max_shingle_score (Jaccard similarity between the words' k-shingle sets)
    • add the results to the dictionary

Approach C is just a technical feature to ensure that synonyms of frequent words are computed only once. It's a buffering mechanism.
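
The buffering mechanism reduces to a dict lookup with a compute-on-miss fallback; a minimal sketch, where `compute_synonyms` is a hypothetical stand-in for the Approach B lookup-and-filter step:

```python
calls = {"n": 0}

def compute_synonyms(word):
    """Stand-in for the Approach B neighbor lookup + filtering."""
    calls["n"] += 1  # count how often the expensive path runs
    return [word + "_syn"]

def get_synonyms(word, buf):
    """Approach C: return buffered synonyms, computing them only once."""
    if word not in buf:
        buf[word] = compute_synonyms(word)
    return buf[word]

buffer = {}
get_synonyms("haus", buffer)
get_synonyms("haus", buffer)   # second call is served from the buffer
print(calls["n"])  # 1 -- the expensive computation ran only once
```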

The dict object (in RAM) is stored in $HOME/augtxt_data/buffers/<some name>.json (on disk). In order to avoid frequent I/O operations, we load the dict once at startup and upsert it also just once at the end of the program. There must also be some sort of reset function to delete a JSON file.
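
The load-once / save-once / reset lifecycle could be sketched like this. The function names are hypothetical, and a temporary directory stands in for the actual $HOME/augtxt_data/buffers/ path:

```python
import json
import pathlib
import tempfile

def load_buffer(path):
    """Read the buffer dict from disk once at program start."""
    return json.loads(path.read_text()) if path.exists() else {}

def save_buffer(path, buf):
    """Upsert the whole buffer to disk once at program end."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(buf))

def reset_buffer(path):
    """The proposed reset function: delete the JSON buffer file."""
    path.unlink(missing_ok=True)  # requires Python 3.8+

with tempfile.TemporaryDirectory() as tmp:
    p = pathlib.Path(tmp) / "buffers" / "synonyms.json"
    buf = load_buffer(p)              # load once
    buf["haus"] = ["gebaeude"]        # buffer filled during the run
    save_buffer(p, buf)               # upsert once
    reloaded = load_buffer(p)
    reset_buffer(p)
    after_reset = load_buffer(p)

print(reloaded, after_reset)  # {'haus': ['gebaeude']} {}
```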