look up the word nearest to the noisy vector and use it as a synonym
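The noisy-vector lookup can be sketched as follows. This is a minimal illustration, not the library's implementation: the toy `embeddings` table, the `noisy_synonym` helper, and the Gaussian noise scale `sigma` are all assumptions; a real setup would use fastText vectors.

```python
import numpy as np

# Hypothetical toy embedding table standing in for real fastText vectors.
embeddings = {
    "quick": np.array([0.9, 0.1, 0.0]),
    "fast":  np.array([0.8, 0.2, 0.1]),
    "slow":  np.array([0.1, 0.9, 0.3]),
}

def noisy_synonym(word: str, sigma: float = 0.1, seed: int = 42) -> str:
    """Add Gaussian noise to the word's vector, then return the nearest
    other word by cosine similarity and use it as a synonym."""
    rng = np.random.default_rng(seed)
    vec = embeddings[word]
    noisy = vec + rng.normal(0.0, sigma, size=vec.shape)
    best, best_sim = None, -np.inf
    for cand, cvec in embeddings.items():
        if cand == word:
            continue  # never return the original word itself
        sim = cvec @ noisy / (np.linalg.norm(cvec) * np.linalg.norm(noisy))
        if sim > best_sim:
            best, best_sim = cand, sim
    return best
```

For a small `sigma`, the perturbed vector stays near the original, so the lookup returns a semantically close word (here, `noisy_synonym("quick")` yields `"fast"`).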
Approach B
pick one word in the string (lower-cased)
look up max_neighbors neighbors
filter by min_vector_score (Jaccard similarity between the words' vectors)
filter by max_shingle_score (Jaccard similarity between the words' k-shingle sets)
The two Jaccard similarity scores measure different properties: the fastText-based score captures word semantics (lexis), while the character-based score identifies and excludes near duplicates. In other words, we want to detect alternative words with similar semantics but different spelling.
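The two-stage filter above can be sketched like this. It is a simplified stand-in, not the library's code: `filter_candidates` is a hypothetical helper, and the `neighbors` mapping stands in for the vector-side Jaccard scores that the real pipeline would compute from fastText.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def k_shingles(word: str, k: int = 2) -> set:
    """Set of character k-shingles of a (lower-cased) word."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}

def filter_candidates(word: str, neighbors: dict,
                      min_vector_score: float = 0.5,
                      max_shingle_score: float = 0.4) -> list:
    """Keep neighbors that are semantically close (vector score high
    enough) but not near-duplicate spellings (shingle score low enough).

    `neighbors` maps candidate word -> vector-based similarity score.
    """
    kept = []
    shingles = k_shingles(word.lower())
    for cand, vec_score in neighbors.items():
        if vec_score < min_vector_score:
            continue  # not semantically close enough
        if jaccard(shingles, k_shingles(cand.lower())) > max_shingle_score:
            continue  # spelling too similar -> near duplicate, exclude
        kept.append(cand)
    return kept
```

For example, `filter_candidates("color", {"colour": 0.9, "hue": 0.8, "dog": 0.1})` drops "dog" (vector score too low) and "colour" (shingle overlap too high, i.e. a near duplicate), keeping only "hue".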
Approach C
pick one word in the string (lower-cased)
look up the word's synonyms in the dict [{wordaskey: [syn1, syn2, ...], ...}]
if they don't exist yet (see Approach B):
look up max_neighbors neighbors
filter by min_vector_score (Jaccard similarity between the words' vectors)
filter by max_shingle_score (Jaccard similarity between the words' k-shingle sets)
add the results to the dictionary
Approach C is just a technical feature to ensure that synonyms of frequent words are computed only once. It's a buffering mechanism.
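The get-or-compute pattern behind Approach C can be sketched as follows. The `get_synonyms` helper and its `compute` parameter are illustrative assumptions; `compute` stands in for whatever implements Approach B.

```python
def get_synonyms(word: str, buffer: dict, compute) -> list:
    """Return cached synonyms for a word, computing and caching on a miss.

    `compute` is a callable implementing Approach B (neighbors + both
    Jaccard filters); it is only invoked the first time a word is seen.
    """
    word = word.lower()
    if word not in buffer:
        buffer[word] = compute(word)  # expensive path, runs once per word
    return buffer[word]
```

A second lookup of the same word hits the dict and never re-runs the expensive neighbor search, which is the whole point of the buffer.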
The dict object (in RAM) is persisted to $HOME/augtxt_data/buffers/<some name>.json (on disk). To avoid frequent I/O operations, we load the dict once at startup and upsert it just once at the end of the program. There must also be some sort of reset function to delete a JSON file.
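The load-once/upsert-once lifecycle could look like this minimal sketch. The file name `synonyms.json` is an assumption (the actual name under `$HOME/augtxt_data/buffers/` is not specified), and these helpers are illustrative, not the library's API.

```python
import json
from pathlib import Path

# Default location; the file name "synonyms.json" is an assumption.
DEFAULT_BUFFER = Path.home() / "augtxt_data" / "buffers" / "synonyms.json"

def load_buffer(path: Path = DEFAULT_BUFFER) -> dict:
    """Read the dict once at program start; empty dict if no file exists."""
    return json.loads(path.read_text()) if path.exists() else {}

def save_buffer(buf: dict, path: Path = DEFAULT_BUFFER) -> None:
    """Write the whole dict back to disk once, at the end of the program."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(buf))

def reset_buffer(path: Path = DEFAULT_BUFFER) -> None:
    """Delete the JSON file to force recomputation of all synonyms."""
    path.unlink(missing_ok=True)
```

Keeping all reads and writes at program boundaries means the hot path only ever touches the in-memory dict.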