njtierney / syn

syn - the thesaurus
http://syn.njtierney.com/

homophones #24

coolbutuseless closed this 5 years ago

coolbutuseless commented 5 years ago

This is maybe getting out of scope, and I won't be at all offended if this PR is rejected.

Homophones are calculated by considering the phonetic encoding of the all_words list.

If two words have the same phonetic encoding, we consider them homophones.

Algorithms for phonetic encoding were DoubleMetaphone and Soundex.
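
To make the idea concrete, here is a minimal sketch of grouping words by phonetic code. It uses a simplified base-R Soundex written for this illustration, not the encoder used in this PR; `soundex()`, `hom_sketch()` and the short word vector are all hypothetical names and data.

```r
# Simplified American Soundex in base R (illustrative only, not the PR's encoder).
soundex <- function(word) {
  # Upper-case and drop everything that is not a letter, including spaces.
  chars <- strsplit(gsub("[^A-Z]", "", toupper(word)), "")[[1]]
  if (length(chars) == 0) return(NA_character_)
  map <- c(B = "1", F = "1", P = "1", V = "1",
           C = "2", G = "2", J = "2", K = "2", Q = "2", S = "2", X = "2", Z = "2",
           D = "3", T = "3", L = "4", M = "5", N = "5", R = "6")
  digits <- unname(map[chars])
  digits[is.na(digits)] <- "0"            # vowels, H, W and Y carry no code
  # H and W do not separate equal codes, so drop them after the first letter.
  keep <- !(chars %in% c("H", "W")) | seq_along(chars) == 1
  digits <- rle(digits[keep])$values      # collapse runs of the same code
  digits <- digits[-1]                    # the first letter is kept as a letter
  digits <- digits[digits != "0"]         # vowels only act as separators
  paste0(chars[1], paste(c(digits, "0", "0", "0")[1:3], collapse = ""))
}

# Hypothetical stand-in for the package's word list.
words <- c("great", "grate", "greet", "grid", "card", "cord", "carry out")

# Words that share a phonetic code land in the same group.
keys <- vapply(words, soundex, character(1))
groups <- split(words, keys)

# Candidate homophones: everything else in the same group as `word`.
hom_sketch <- function(word) setdiff(groups[[soundex(word)]], word)

hom_sketch("great")
hom_sketch("card")
```

Soundex keeps the first letter, so in this sketch "great" and "card" fall into different groups. Matches that cross initial letters, as in the output further down, would need an encoding such as DoubleMetaphone that also maps the first sound to a code.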

It works reasonably well considering the diversity of English pronunciation. However, we might have to ignore matches for any words with spaces in them: I believe the phonetic encoding just drops all spaces and treats the phrase as a single word, which is definitely going to give bad matches.
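
One possible way to ignore those entries is to filter the word list before encoding. This is only a sketch: `all_words` below is a small hypothetical stand-in for the package's list, and the PR itself does not do this.

```r
# Hypothetical stand-in for the package's word list.
all_words <- c("great", "grate", "carry out", "cry to", "gray-white", "card")

# Keep only entries without a space, so multi-word phrases are never encoded.
single_words <- all_words[!grepl(" ", all_words, fixed = TRUE)]

single_words
```

Hyphenated entries such as "gray-white" survive this filter; whether they should be treated as single words is a separate decision.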

library(syn)
hom('great')
#>  [1] "carat"        "card"         "carried"      "carried away"
#>  [5] "carrot"       "carroty"      "carry it"     "carry out"   
#>  [9] "carry to"     "carry weight" "cart"         "cart away"   
#> [13] "carte"        "Chiaretto"    "chord"        "cord"        
#> [17] "Corot"        "corrode"      "court"        "courtier"    
#> [21] "coward"       "cowherd"      "crate"        "create"      
#> [25] "credo"        "Credo"        "creed"        "crowd"       
#> [29] "CRT"          "crud"         "cruddy"       "crude"       
#> [33] "cruet"        "cry out"      "cry to"       "cue word"    
#> [37] "curate"       "curd"         "cured"        "curried"     
#> [41] "curt"         "garotte"      "garret"       "garrote"     
#> [45] "garrotte"     "Garuda"       "gourd"        "grad"        
#> [49] "grade"        "Grade A"      "grate"        "gray out"    
#> [53] "gray-white"   "grayed"       "grayout"      "greed"       
#> [57] "greedy"       "greet"        "grid"         "grit"        
#> [61] "gritty"       "groat"        "grot"         "grotto"      
#> [65] "grout"        "guard"        "gyrate"       "key word"    
#> [69] "krait"        "Kraut"        "quart"        "quarto"      
#> [73] "queered"      "quirt"

Created on 2018-11-28 by the reprex package (v0.2.1)

njtierney commented 5 years ago

I like it! I think it is out of scope for syn, although I do really like it. It also looks like there aren't any additional dependencies, so perhaps it makes sense to include it in the same package; otherwise it would involve duplicating work just to create a separate package.

I wonder if it would be worthwhile to consider a "words" set of R packages that emulate the words organisation. This would open up the scope for things like:

njtierney commented 5 years ago

OK so I reckon go ahead and add it!

But can you add the following:

njtierney commented 5 years ago

I'm going to move my earlier thoughts into an issue.

coolbutuseless commented 5 years ago

I think this implementation is still a bit half-baked. Retracting it until I can make it better.

coolbutuseless commented 5 years ago

I had a think about homophones, and by the time I found a good data source and figured out what I could do with it, I ended up having a package!

https://github.com/coolbutuseless/phon wraps the CMU pronouncing dictionary and generates:

I think this makes a good orthogonal companion package to syn: syn finds new words based on meaning, phon finds new words based on sound.

njtierney commented 5 years ago

Love it!