memect / hao

好东西传送门 (Good Stuff Portal)

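@付超群 Is there an algorithm or library for computing pronunciation similarity in Chinese? For example: 北京, 百斤, 鼻颈, 背景. Even better if it can also compare English, e.g. peking, beking. Thanks. #164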

Closed haoawesome closed 10 years ago

haoawesome commented 10 years ago

Private message

haoawesome commented 10 years ago

Preliminary answer

Concepts

Phonetic similarity: literature

http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html

http://saffron.insight-centre.org/acl/topic/phonetic_similarity/ Phonetic algorithms

https://homes.cs.washington.edu/~bhixon/papers/phonemic_similarity_metrics_Interspeech_2011.pdf Phonemic Similarity Metrics to Compare Pronunciation Methods (2011)

http://webdocs.cs.ualberta.ca/~kondrak/papers/lingdist.pdf Evaluation of Several Phonetic Similarity Algorithms on the Task of Cognate Identification (2006)

http://webdocs.cs.ualberta.ca/~kondrak/papers/chum.pdf Phonetic alignment and similarity (2003)

Chinese

http://www.aclweb.org/anthology/P/P06/P06-1125.pdf A Phonetic-Based Approach to Chinese Chat Text Normalization (a method for Chinese)

http://aclanthology.info/papers/automatic-identification-of-phonetic-complements-for-chinese-characters-based-on-optimization-and-probability-distribution-in-chinese

Phonetic similarity: algorithms and open-source code


- Soundex
- Daitch–Mokotoff Soundex
- Kölner Phonetik
- Metaphone and Double Metaphone
- New York State Identification and Intelligence System (NYSIIS)
- Match Rating Approach (MRA)
- Caverphone

- https://github.com/elasticsearch/elasticsearch-analysis-phonetic/ (Java)
- https://github.com/maros/Text-Phonetic (Perl)
- https://github.com/dotcypress/phonetics (Go)
- https://github.com/lukelex/soundcord (Ruby)
- https://github.com/Simmetrics/simmetrics (Java)
- https://github.com/oubiwann/metaphone, https://pypi.python.org/pypi/Metaphone/0.4 (Python)
- https://bitbucket.org/yougov/fuzzy, https://pypi.python.org/pypi/Fuzzy/1.0 (Python)
- https://github.com/sunlightlabs/jellyfish, https://pypi.python.org/pypi/jellyfish/0.3.2 (Python)
- https://github.com/rockymadden/stringmetric (Scala)
- https://github.com/Yomguithereal/clj-fuzzy (Clojure)
- https://github.com/NaturalNode/natural (Node.js, JavaScript)

Source: Wikipedia, GitHub

haoawesome commented 10 years ago

http://en.wikipedia.org/wiki/Homonym In linguistics, a homonym is, in the strict sense, one of a group of words that share the same pronunciation but may have different meanings.

http://en.wikipedia.org/wiki/Soundex Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.

http://stackoverflow.com/questions/17010516/how-to-detect-how-similar-a-speech-recording-is-to-another-speech-recording How to detect how similar a speech recording is to another speech recording?

http://csl.ira.uka.de/fileadmin/Vorlesungen/WS2010-11/ATSP/presentations/IvayloJanev_HomophonesInASR.pdf How to solve homophone problems in Automatic Speech Recognition?

http://web.stanford.edu/class/cs124/lec/sem Word Meaning and Similarity Word Senses and Word Relations

https://github.com/lukelex/soundcord A phonetic algorithm to make comparison by phonetically similar terms easier.

http://www.psy.ntu.edu.tw/jtwu/jtwu/publish/%E6%9C%9F%E5%88%8A%E8%AB%96%E6%96%87/Chen%20Vaid%20&%20Wu%202009%20LCP%20Homophone%20Density.pdf Chen, Hsin-Chin, Vaid, Jyotsna, and Wu, Jei-Tun (2009). "Homophone density and phonological frequency in Chinese word recognition." Language and Cognitive Processes, 24:7, 967-982.

http://dl.acm.org/citation.cfm?id=1282081 A phonetic similarity model for automatic extraction of transliteration pairs 2007

https://twpl.library.utoronto.ca/index.php/twpl/article/download/6196/3185 Phonetic similarity and phonemic contrast in loanword adaptation Kevin Heffernan

haoawesome commented 10 years ago

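http://www.aclweb.org/anthology/O00-1005 反向異文字音譯相似度評量方法與跨語言資訊檢索 (2000), roughly: a similarity measure for backward transliteration between different scripts, applied to cross-language information retrieval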

haoawesome commented 10 years ago

http://www.eejournal.ktu.lt/index.php/elt/article/viewFile/2628/1917 Predicting the Acoustic Confusability between Words for a Speech Recognition System using Levenshtein Distance

haoawesome commented 10 years ago

http://www.let.rug.nl/alfa/ling-distances/advertisement.html Workshop on Linguistic Distances, 2006

http://spraakbanken.gu.se/eng/research/digital-areal-linguistics/workshop-october-2011/program/program Workshop on comparing approaches to measuring linguistic differences 24-25 October 2011, University of Gothenburg

haoawesome commented 10 years ago

http://webdocs.cs.ualberta.ca/~kondrak Greg Kondrak Associate Professor

Department of Computing Science Athabasca Hall 221 University of Alberta Edmonton, Alberta, T6G 2E8 Canada

haoawesome commented 10 years ago

http://en.wikipedia.org/wiki/Phonetic_algorithm The following is copied from Wikipedia:

A phonetic algorithm is an algorithm for indexing of words by their pronunciation. Most phonetic algorithms were developed for use with the English language; consequently, applying the rules to words in other languages might not give a meaningful result. They are necessarily complex algorithms with many rules and exceptions, because English spelling and pronunciation is complicated by historical changes in pronunciation and words borrowed from many languages. Among the best-known phonetic algorithms are:

haoawesome commented 10 years ago

"Phonetic Similarity Algorithms and Code: First Edition". Author: 好东西传送门 (Good Stuff Portal). ID: hao-2014-002. Date: 2014-09-11

Phonetic similarity algorithms

- Soundex
- Daitch–Mokotoff Soundex
- Kölner Phonetik
- Metaphone and Double Metaphone
- New York State Identification and Intelligence System (NYSIIS)
- Match Rating Approach (MRA)
- Caverphone

Implementations

- https://github.com/elasticsearch/elasticsearch-analysis-phonetic/ (Java)
- https://github.com/maros/Text-Phonetic (Perl)
- https://github.com/dotcypress/phonetics (Go)
- https://github.com/lukelex/soundcord (Ruby)
- https://github.com/Simmetrics/simmetrics (Java)
- https://github.com/oubiwann/metaphone, https://pypi.python.org/pypi/Metaphone/0.4 (Python)
- https://bitbucket.org/yougov/fuzzy, https://pypi.python.org/pypi/Fuzzy/1.0 (Python)
- https://github.com/sunlightlabs/jellyfish, https://pypi.python.org/pypi/jellyfish/0.3.2 (Python)

haoawesome commented 10 years ago

https://github.com/memect/hao/blob/master/awesome/phonetic_algorithm.md

haoawesome commented 10 years ago

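Q: @付超群 Is there an algorithm or library for computing pronunciation similarity in Chinese? For example: 北京, 百斤, 鼻颈, 背景. Even better if it can also compare English, e.g. peking, beking.

A: I have put together a #mind map# of the algorithms and open-source code. Progress on this question and related materials are at http://memect.co/TL85MEp, which also collects some related papers (including ones on Chinese). Corrections and additions are welcome. http://www.weibo.com/5220650532/BmsMAeh0K?ref=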
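For the Chinese words in the question, one possible approach is to romanize each word to pinyin and compare the romanized strings by edit distance. The sketch below is an illustration only and is not taken from any of the libraries linked in this thread: the `PINYIN` table is hand-written for the four example words, and `edit_distance` and `pinyin_similarity` are hypothetical helper names. A real system would need a full character-to-pinyin dictionary and a way to handle characters with multiple readings.

```python
# Hypothetical, hand-written pinyin table covering only the words in the question.
# Tone numbers are appended to each syllable.
PINYIN = {
    "北京": ["bei3", "jing1"],
    "百斤": ["bai3", "jin1"],
    "鼻颈": ["bi2", "jing3"],
    "背景": ["bei4", "jing3"],
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def pinyin_similarity(w1: str, w2: str, use_tones: bool = False) -> float:
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    def key(word: str) -> str:
        syllables = PINYIN[word]
        if not use_tones:
            # Assumption: ignore tones by stripping the trailing tone digit.
            syllables = [s.rstrip("12345") for s in syllables]
        return " ".join(syllables)

    a, b = key(w1), key(w2)
    return 1 - edit_distance(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    for other in ("背景", "鼻颈", "百斤"):
        print("北京", other, pinyin_similarity("北京", other))
```

With tones ignored, 北京 and 背景 romanize to the same string and score 1.0, while 鼻颈 and 百斤 score lower. Plain edit distance on letters treats every letter difference equally, so a weighting based on initials and finals would likely be closer to perceived similarity.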

haoawesome commented 10 years ago

http://www.phon.ox.ac.uk/jcoleman/PHONOLOGY1.htm Phonetics vs. Phonology

haoawesome commented 10 years ago

http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html

haoawesome commented 10 years ago

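A typical phonetic algorithm defines a set of rules that map text to some phonetic symbol system. For example, the original Soundex algorithm discards all vowels and maps b, f, p, v → 1. Pronunciation similarity is then computed by comparing the differences between the mapped symbol strings. The mind map in the original post lists common mapping algorithms for English (and German) along with related open-source code (Python, Java, Go, Ruby, Perl). http://www.weibo.com/5220650532/BmLqi92Vx?mod=weibotime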
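The map-then-compare idea can be sketched with a small Soundex implementation. This is a minimal version of the commonly described American Soundex rules, not the code of any library listed above. The `code_similarity` function is an assumption added for illustration: Soundex itself only produces codes, and how to compare them is a separate design choice.

```python
# Consonant groups used by Soundex; vowels, y, h and w get no digit.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of an English word, or "" if empty."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    code = first.upper()          # the first letter is kept as-is
    prev = CODES.get(first, "")   # its digit still suppresses an adjacent repeat
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":        # h and w do not separate repeated digits
            prev = digit
    return (code + "000")[:4]     # pad with zeros, truncate to 4 characters


def code_similarity(a: str, b: str) -> float:
    """Illustrative measure (an assumption): share of matching code positions."""
    ca, cb = soundex(a), soundex(b)
    if not ca or not cb:
        return 0.0
    return sum(x == y for x, y in zip(ca, cb)) / 4


if __name__ == "__main__":
    print(soundex("peking"), soundex("beking"))
    print(code_similarity("peking", "beking"))
```

For the English pair in the question, "peking" and "beking" get codes that differ only in the retained first letter, which shows one known weakness of Soundex: words that start with different but similar-sounding letters never receive the same code.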

guker commented 10 years ago

This is a nice resource. I happen to be stuck on something right now that I don't know how to approach.

bestFannie commented 2 years ago

Has anyone found a way to compare the pronunciation similarity of "北京" and "Peking"?

dongyuwei commented 2 years ago

I recommend https://yomguithereal.github.io/talisman/phonetics/, which provides 16 phonetic similarity algorithms. My hallelujahIM input method uses its phonex algorithm to implement fuzzy phonetic matching for English.