unissoft-bj / ihostsvc

system services & data services on ihost

On-site recording and audio retrieval #8

Open unissoft-bj opened 9 years ago

unissoft-bj commented 9 years ago

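On-site recording process:

  1. The sales consultant turns on the app's recording function in advance (a Bluetooth mic or Bluetooth headset can be used with it).
  2. A second consultant or the front desk checks the sales consultant's recording status and, if recording should be on but is not, turns it on remotely (the sales consultant can set the time window during which remote operation is allowed).
  3. The app starts recording and splits the audio into fixed-length segments (10 to 30 seconds, configurable).
  4. The app uploads the segments to ihost, retransmitting on error.
  5. ihost joins the segments until they reach a certain duration (set by a parameter; the duration depends on the content to be retrieved). The result becomes an audio material and enters the retrieval process.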
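A minimal sketch of step 5 on the ihost side, assuming each uploaded segment carries a sequence number and a duration (the `Segment` fields and the `MaterialAssembler` class are hypothetical names, not part of ihost). It ignores duplicates caused by retransmission, holds out-of-order segments, and emits a material once the configured duration is reached. Joining raw bytes like this is only valid for headerless PCM; container formats would need real audio concatenation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Segment:
    seq: int          # upload order assigned by the app (assumed field)
    duration: float   # seconds, 10 to 30 per the recording settings
    data: bytes       # raw audio payload (assumed headerless PCM)


@dataclass
class MaterialAssembler:
    target_seconds: float  # configurable material length
    _pending: List[Segment] = field(default_factory=list)
    _held: Dict[int, Segment] = field(default_factory=dict)
    _next_seq: int = 0

    def add(self, seg: Segment) -> Optional[bytes]:
        """Accept one uploaded segment; return a material once long enough."""
        if seg.seq < self._next_seq or seg.seq in self._held:
            return None  # duplicate from a retransmission, ignore it
        self._held[seg.seq] = seg
        # Move contiguous segments into the pending run, in upload order.
        while self._next_seq in self._held:
            self._pending.append(self._held.pop(self._next_seq))
            self._next_seq += 1
        if sum(s.duration for s in self._pending) >= self.target_seconds:
            material = b"".join(s.data for s in self._pending)
            self._pending = []
            return material
        return None
```

Calling `add()` for every upload keeps the assembler stateless from the app's point of view: the app only retries failed uploads, and ihost decides when a material is complete.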

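Audio retrieval:

  1. Build a target library for each sales consultant. The library starts out empty; clips are then cut by hand from actual recordings and added to it as targets. This can be repeated many times to train the retrieval algorithm.
  2. A subset of the target library can be selected as the active target library.
  3. Search the audio material for items in the active target library.
  4. Typically, two items are found in an audio material: one at the beginning, representing the previous question, followed by its answer, and then the next question. Set two pointers: pointer a marks the end of the first question, and the search starts after pointer a; pointer b marks the start of the second question.
  5. When the second question is found, cut at pointer b and save the part before it; the part after pointer b keeps waiting for new segments to arrive.
  6. Repeated questions (the same question detected several times less than 3 seconds apart) are, for simplicity, treated as multiple questions.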
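The two-pointer cut in steps 4 and 5 can be sketched as follows, assuming the retrieval step returns hits as (target id, start, end) in seconds from the start of the material. `Hit`, `Cut`, and `find_cut` are hypothetical names for illustration. Following step 6, repeated hits are not merged.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    item_id: str   # which target clip matched
    start: float   # seconds from the beginning of the material
    end: float


class Cut(NamedTuple):
    question: str      # target matched at the head of the material
    pointer_a: float   # end of the first question
    pointer_b: float   # start of the next question, where the cut happens


def find_cut(hits: List[Hit]) -> Optional[Cut]:
    """Return the cut point, or None while the second question is missing."""
    ordered = sorted(hits, key=lambda h: h.start)
    if not ordered:
        return None
    first = ordered[0]
    pointer_a = first.end
    # Only look for the next question after pointer a.
    for hit in ordered[1:]:
        if hit.start >= pointer_a:
            return Cut(first.item_id, pointer_a, hit.start)
    return None
```

When `find_cut` returns a `Cut`, the audio before `pointer_b` is saved as one question-and-answer clip and the audio from `pointer_b` onward stays in the buffer; when it returns `None`, the material simply waits for more segments.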

Sorted recordings should be labeled with category, start time, and duration.
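One possible shape for that label, as a sketch only: the `SortedClip` class and its field names are assumptions, and the issue does not specify a storage format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SortedClip:
    category: str      # e.g. the id of the matched question (assumed meaning)
    start_time: str    # ISO 8601 timestamp of the clip start
    duration_s: float  # clip length in seconds


def to_json(clip: SortedClip) -> str:
    """Serialize the label so it can be stored next to the audio file."""
    return json.dumps(asdict(clip), ensure_ascii=False, sort_keys=True)
```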

michaelyin commented 9 years ago

Baidu's API: supported audio formats are pcm (uncompressed), wav, opus, speex, amr, and x-flac. The raw audio data must be base64-encoded.

It provides a speech recognition service as a REST API.
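A rough sketch of preparing audio for such a REST call. Only the format list and the base64 requirement come from the note above; the JSON field names (`audio_format`, `byte_length`, `audio_base64`) are placeholders, not Baidu's actual parameter names, so check the official API documentation before using it.

```python
import base64
import json

# Formats listed in the note above.
SUPPORTED_FORMATS = {"pcm", "wav", "opus", "speex", "amr", "x-flac"}


def build_request_body(audio: bytes, audio_format: str) -> str:
    """Base64-encode the raw audio and wrap it in a JSON body."""
    if audio_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {audio_format}")
    body = {
        "audio_format": audio_format,  # placeholder field name
        "byte_length": len(audio),     # placeholder field name
        "audio_base64": base64.b64encode(audio).decode("ascii"),
    }
    return json.dumps(body)
```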

michaelyin commented 9 years ago

An open source project that might solve the problem; it still needs testing: https://github.com/worldveil/dejavu

unissoft-bj commented 9 years ago

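Difficulties of audio retrieval in real-world 4S dealerships

  1. Same intent -> varied content: the same question comes out in many different wordings (scripts).
  2. Same content -> varied delivery: the same wording comes out with different intonation, rhythm, and stress.

We may need to search with multiple keywords + patterns. For a particular sales consultant, the way he words and performs a given question for different listeners could, after a period of refinement, settle into a few fixed patterns, and that might give decent retrieval results.

I think the demo version at the end of March should only do full-session recording, with no retrieval. We would then study retrieval using the Tangshan user as a sample. Would that be more realistic?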
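To make the "multiple keywords + pattern" idea concrete, here is a small sketch. It assumes a text transcript is already available from some speech recognizer, which is not settled in this thread, and the intents and keyword lists are made-up examples. Each intent has several wording variants; a variant matches when its keywords appear in order.

```python
import re
from typing import Dict, List, Optional


def compile_variant(keywords: List[str]) -> re.Pattern:
    """Keywords must appear in this order, with anything in between."""
    return re.compile(".*?".join(re.escape(k) for k in keywords))


def classify(transcript: str, patterns: Dict[str, List[List[str]]]) -> Optional[str]:
    """Return the first intent with a matching wording variant, else None."""
    for intent, variants in patterns.items():
        for keywords in variants:
            if compile_variant(keywords).search(transcript):
                return intent
    return None


# Hypothetical per-consultant patterns, refined over time from real recordings.
PATTERNS = {
    "budget": [["budget"], ["how much", "spend"]],
    "trade_in": [["trade", "in"], ["current", "car"]],
}
```

For example, `classify("how much are you planning to spend?", PATTERNS)` returns `"budget"`, while a sentence matching no variant returns `None`.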

unissoft-bj commented 9 years ago

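Another idea, if voiceprint-based speaker identification is the more mature technology: as a first step, split the whole recording by speaker. Short bursts of background noise probably matter little. What about the effect of continuous background noise?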

michaelyin commented 9 years ago

https://laplacian.wordpress.com/2009/01/10/how-shazam-works/

audio fingerprinting algorithm
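A toy sketch of the hashing and matching stage of a Shazam-style fingerprint, as I understand it from the article. It assumes spectrogram peaks have already been extracted as (time frame, frequency bin) pairs; peak extraction is left out, and all function names here are illustrative. Each peak is paired with the next few peaks, the hash is (f1, f2, time delta), and a match is the target with the most hashes agreeing on one time offset.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Peak = Tuple[int, int]            # (time frame, frequency bin)
PairHash = Tuple[int, int, int]   # (f1, f2, time delta)


def fingerprint(peaks: List[Peak], fan_out: int = 3) -> List[Tuple[PairHash, int]]:
    """Pair each peak with the next fan_out peaks; keep the anchor time."""
    ordered = sorted(peaks)
    hashes = []
    for i, (t1, f1) in enumerate(ordered):
        for t2, f2 in ordered[i + 1 : i + 1 + fan_out]:
            hashes.append(((f1, f2, t2 - t1), t1))
    return hashes


def build_index(targets: Dict[str, List[Peak]]) -> Dict[PairHash, List[Tuple[str, int]]]:
    """Map every hash to the targets (and anchor times) it occurs in."""
    index = defaultdict(list)
    for name, peaks in targets.items():
        for h, t in fingerprint(peaks):
            index[h].append((name, t))
    return index


def best_match(index, sample_peaks: List[Peak]) -> Optional[Tuple[str, int, int]]:
    """Vote on (target, time offset); return the winner and its vote count."""
    votes = Counter()
    for h, t_sample in fingerprint(sample_peaks):
        for name, t_target in index.get(h, []):
            votes[(name, t_sample - t_target)] += 1
    if not votes:
        return None
    (name, offset), count = votes.most_common(1)[0]
    return name, offset, count
```

The offset in the result is where the target starts inside the sample, which is the kind of position the pointer a / pointer b scheme needs.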

michaelyin commented 9 years ago

Streaming audio message to a Java server on Windows 8:

  1. Open the firewall on Windows, for example open port 8888 for UDP traffic;
  2. Make sure the Android phone is on the same wireless network as the Java server.
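For a quick connectivity check before involving the phone, a UDP receiver and sender can be sketched like this. It is in Python rather than Java, purely as an illustration; port 8888 comes from the note above and everything else (function names, buffer size, timeout) is an assumption. UDP gives no delivery or ordering guarantee, so a real uploader would still need its own retransmission.

```python
import socket
from typing import List, Tuple


def open_receiver(host: str = "0.0.0.0", port: int = 8888) -> socket.socket:
    """Bind a UDP socket; the firewall must allow this port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_chunks(sock: socket.socket, count: int, timeout: float = 2.0) -> List[bytes]:
    """Read `count` datagrams; raises socket.timeout if none arrive in time."""
    sock.settimeout(timeout)
    chunks = []
    for _ in range(count):
        data, _addr = sock.recvfrom(2048)
        chunks.append(data)
    return chunks


def send_chunks(addr: Tuple[str, int], chunks: List[bytes]) -> None:
    """Send each audio chunk as one datagram to the server address."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        for chunk in chunks:
            s.sendto(chunk, addr)
```

If `receive_chunks` times out while the sender reports no error, the firewall rule or the network (phone and server on different Wi-Fi networks) is the first thing to check.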

michaelyin commented 9 years ago

speech recognition: http://ibillxia.github.io/blog/2012/11/24/several-plantforms-on-audio-and-speech-signal-processing/

CMU Sphinx has a Mandarin Chinese language model: http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Mandarin%20Language%20Model/

http://stackoverflow.com/questions/25504310/cmusphinx-support-for-other-languages

CMU Sphinx speech recognition study notes: http://www.chenwang.net/2013/11/21/cmu-sphinx-%E8%AF%AD%E9%9F%B3%E8%AF%86%E5%88%AB%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0%EF%BC%881%EF%BC%89/

Installing PocketSphinx with GStreamer and Python on Cubieboard (Cubie Debian) for real-time speech recognition: http://www.chenwang.net/2013/12/22/cubieboard-cubie-debian-%E4%B8%8B%E5%AE%89%E8%A3%85-pocketsphinx-with-gstreamer-and-python-%E5%AE%9E%E7%8E%B0%E5%AE%9E%E6%97%B6%E8%AF%AD%E9%9F%B3%E8%AF%86%E5%88%AB/

michaelyin commented 9 years ago

Overview of audio mining: http://leavcom.com/articles/ieee_oct02.htm

"Virage's internal testing indicates a 5 to 20 percent error rate for processing news broadcasts and a 30 to 60 percent error rate for processing other content types."

michaelyin commented 9 years ago

http://kaldi.sourceforge.net/index.html

It has an audio indexing function.

https://smartech.gatech.edu/bitstream/handle/1853/52987/WENG-DISSERTATION-2014.pdf

michaelyin commented 9 years ago

Comparison of different recognition software: http://www.mico-project.eu/experiences-from-development-with-open-source-speech-recognition-libraries/

michaelyin commented 9 years ago

"Compared to the other recognizers, the outstanding performance of Kaldi can be seen as a revolution in open-source speech recognition technology. The system features next to all state-of-the-art techniques discussed in literature including LDA, MLLT, SAT, fMLLR, fMMI, and DNNs out of the box. Even being no expert, the provided recipes and scripts enable the usage of all these techniques to the user in short time." http://suendermann.com/su/pdf/oasis2014.pdf

michaelyin commented 9 years ago

http://staffhome.ecm.uwa.edu.au/~00014742/research/speech/local/Kaldi/kaldi-notes.html