introlab / odas

ODAS: Open embeddeD Audition System

ReSpeaker microphone array doesn't work, and how to modify it to send the sound source direction and the separated speech to another device #55

Open chad1023 opened 6 years ago

chad1023 commented 6 years ago

Hello,

I am working on a project that uses a circular microphone array to track and recognize speech from multiple directions at the same time, so I want to modify ODAS_WEB and ODAS_CORE to send the sound source direction and the separated speech to another device.

Does ODAS support the ReSpeaker microphone array? I connected the microphone array to my computer and ran ODAS_WEB on Ubuntu, but it didn't work.

(screenshot)

BTW, which part of the code in ODAS_CORE and ODAS_WEB should I modify? Thanks

FrancoisGrondin commented 6 years ago

Yes, ReSpeaker works with ODAS. Please provide us with more details so we can help you out. I've included @GodCed in this thread too.

GodCed commented 6 years ago

Hi, to send the direction and separated speech to another device you don't need to modify any code. ODAS_CORE already supports this using sockets (that's how data is passed to ODAS_WEB).

Also, for debugging I would recommend you start by running only ODAS_CORE from the command line using the verbose (-v) option.

chad1023 commented 6 years ago

@FrancoisGrondin I run ODAS_WEB in an Ubuntu virtual machine on a Mac with the ReSpeaker Mic Array v1 (http://wiki.seeedstudio.com/ReSpeaker_Mic_Array/). The microphone array is connected to the virtual machine and I can read the raw channel data.

I tried to modify the config file "respeaker.cfg" in odas/config according to your sample:

SSL

    potential: {
        format = "json";
        interface: {
            type = "socket";
            ip = "127.0.0.1";
            port = 9001;
        };
    };

SST

    tracked: {
        format = "json";
        interface: {
            type = "socket";
            ip = "127.0.0.1";
            port = 9000;
        };
    };

SSS

    separated: {
        fS = 44100;
        hopSize = 512;
        nBits = 16;
        interface: {
            type = "socket";
            ip = "127.0.0.1";
            port = 10000;
        };
    };

    postfiltered: {
        fS = 44100;
        hopSize = 512;
        nBits = 16;
        interface: {
            type = "socket";
            ip = "127.0.0.1";
            port = 10010;
        };
    };

The output on the terminal looks like this:

(screenshot)

And I also tried my local IP <10.xxx.xx.x>, but it didn't work either.

GodCed commented 6 years ago

From what I can see, everything is all right on the ODAS_WEB side. You can see the four sockets connecting and then disconnecting, meaning ODAS and ODAS_WEB can communicate.

Now what you'll want to do is start ODAS_CORE from the command line, in a separate terminal, to see where it quits. Launch it like this:

    path/to/odas_core -c path/to/your/config -v

Also, did you change the firmware on your ReSpeaker array to have direct access to the raw audio channels?

chad1023 commented 6 years ago

Thanks for your timely reply.

We have already changed it. ODAS_WEB ran successfully after I changed the interface number in the config, but there is no active source location.

(screenshot)

On the other hand, we failed to run ODAS_CORE from the command line. It stopped at the "Threads running" state.

(screenshot)

Should we run our receiving application at the same time, so it can receive the data via the sockets?

Meanwhile, how do we run and configure the record function in ODAS_WEB? Can it record the sound from different channels?

GodCed commented 6 years ago

« Threads running » is actually the normal state. It means the system is running and processing audio.

For the active source location, it looks like a compatibility problem between your graphics driver and WebGL, as this is the only part of the interface that is based on WebGL.

Does this animation load if you access it from the VM's web browser?

GodCed commented 6 years ago

For your application, you'll want to run it instead of ODAS Web. It must implement a socket server for each dataset you want to receive from the ODAS client (which you'll launch from your app or from the command line).
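For example, a minimal server (my own sketch in Python, not part of ODAS) that accepts the connection from odascore and prints the tracked-source JSON could look like this. It listens on port 9000 to match the tracked sink shown earlier; a real application would also have to handle partial or concatenated JSON messages in the stream:

    # Minimal sketch (not ODAS code): a TCP server that accepts the
    # connection from odascore and prints the tracked-source JSON stream.
    import socket

    HOST, PORT = "127.0.0.1", 9000

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((HOST, PORT))
        server.listen(1)
        print("Waiting for ODAS to connect on port", PORT)
        conn, addr = server.accept()
        with conn:
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                print(data.decode("utf-8", errors="replace"))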

As for the record function in ODAS Web, you access it with the record dialog. You then select a workspace and check the « enable recording » checkbox.

There's no configuration; ODAS Web simply records every tracked source in two separate wav files: one for the separated audio, one for the postfiltered audio.

chad1023 commented 6 years ago

The program crashed after a few seconds of recording. Then, after restarting ODAS_WEB, it shows the error messages below and can no longer catch the signal.

(screenshots)

(I have the newest version of ODAS_WEB.)

BTW, how can we send the sound streams of the different channels (sources) with ODAS_CORE? We want to translate those different sound streams into text in real time for speech recognition in our application.

GodCed commented 6 years ago

Those error messages mean that the socket servers are still active. Try killing all ODAS and Electron related processes before re-launching ODAS Web.

The separated and postfiltered streams are raw audio streams, matching the format specified in the config file, with one channel per tracked source. A channel is silent when the matching source is inactive or not tracked. I would recommend you take a look at recording.js and audio-recorder.js in the odas_web project, as those files receive audio from the core and forward the data to the Google speech-to-text API (if enabled).
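As a rough sketch of how a receiver could handle that stream (my assumptions: 16-bit little-endian samples and 4 tracked channels, matching the separated sink shown earlier; adjust the port and channel count to your own config):

    # Sketch (not ODAS code): receive the separated raw stream and
    # de-interleave it into one sample list per tracked channel.
    import socket
    import struct

    PORT = 10000
    N_CHANNELS = 4
    FRAME_SIZE = N_CHANNELS * 2  # 2 bytes per 16-bit sample

    channels = [[] for _ in range(N_CHANNELS)]

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", PORT))
        server.listen(1)
        conn, _ = server.accept()
        with conn:
            leftover = b""
            while True:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data = leftover + chunk
                usable = len(data) - len(data) % FRAME_SIZE  # keep whole frames only
                frames, leftover = data[:usable], data[usable:]
                samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
                for i, sample in enumerate(samples):
                    channels[i % N_CHANNELS].append(sample)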

I recommend you use the separated audio, rather than the postfiltered audio, for speech-to-text, as the artifacts introduced by the postfiltering do not work well for voice recognition, since most transcribers are trained on natural voice.

chad1023 commented 6 years ago

Thank you for your detailed reply and recommendation. We will take a look at recording.js and audio-recorder.js. But we have a problem with recording: we used odas_web to test recording speech from different directions and got wav files as the result, but the wav files sound incorrect. The files have a -1.00s timestamp.

(screenshot)

Should we change any other settings in the config file?

GodCed commented 6 years ago

This is strange. What happens if you open the WAV file in other software, such as Audacity? Does the file play? Is there data in it?

You could check whether the ODAS Web config (in the config window) matches your config file regarding the format of the audio sinks.

taospartan commented 6 years ago

I've used the ReSpeaker array, but first had to remove some capacitors from the board in order to get noise off the recordings. I detected the noise by making test recordings of the 8 channels with arecord and viewing them in Audacity.

Seeed discontinued array v1 due to manufacturing issues, but is releasing v2 shortly according to their website.


chad1023 commented 6 years ago

Update: We changed the sample rate setting for separated and postfiltered and now get the correct recording result. Thanks for your help.

So again, what we want to do for our project is to separate speech from multiple sources (2-3 people in different directions) and recognize them in real time.

I am wondering why you recommend not using the postfiltered file for recognition? We listened to our recordings and found that the postfiltered one actually gives better separation. (We have already tested two algorithms, DDS and DGSS, for beamforming.) Here is the recording we got (with DGSS-MS, since it gives the better result so far), in which two speakers are standing about 90 degrees apart: result.zip

Is there any way we could improve the separation quality of the pure separation method? Meanwhile, we also noticed there is a setting called potential sources energy ranges (we currently set it from 0 to 1). Does this config affect the final result of our separation, or how could we use it to improve our separation?

GodCed commented 6 years ago

First, I recommend avoiding postfiltered audio for voice recognition because, from our testing, the artifacts introduced by the postfiltering were problematic when used with voice recognition software trained on natural voice. Your experience may vary depending on the software used, so feel free to experiment (we used Google Speech).

Secondly, you should check with @FrancoisGrondin for tips to improve separation. I also think there have been issues discussing the subject.

Finally, the "potential sources energy ranges" setting in ODAS Studio is purely for the interface. It allows you to filter which potential sources appear on the sphere. It doesn't affect the output of the ODAS system. The energy threshold for ODAS tracking can be tuned in the config file, but if your tracking is working as expected, then I would advise not to tinker with it for now.

FrancoisGrondin commented 6 years ago

I'm actually working on better separation in my research. This is the cocktail party effect, a 60-year-old research area, and so far no one has come up with "the" perfect solution (at least not in everyday environments, i.e. reverberant environments with noise). If you know both the noise and speech priors, you can rely on a complex ideal ratio mask to enhance speech, but this usually works only for one speech source competing with noise sources. When many speech sources are active simultaneously, you run into the permutation problem, and recent neural network methods also fail.

Now, as for the postfiltering vs separation streams: separation is performed as a linear combination of the observations, while postfiltering modulates the gain across frequency bins. This instantaneous modulation of the gain confuses the ASR, which has not been trained on such a dataset. The only way I believe this could be improved would be to train the backend (using Kaldi for instance) with a noisy dataset enhanced with postfiltering, in the hope that the artifacts introduced in the training set will somehow match those in the testing set. I have not tried this yet, but maybe it's worth exploring this path.
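In rough terms, with my own notation rather than the paper's:

    y(t,f)     = w(f)^H x(t,f)       separation: a fixed linear combination of the mic signals x
    s_hat(t,f) = G(t,f) * y(t,f)     postfiltering: a time-varying gain with 0 <= G(t,f) <= 1

It is this rapidly changing gain G that creates the artifacts the ASR never saw during training.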

chad1023 commented 6 years ago

We tried to use the sockets to send the data to another device according to your advice. However, there is a problem: the different data types are passed through different ports, so how can we connect them? That is, how can we associate the direction data with the separated speech? (E.g., which code should we trace?)

GodCed commented 6 years ago

The separated channels match the tracked source indices in the JSON coming from the tracking module.

So on the tracking port you have an array of sources being sent through JSON. If you want to listen to the audio of, let's say, the source at index 2 in the array, you have to use the data from channel 3 in the separated stream.
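A small sketch of that mapping (assuming the tracked messages follow the JSON format from the ODAS documentation, with a "src" array where an id of 0 marks an empty slot):

    # Sketch: map tracked sources to separated channels. Assumes messages like
    # {"timeStamp": ..., "src": [{"id": ..., "x": ..., "y": ..., "z": ..., "activity": ...}, ...]}.
    import json

    def sources_to_channels(tracked_message):
        message = json.loads(tracked_message)
        mapping = []
        for index, src in enumerate(message["src"]):
            if src["id"] != 0:  # id 0 marks an empty tracking slot
                mapping.append({
                    "channel": index,  # 0-based channel in the separated raw stream
                    "id": src["id"],
                    "direction": (src["x"], src["y"], src["z"]),
                })
        return mapping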

chad1023 commented 6 years ago

We failed to recognize speech via both the separated file and the postfiltered file, and we hope to learn more from your experience:

Which file did you choose in your video, separated or postfiltered? What's the ASR result with your separated file? (If possible, could I get your separated file?)

We just want to check whether the mic array affected the result of the separation.

Is it possible to improve the separation result with another array? (We use a ReSpeaker array with one broken microphone.)

Thank you for continuous help.

FrancoisGrondin commented 6 years ago

The separated file will definitely produce better results than the postfiltered one. The reason is that the postfiltered signal contains artifacts, which help a human to understand better, but make the ASR fail, as the dataset on which it is trained does not contain this type of artifact.

We made some tests in the lab, but for now we do not have a specific dataset we can send you to compare the ASR results. My suggestion is that you try feeding the ASR with the signal from a single channel, and then with the separated signal, and see how the two WERs compare. Speaking of WER, did you measure it? Are you using a general language model, or trying to use it for specific keywords only?
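If you need a quick way to measure WER, a rough word-level edit-distance sketch (not a standard tool, just enough to compare the single-channel and separated outputs) could be:

    # Rough WER sketch: word-level Levenshtein distance divided by reference length.
    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # Example: wer("turn on the light", "turn the lights") -> 0.5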

Also, what backend are you using for ASR? An existing API from Google, Apple, IBM, etc., or Kaldi? If you are using Kaldi, the best way to improve ASR results is to retrain the model on a dataset of sound sources that have been processed through the same decoding pipeline (i.e. separation). This way both training and testing datasets will be in the same domain.

chad1023 commented 6 years ago

Is there any paper about what you did in ODAS for the separation and postfiltering? We want to know more about why the signal ends up with artifacts after postfiltering.

BTW, is it possible to do the separation from existing channel data instead? We want to record the speech beforehand and test it with different separation modes.

We use the Google API and a general language model. The ASR results showed that it fails only in the cocktail party condition. The ASR result with the signal from a single channel is good.

FrancoisGrondin commented 6 years ago

Please see the following paper for post-filtering: https://ieeexplore.ieee.org/abstract/document/1389723/

Not sure what you mean by "do the separation from existing channel data". Can you explain?

Yes, it is expected that the cocktail party condition will lead to poor WER because, as opposed to everyday noise, the interfering speech sources in a cocktail party have similar time-frequency properties, and this probably confuses the backend DNN used for extracting features.

chad1023 commented 6 years ago

Can you also provide the paper about the separation?

ODAS core gets the live data from the mic array and does the separation. However, is it possible to do the separation offline? That is, we would record the raw data from the mic array (it will be 8-channel data) and then do the separation by reading the files.

jan4984 commented 6 years ago

@chad1023 ODAS has a file source, set by configuration.
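The raw source section of the config looks roughly like this (a sketch based on the sample config files shipped with ODAS; double-check values such as hopSize and nBits against your respeaker.cfg):

    raw: {
        fS = 16000;
        hopSize = 128;
        nBits = 16;
        nChannels = 8;
        interface: {
            type = "file";
            path = "audio.raw";
        };
    };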

chad1023 commented 6 years ago

@jan4984
I saw this in the ODAS wiki. Do you mean that? (screenshot) I have a question: the sample passes the raw mic data as a single file, "mics.raw". We recorded the mic array data as 8 wav files (one per channel). How should we convert them into one .raw file?

@FrancoisGrondin

jan4984 commented 6 years ago

The PCM data for the channels is interleaved. If you have 4 channels and L16 format, the PCM will be [channel1 signed short 1][channel2 signed short 1][channel3 signed short 1][channel4 signed short 1][channel1 signed short 2]...

chad1023 commented 6 years ago

@jan4984 Could you explain in more detail or provide a sample of the data? We still have no idea how to convert the wav files to a raw file.

jan4984 commented 6 years ago

See the Data Structure section here: http://kom.aau.dk/group/05gr506/report/node21.html
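Something along these lines should build the interleaved raw file (a hypothetical Python sketch; it assumes the 8 recordings are mono 16-bit wav files with the same sample rate and length, named ch1.wav to ch8.wav here):

    # Hypothetical sketch: interleave 8 mono 16-bit wav files into one raw
    # file. File names ch1.wav .. ch8.wav are placeholders.
    import wave

    channel_files = ["ch%d.wav" % i for i in range(1, 9)]
    readers = [wave.open(name, "rb") for name in channel_files]
    n_frames = min(r.getnframes() for r in readers)

    with open("audio.raw", "wb") as out:
        for _ in range(n_frames):
            for r in readers:
                # One 16-bit sample (2 bytes) per channel, written back to back.
                out.write(r.readframes(1))

    for r in readers:
        r.close()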

chad1023 commented 6 years ago

Following the advice from ReSpeaker, we tried to record the audio with arecord:

    arecord -v -f S16_LE -c 8 -r 16000 -t raw -Dplughw:1 audio.raw

(screenshot)

and used Ctrl+C to stop recording.

After that, we changed the config to read from the file (audio.raw).

The ODAS GUI showed that it ran successfully with the data from the file (the tracking became limited). However, when we tried the separated recording from the recording window, the separated wav files are all wrong (the files have a -1.00s timestamp).

@FrancoisGrondin What should I do to get the separated result when reading the raw data from a file?

Thanks!

akbapu14 commented 5 years ago

@GodCed You mentioned something about setting an energy threshold. I'm assuming this directly correlates to sound sources not getting tracked if they're too far away from the microphone. How do I increase the sensitivity, since in my application there are 2 people sitting about 5-6 feet away from the mic array?

GodCed commented 5 years ago

@akbapu14 There are two different energy thresholds. The first one is in ODAS Web. It selects which potential sources are displayed in the interface. It has no impact on the source tracking by ODAS. It is the slider that lives at the bottom right of the interface, in the third column.

The second one is in the ODAS configuration file. In the SST module configuration, you'll find an active and an inactive subsection, and in each you'll find mu. This value tells ODAS the energy level of an active source that it should track, and the energy level of an inactive source that it should ignore or stop tracking. By playing with those you can fine-tune the tracking module to your needs.
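In the sample config files, those subsections look roughly like this (a sketch; exact values vary per config, and a higher mu means a source needs more energy to be considered active):

    active = (
        { weight = 1.0; mu = 0.4; sigma2 = 0.0025 }
    );

    inactive = (
        { weight = 1.0; mu = 0.25; sigma2 = 0.0025 }
    );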

What I like to do is display only the potential sources in ODAS Web, then adjust the energy threshold slider so that only the interesting sources appear. The lower boundary should be moved up until garbage is hidden. Then the upper boundary should be moved down until your source of interest is displayed almost red (energy is a gradient from blue to red). Your slider values can now be used as baseline values for the mu parameters in your ODAS config file (the lower boundary being inactive and the upper being active).