pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Add support for file handle to pyannote.audio.core.io.Audio #564

Closed hbredin closed 3 years ago

hbredin commented 3 years ago

This is not currently supported:

from pyannote.audio.core.io import Audio
from pyannote.core import Segment
audio = Audio()
with open('file.wav', 'rb') as f:
    waveform, sample_rate = audio(f)
with open('file.wav', 'rb') as f:
    waveform, sample_rate = audio.crop(f, Segment(10, 20))

One has to do this instead:

from pyannote.audio.core.io import Audio
from pyannote.core import Segment
audio = Audio()
waveform, sample_rate = audio('file.wav')
waveform, sample_rate = audio.crop('file.wav', Segment(10, 20))

This is a limitation that might be problematic (e.g. with streamlit.file_uploader that returns a file handle)
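Until file handles are supported, one possible workaround for the streamlit.file_uploader case is to spill the uploaded bytes into a temporary file and pass its path, since a path is the only input currently accepted. This is just a sketch; `waveform_from_bytes` and the `loader` callable are illustrative names, not pyannote API:

```python
import tempfile

def waveform_from_bytes(audio_bytes, loader):
    # loader would be e.g. a pyannote Audio() instance; here it only
    # needs to accept a file path (the only supported input today).
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(audio_bytes)
        tmp.flush()  # make sure the bytes hit disk before loading
        return loader(tmp.name)
```

The extra disk round-trip is exactly the I/O this issue is about avoiding, so native file-handle support would still be preferable.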

hbredin commented 3 years ago

This might also be a good time to switch to upcoming torchaudio default sox_io backend.

cc @mogwai wanna have a look?

hbredin commented 3 years ago

Related to https://github.com/pytorch/audio/pull/1158

hbredin commented 3 years ago

torchaudio nightly seems to support this already:

pip install --pre torchaudio -f https://download.pytorch.org/whl/nightly/torch_nightly.html

mogwai commented 3 years ago

Do you want to wait to do this when the torchaudio version comes out where sox_io is the default backend?

hbredin commented 3 years ago

sox_io is already available in 0.7 so I guess we do not have to wait to make the switch.

mogwai commented 3 years ago

It becomes the default in torchaudio 0.8.0 on Linux and macOS. I remember soundfile being faster in my benchmarks, though maybe that is no longer true in newer versions.

mogwai commented 3 years ago

We're back to soundfile for the increased speeds.

So essentially you want to be able to read from streams with the io module?

hbredin commented 3 years ago

Yes.

mogwai commented 3 years ago

This will be supported in torchaudio 0.8.0, which should be released fairly soon. In the meantime you can install the nightly build:

pip install --upgrade --pre torchaudio -f https://download.pytorch.org/whl/nightly/torch_nightly.html
sudo apt install libncurses5

Then you can use this to test it:

import torchaudio
torchaudio.set_audio_backend("soundfile")
torchaudio.USE_SOUNDFILE_LEGACY_INTERFACE = False
with open('tests/data/dev00.wav', 'rb') as f:
    wav, sr = torchaudio.load(f)

print(wav.shape, sr)
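For the `crop` case, a segment expressed in seconds would still need to be translated into `torchaudio.load`'s `frame_offset` / `num_frames` arguments. A minimal sketch of that conversion (assuming those parameters also behave correctly with file-like objects, which I have not verified):

```python
def segment_to_frames(start, end, sample_rate):
    # Convert a (start, end) segment in seconds into the frame offset
    # and frame count expected by torchaudio.load.
    frame_offset = round(start * sample_rate)
    num_frames = round(end * sample_rate) - frame_offset
    return frame_offset, num_frames

# e.g. for Segment(10, 20) at 16 kHz:
# offset, n = segment_to_frames(10, 20, 16000)
# with open('tests/data/dev00.wav', 'rb') as f:
#     wav, sr = torchaudio.load(f, frame_offset=offset, num_frames=n)
```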
hbredin commented 3 years ago

Thanks for looking into this.

However, I think (but maybe I am wrong) that switching to torchaudio 0.8.0 will not be enough. We also have to support file-like objects everywhere AudioFile is used (and in pyannote.audio.core.io in particular).

mogwai commented 3 years ago

We should be able to implement this with torchaudio 0.8.0. It might mean breaking changes to the AudioFile io API. I found these relevant PRs/issues:

https://github.com/pytorch/audio/issues/1072 https://github.com/pytorch/audio/pull/1158

mthrok commented 3 years ago

Hi

I am trying to push the file-like object support included in 0.8.0 release, but some tests are failing randomly and I am still trying to figure out why. I am keeping the record https://github.com/pytorch/audio/issues/1229.

The thing is that, for some audio formats, decoding finishes before all the available data is loaded, so it does not return the expected number of frames. This kind of bug is hard for end users to notice, yet it could have a devastating impact (like producing wrong evaluation numbers).

So there is a good chance that I have to remove the file-like object support from 0.8.0.

mthrok commented 3 years ago

Regarding https://github.com/pytorch/audio/issues/1072, it is more like survey as we want to do something towards streaming support but we are not sure where to start. If you have a thought, feel free to leave your comments there. We appreciate any feedback.

hbredin commented 3 years ago

Thanks @mthrok for letting us know.

mthrok commented 3 years ago

Update on the file-like object support. I resolved the issue, so the feature will be included in 0.8.0 release, which is scheduled to happen the next week.

hbredin commented 3 years ago

Awesome. Thanks for the update @mthrok!

mogwai commented 3 years ago

torchaudio 0.8.0 is released, so this can be implemented.

mthrok commented 3 years ago

Great. Let me know if you need help or find a bug on the torchaudio side.

karthikgali commented 3 years ago

Hi,

I am trying to calculate embeddings of a given audio using the XVectorMFCC VoxCeleb model (https://huggingface.co/hbredin/SpeakerEmbedding-XVectorMFCC-VoxCeleb). My input is an mp3 byte-encoded string. Since the embedding model takes only wav input, I converted it into a wav BytesIO object. However, the model is not able to take the wav BytesIO object. If I write the BytesIO wav to a file and provide it as input to the model, it is able to produce embeddings (I am trying to avoid the I/O operation here).

Could anyone help me with this? I am providing packages installed and code used.

Requirements:

(pyannote) sh-4.2$ pip freeze | grep torch
pytorch-lightning==1.2.7
pytorch-metric-learning==0.9.98
torch==1.8.1
torch-audiomentations==0.6.0
torchaudio==0.8.1
torchmetrics==0.2.0
torchvision==0.9.1
(pyannote) sh-4.2$ pip freeze | grep pyannote
pyannote.audio @ https://github.com/pyannote/pyannote-audio/archive/develop.zip
pyannote.core==4.1
pyannote.database==4.1
pyannote.metrics==3.0.1
pyannote.pipeline==2.0

Code:

import ast
import io

from pydub import AudioSegment
from pyannote.audio import Inference

model = Inference("hbredin/SpeakerEmbedding-XVectorMFCC-VoxCeleb", device="cpu", window="whole")

audio_bytes = ast.literal_eval(data)
aud = AudioSegment.from_mp3(io.BytesIO(audio_bytes))

outputStream = io.BytesIO()
aud.set_frame_rate(16000)[:5000].export(outputStream, format="wav")

embeddings = model(outputStream)  # Failing here; it works if I write the wav to a file and provide it as input.
print(embeddings[0])

hbredin commented 3 years ago

Closing this issue as PR #640 has just been merged.

@karthikgali, as long as soundfile supports mp3 (I did not check), it means that you can now do something like:

from pyannote.audio import Inference
inference = Inference("hbredin/SpeakerEmbedding-XVectorMFCC-VoxCeleb", window="whole")
from pyannote.core import Segment
with open('audio.mp3', 'rb') as fp:
    embedding = inference.crop(fp, Segment(3, 5))
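One common gotcha with the BytesIO route in the code above (a hedged guess at what is failing there): pydub's export() leaves the stream position at the end of the buffer, so the stream may need to be rewound to byte 0 before being handed to any reader. A minimal sketch, with `rewound` as an illustrative helper name:

```python
import io

def rewound(stream):
    # Reset the stream position so the next reader starts at byte 0.
    stream.seek(0)
    return stream

buf = io.BytesIO()
buf.write(b"fake-wav-bytes")  # stands in for pydub's export(buf, format="wav")
# embeddings = model(rewound(buf))  # model being pyannote's Inference
```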