ropensci / av

Working with Video in R
https://docs.ropensci.org/av

av_audio_convert multiple start_time/total_time #52

Open jwijffels opened 7 months ago

jwijffels commented 7 months ago

This week I wrote audio.vadwebrtc because I needed a quick way to remove segments without voice from audio files. I need to transcribe audio files with audio.whisper, and that model hallucinates on audio segments containing only silence.

I looked at this chunk of code in package av: https://github.com/ropensci/av/blob/58d702683261d23fa7620a42aabfe776705b50a7/src/video.c#L652-L670 and it only handles a single start_time / total_time. My code to extract only the parts of an audio file that contain voice looks like this:

> library(av)
> library(audio.vadwebrtc)
> file <- system.file(package = "audio.vadwebrtc", "extdata", "test_wav.wav")
> vad <- VAD(file, mode = "normal")
> vad$vad_segments
  vad_segment start  end has_voice
1           1  0.00 0.08     FALSE
2           2  0.09 3.30      TRUE
3           3  3.31 3.71     FALSE
4           4  3.72 6.78      TRUE
5           5  6.79 6.99     FALSE
> 
> voiced <- subset(vad$vad_segments, vad$vad_segments$has_voice == TRUE)
> voiced$file <- sprintf("%s.wav", voiced$vad_segment)
> voiced
  vad_segment start  end has_voice  file
2           2  0.09 3.30      TRUE 2.wav
4           4  3.72 6.78      TRUE 4.wav
> for(i in seq_len(nrow(voiced))){
+     av_audio_convert(file, output = voiced$file[i], 
+                      start_time = voiced$start[i], 
+                      total_time = voiced$end[i] - voiced$start[i])
+ }
Output #0, wav, to 'D:\Jan\Dropbox\Work\RForgeBNOSAC\BNOSAC\audio.vadwebrtc\2.wav':
  Metadata:
    ISFT            : Lavf58.29.100
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Adding audio frame 28 at timestamp 3.42sec - audio stream completed!
Output #0, wav, to 'D:\Jan\Dropbox\Work\RForgeBNOSAC\BNOSAC\audio.vadwebrtc\4.wav':
  Metadata:
    ISFT            : Lavf58.29.100
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Adding audio frame 26 at timestamp 6.79sec - audio stream completed!
>

Would it be technically possible to allow multiple start_time/total_time values so that the segments are all combined in 1 file? Then I could write something like av_audio_convert(file, output = "test.wav", start_time = voiced$start, total_time = voiced$end - voiced$start), generating 1 output file.
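
To make the request concrete, here is a minimal base R sketch of the vectors such a call would receive. The helper name `voiced_spans` is hypothetical and not part of av or audio.vadwebrtc, and the segment table is retyped from the console output above rather than produced by `VAD()`. The vectorised `av_audio_convert()` call itself is not shown, since av does not support it at the time of writing.

```r
# Segment table retyped from the vad$vad_segments output above
# (assumption: the columns match what VAD() returns).
segments <- data.frame(
  vad_segment = 1:5,
  start       = c(0.00, 0.09, 3.31, 3.72, 6.79),
  end         = c(0.08, 3.30, 3.71, 6.78, 6.99),
  has_voice   = c(FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Hypothetical helper: keep only the voiced rows and return the
# start offset and duration (both in seconds) of each one.
voiced_spans <- function(segments) {
  voiced <- segments[segments$has_voice, , drop = FALSE]
  data.frame(
    start_time = voiced$start,
    total_time = voiced$end - voiced$start
  )
}

spans <- voiced_spans(segments)

# Length in seconds the single combined output file would have
# if all voiced spans were joined back to back.
combined_length <- sum(spans$total_time)
```

With these values, `spans$start_time` and `spans$total_time` are what I would like to pass as `start_time` and `total_time`, and `combined_length` is the duration I would expect for the resulting file.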