Closed: DaddyWesker closed this issue 3 years ago
Hi
Quick and dirty: yes. If you can be bothered creating annotations in the same format as the AVA videos, you can recreate the directory structure of the "download" directory and put your files in there, mimicking an AVA video. Then you can generate the model performance metrics on that.
If you're thinking of doing slightly more interesting things, like attempting to use it as a (live) video classifier, then it requires a bit of code modification. I will sketch out roughly what I think needs to be done to get it going for that purpose.
Data
To prepare your data pipeline, you will need to extract the video frames and the MFCCs from the audio, as you guessed. This needs to be at the frame rates and sampling rates your model expects. If you're trying to do it on live video/audio, then you probably want some kind of sliding window over the frames/audio to grab the frames and MFCCs. See https://github.com/tuanchien/asd/blob/master/extract.py for an idea of what needs to be done.
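For the live case, the sliding window can be kept with a bounded deque. A minimal sketch, assuming a 5-frame window (the actual window length comes from your config) and using plain integers as stand-ins for frames:

```python
from collections import deque

NUM_FRAMES = 5  # assumed model window length in video frames

window = deque(maxlen=NUM_FRAMES)
clips = []
for frame in range(10):  # stand-in for frames read from a live stream
    window.append(frame)
    if len(window) == NUM_FRAMES:
        # Most recent NUM_FRAMES frames, oldest first. In a real pipeline
        # you would also slice the matching audio span, extract MFCCs,
        # and feed both to the model here.
        clips.append(list(window))
```

The deque automatically drops the oldest frame once it is full, so each iteration yields a window shifted by one frame.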
Predict
Once you have your video frames and MFCCs, you need to put them in the right input format. This is basically going to be something like a list [video, audio] where video and audio are numpy arrays of the frames and MFCCs. See https://github.com/tuanchien/asd/blob/190c1c6d155b16a27717596d6350598e5cd4ffac/ava_asd/generator.py#L67
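As a rough illustration of that input format. The shapes here are assumptions taken from later in this thread (5 grayscale 100x100 face crops and a 13x20 MFCC window per sample), so check them against your own config:

```python
import numpy as np

# Batch of one sample; the trailing dims are channel axes.
video = np.zeros((1, 5, 100, 100, 1), dtype=np.float32)  # frame crops
audio = np.zeros((1, 13, 20, 1), dtype=np.float32)       # MFCC window

model_input = [audio, video]  # audio-first order, as used later in this thread
```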
Next, you want to load the keras model and call the predict
api.
See https://keras.io/api/models/model_training_apis/#predict-method
Basically you load the model, call predict with the input, and it will spit out a list of predictions for each label, e.g., (0.3, 0.7), representing their scores for each class (speaking, not speaking). Those numbers should add to 1.
The examples in this project are roughly doing this, but to evaluate against other data. See https://github.com/tuanchien/asd/blob/190c1c6d155b16a27717596d6350598e5cd4ffac/evaluate.py#L56
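Interpreting the output described above might look like this (dummy scores, not a real prediction; the class order follows the (speaking, not speaking) convention mentioned above):

```python
import numpy as np

# Hypothetical output of model.predict(): one score pair per sample
# in the batch, each pair summing to ~1.
preds = np.array([[0.3, 0.7],
                  [0.9, 0.1]])

labels = preds.argmax(axis=1)  # index of the winning class per sample
```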
Hope this helps!
@tuanchien thanks for your reply. I'll see what I can do following your idea. Hopefully I'll be able to run inference with my own data.
Could you also point me to the AVA file format I should follow to mimic AVA data with my own frames?
Never mind, I've found the AVA paper with the CSV file structure. I guess I'll try to handle this. I'll post an update later.
@tuanchien hello again. I was able to use your extract.py to generate all the necessary files for my video. However, there are two questions I'd like to ask:
In evaluate.py there are three outputs from the model: audio_pred, video_pred, main_pred. Am I right in thinking that audio_pred is the prediction of who is speaking based on the audio channel only, video_pred is the same but based on video only, and main_pred is the prediction based on the combination of audio and video?
This one is kind of tricky. Since I want to evaluate the full video, I tried to make annotations for every frame. So when launching the annotation extraction, I commented out delete_small_tracks and random.shuffle(data). But I wasn't able to launch evaluate with those annotations, because I was getting start = -13 in generator.py, in the insert_audio function. If I leave delete_small_tracks in, it evaluates with no problem. Could this be related to the 10 fps setting in the config? Do I need to set it to 1 fps instead and relaunch all the extractions?
Thanks in advance.
P.S.: I've got a third question. The model returns predictions with shape (16, 2). What does the 2 mean? The probability of the first person speaking and the second person speaking? (I have two track_ids.) And 16, I guess, is the batch size. The problem is that I need to draw a rectangle around the speaking person, so I need to correlate the output with my video data. Is there a way to get the image path for each output? I can't step into generator.py with a debugger, since it is used by the keras model, as far as I know.
@tuanchien
You need both audio and video data for this particular model to work. If you just had audio or just video you would want to use something simpler. You can for example build a model with just the video part of the model, or just the audio part of the model.
What I mean by extra footage is, if you have extra footage before frame 1 in your mp4 that you cropped out, you could stick that back in, and then start labelling from what used to be frame 1.
If you don't, then you could put in blank frames at the start if you want, but the predictions will almost certainly be terrible for those frames at the start.
Well, about audio/video only: I meant that if some particular frame sequence is too noisy (for example, some external noise from a drill behind the wall), we could at least see what the model outputs in video_pred. Or couldn't we?
Well, I have a 6 minute video, and I've extracted frames and audio from it from start to end. Though there is some sort of step, since not every frame was taken from the video. I guess it is the fps option in the config file.
I meant not blank frames, but copies of the first frame (though how many?)
You could look at those video/audio only outputs sure. They might give an indication but I wouldn't rely on it as an accurate indication of what's going on. Would pay to manually inspect the problematic parts.
There is a parameter called ann_stride in the config.yaml that indicates how many frames to skip when generating annotations. This effectively controls the frame skip.
I would recommend just padding out the beginning by say 3s worth of frames and audio (conservative estimate). For each video frame, the audio feature extraction needs enough previous audio data to calculate the feature. https://github.com/tuanchien/asd/blob/master/ava_asd/generator.py#L205
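A minimal sketch of that padding, assuming the clip is a list of frames plus a 1-D PCM array. The function name and the fps/sample-rate defaults are placeholders, not values from the repo's config:

```python
import numpy as np

FPS = 10           # assumed video frame rate
SR = 16000         # assumed audio sampling rate (Hz)
PAD_SECONDS = 3    # the conservative estimate suggested above

def pad_start(frames, pcm, fps=FPS, sr=SR, seconds=PAD_SECONDS):
    """Pad the start of a clip by repeating the first frame and
    prepending silence, so the earliest video frames have enough
    preceding audio for MFCC extraction."""
    n_pad = fps * seconds
    pad_frames = [frames[0]] * n_pad  # or blank frames, as discussed
    silence = np.zeros(sr * seconds, dtype=pcm.dtype)
    return pad_frames + list(frames), np.concatenate([silence, pcm])
```

As noted above, predictions over the padded region will not be meaningful; the point is only to give the real frames enough audio history.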
Alright, thanks for the answers. I was able to launch the model on several videos, though I needed to preprocess them using your extract code. The question is: how can I avoid that? I need to launch it in a near real-time situation, where we don't have the whole video. As I understand it, the model takes 5 frames resized to 100x100 as input (I checked that using cv2.imwrite on each frame in the input), but I don't get what the audio input is. It has dimensions (13, 20) (not counting the 16, which is the batch size, and the trailing 1). How does it correlate with the 5 frames? As I see in your code, 20 is the sequence length of the audio input (like 5 for the video frames). Does this mean 4 audio frames per video frame? Sorry if I'm bothering you too much =) From your last words about "audio feature extraction needs enough previous audio data to calculate the feature", it seems I'm right that each video frame needs several audio frames. But the question about the 13 is still there. Is it some kind of resize?
Never mind, I've checked the MFCC extraction code and your paper again. The 13 comes from librosa.feature.mfcc. As I understand it, it is a preprocessing step applied to the raw audio input. Now I need to preprocess my audio input the same way each time I want to launch the model.
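To make the frame/MFCC alignment concrete: with 4 MFCC columns per video frame and a 5-frame window (numbers taken from this discussion, not verified against the repo's config), the audio slice for one window would be:

```python
import numpy as np

n_mfcc, cols_per_frame, num_frames = 13, 4, 5
mfccs = np.zeros((n_mfcc, 1000))  # dummy (n_mfcc, time) MFCC matrix

start = 0  # MFCC column aligned with the window's first video frame
audio_window = mfccs[:, start:start + num_frames * cols_per_frame]
# -> shape (13, 20): 13 coefficients per step, 4 steps per video frame.
```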
Well, I've hit some problems while trying to emulate real-time usage of your model. It would be really great if you could look at this code and tell me if I'm wrong somewhere:
```python
video_in = cv2.VideoCapture("/mnt/fileserver/shared/datasets/AVA_v2.2/my_data/breaking_news.mp4")
csv_in = open("/mnt/fileserver/shared/datasets/AVA_v2.2/my_data/breaking_news_faces.mp4.csv", "r")
pcm, sr = librosa.load("/mnt/fileserver/shared/datasets/AVA_v2.2/my_data/breaking_news.mp4.wav", sr=None)
csv_in.readline()  # skip the CSV header

fps = video_in.get(cv2.CAP_PROP_FPS)
spf = 1.0 / fps             # seconds per video frame
num_of_frames = 5           # video frames per model input window
audio_win_size = spf / 4.0  # 4 MFCC columns per video frame
audio_stride = 0.001
sample_stride = int(sr * audio_stride)
window = int(sr * audio_win_size)
nmfcc = 13
apply_mean = True
apply_stddev = True

mfccs = librosa.feature.mfcc(pcm, sr, n_mfcc=nmfcc, n_fft=window, hop_length=sample_stride)
normalised_mfccs, mean, stddev = normalise_mfccs(mfccs, apply_mean=apply_mean, apply_stddev=apply_stddev)

continue_flag = True
count = 0
while continue_flag:
    frames = []
    video_frames = []
    bbs = []
    for _ in range(num_of_frames):
        ret, frame = video_in.read()
        if not ret:
            continue_flag = False
            break
        frames.append(frame)
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        line = get_line_with_id(csv_in, 0)
        x1 = int(line[2])
        y1 = int(line[3])
        x2 = x1 + int(line[4])
        y2 = y1 + int(line[5])
        crop = frame[y1:y2, x1:x2]
        crop = cv2.resize(crop, (100, 100))
        video_frames.append(crop)
    if not continue_flag:
        break
    video_frames = np.asarray(video_frames)
    video_frames = np.expand_dims(video_frames, [0, 4])
    audio_frames = normalised_mfccs[:, count:count + (num_of_frames * 4)]
    audio_frames = np.expand_dims(audio_frames, [0, 3])
    count += num_of_frames * 4
    _audio_pred, y_video_pred, y_main_pred = model.predict([audio_frames, video_frames], verbose=1)
```
The function get_line_with_id gives me the next line with speaker id 0, so I'm sending only one speaker to the model each time. I'm trying to get 5 frames and the corresponding audio (4 audio frames for each video frame). The problem is that the model always returns that the speaker is speaking. I'm trying to find where I'm wrong, but it would be great if you could assist. Thanks in advance.
To start with, I suggest double-checking that all the parameters match what your model expects (e.g., the fps) and that the functions you call return what you expect. You may want to visualise the video frames and audio array to see if they match what your annotation says.
Your while loop is also effectively skipping 5 frames each iteration. Make sure that is the behaviour you want.
Good luck debugging.
Hello again. I was on holiday, so I've only just returned to this task.
The problem in that code was that my video_frames were int32 in [0, 255], not float32 in [0, 1]. After fixing that, I could get some results. But I still have some questions about the audio frames, specifically about two parameters in your config file: mfcc_window_size and stride.
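For anyone hitting the same issue, the fix amounts to something like this (the helper name is mine; it assumes pixel values stored as integers in [0, 255]):

```python
import numpy as np

def normalise_frames(video_frames):
    """Convert integer pixels in [0, 255] to float32 in [0, 1],
    matching the range the model was trained on."""
    return np.asarray(video_frames, dtype=np.float32) / 255.0
```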
See https://github.com/tuanchien/asd/blob/190c1c6d155b16a27717596d6350598e5cd4ffac/extract.py#L137
The window size corresponds to n_fft and the stride to the hop_length of librosa.feature.mfcc. The relevant documentation is at https://librosa.org/doc/latest/generated/librosa.feature.mfcc.html?highlight=mfcc#librosa.feature.mfcc
Well, unfortunately there is no information on n_fft and hop_length on the page in your comment. Some info can be found here: http://man.hubwiz.com/docset/LibROSA.docset/Contents/Resources/Documents/generated/librosa.core.stft.html But it is still not clear.
So it seems my guess on how to get the exact value of mfcc_window_size is wrong? My main question was why you set those particular numbers for stride and mfcc_win_size, and how you calculated them. Oh well, I guess I'll just try out some ideas with different values.
The parameters came from https://arxiv.org/abs/1906.10555 You can experiment with changing those values.
It seems the parameters depend on which sampling rate you use. The MFCC window size is the time window multiplied by the sampling rate. Here is another paper's implementation for reference:
https://github.com/Rudrabha/Wav2Lip/blob/deeec76ee8dba10cad6ef133e068659faf707f1e/hparams.py#L43
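In other words, the sample-based librosa parameters fall out of time-based settings like this. The 25 ms / 10 ms values are common speech-processing defaults used here for illustration, not this repo's config:

```python
SR = 16000            # assumed audio sampling rate (Hz)
win_seconds = 0.025   # MFCC window length in seconds
hop_seconds = 0.010   # stride between windows in seconds

n_fft = int(win_seconds * SR)       # window size in samples
hop_length = int(hop_seconds * SR)  # hop size in samples
```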
Thank you, @eddiecong. That clarifies my question.
Never mind, @DaddyWesker. I am working on the same task as you, writing inference code for this model in real time. I will share my inference code when I'm finished; hopefully we can create a PR with an inference function.
@eddiecong Well, I've more or less finished the inference I needed. My code currently is:
I've currently set the same "window" value for both n_fft=window and hop_length=window, and I'm getting exactly the number of features I need for the video.
Hello!
Thanks for sharing your model! Can you tell me, is it possible to run inference on my own mp4 file? I guess I need to extract frames and audio from it as you do with the downloaded data. Is that right, or is there another way?