Closed: DaddyWesker closed this issue 3 years ago
Hi
Quick and dirty: yes. If you can be bothered creating annotations in the same format as the AVA videos, you can recreate the directory structure of the "download" directory and put your files in there, mimicking an AVA video. Then you can generate the model performance metrics on that.
If you're thinking of doing slightly more interesting things, like attempting to use it as a (live) video classifier, then it requires a bit of code modification. I will sketch out roughly what I think needs to be done to get it going for that purpose.
Data
To prepare your data pipeline, you will need to extract the video frames and the MFCCs from the audio, as you guessed. This needs to be at the frame rates and sampling rates your model expects. If you're trying to do it on live video/audio, then you probably want some kind of sliding window over the frames/audio to grab the frames and MFCCs. See https://github.com/tuanchien/asd/blob/master/extract.py for an idea of what needs to be done.
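For the live case, the sliding window can be kept with a bounded deque. A minimal sketch, assuming a 5-frame window (the actual window length comes from your config) and using plain integers as stand-ins for frames:

```python
from collections import deque

NUM_FRAMES = 5  # assumed model window length in video frames

window = deque(maxlen=NUM_FRAMES)
clips = []
for frame in range(10):  # stand-in for frames read from a live stream
    window.append(frame)
    if len(window) == NUM_FRAMES:
        # Most recent NUM_FRAMES frames, oldest first. In a real pipeline
        # you would also slice the matching audio span, extract MFCCs,
        # and feed both to the model here.
        clips.append(list(window))
```

The deque automatically drops the oldest frame once it is full, so each iteration yields a window shifted by one frame.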
Predict
Once you have your video frames and MFCCs, you need to put them in the right input format. This is basically going to be something like a list [video, audio] where video and audio are numpy arrays of the frames and MFCCs. See https://github.com/tuanchien/asd/blob/190c1c6d155b16a27717596d6350598e5cd4ffac/ava_asd/generator.py#L67
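As a rough illustration of that input format. The shapes here are assumptions taken from later in this thread (5 grayscale 100x100 face crops and a 13x20 MFCC window per sample), so check them against your own config:

```python
import numpy as np

# Batch of one sample; the trailing dims are channel axes.
video = np.zeros((1, 5, 100, 100, 1), dtype=np.float32)  # frame crops
audio = np.zeros((1, 13, 20, 1), dtype=np.float32)       # MFCC window

model_input = [audio, video]  # audio-first order, as used later in this thread
```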
Next, you want to load the keras model and call the predict
api.
See https://keras.io/api/models/model_training_apis/#predict-method
Basically you load the model, call predict with the input, and it will spit out a list of predictions for each label, e.g., (0.3, 0.7), representing their scores for each class (speaking, not speaking). Those numbers should add to 1.
The examples in this project are roughly doing this, but to evaluate against other data. See https://github.com/tuanchien/asd/blob/190c1c6d155b16a27717596d6350598e5cd4ffac/evaluate.py#L56
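Interpreting the output described above might look like this (dummy scores, not a real prediction; the class order follows the (speaking, not speaking) convention mentioned above):

```python
import numpy as np

# Hypothetical output of model.predict(): one score pair per sample
# in the batch, each pair summing to ~1.
preds = np.array([[0.3, 0.7],
                  [0.9, 0.1]])

labels = preds.argmax(axis=1)  # index of the winning class per sample
```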
Hope this helps!
@tuanchien thanks for your reply. I'll see what I can do following your idea. Hopefully I'll be able to run inference with my own data.
Could you also point me to the AVA file format I should follow to mimic AVA data with my own frames?
Never mind, I've found the AVA paper with the CSV file structure. I guess I'll try to handle this. I'll post an update later.
@tuanchien hello again. I was able to use your extract.py to generate all the necessary files for my video. However, there are two questions I'd like to ask:
In evaluate.py there are three outputs from the model: audio_pred, video_pred, main_pred. Am I right in thinking that audio_pred is the prediction of who is speaking based on the audio channel only, video_pred is the same but based on video only, and main_pred is the prediction based on the combination of audio and video?
This one is kind of tricky. Since I want to evaluate the full video, I tried to make annotations for every frame. So when launching the annotation extraction, I commented out delete_small_tracks and random.shuffle(data). But I wasn't able to launch evaluate with those annotations, because I was getting start = -13 in generator.py, in the insert_audio function. If I leave delete_small_tracks in, it evaluates with no problem. Could this be related to the 10 fps setting in the config? Do I need to set it to 1 fps instead and relaunch all the extractions?
Thanks in advance.
P.S.: I've got a third question. The model returns predictions with shape (16, 2). What does the 2 mean? The probability of the first person speaking and the second person speaking? (I have two track_ids.) And 16, I guess, is the batch size. The problem is that I need to draw a rectangle around the speaking person, so I need to correlate the output with my video data. Is there a way to get the image path for each output? I can't step into generator.py with a debugger, since it is used by the keras model, as far as I know.
@tuanchien
You need both audio and video data for this particular model to work. If you just had audio or just video you would want to use something simpler. You can for example build a model with just the video part of the model, or just the audio part of the model.
What I mean by extra footage is, if you have extra footage before frame 1 in your mp4 that you cropped out, you could stick that back in, and then start labelling from what used to be frame 1.
If you don't, then you could put in blank frames at the start if you want, but the predictions will almost certainly be terrible for those frames at the start.
Well, about audio/video only: I meant that if some particular frame sequence is too noisy (for example, some external noise from a drill behind the wall), we could at least see what the model outputs in video_pred. Or couldn't we?
Well, I have a 6 minute video, and I've extracted frames and audio from it from start to end. Though there is some sort of step, since not every frame was taken from the video. I guess it is the fps option in the config file.
I meant not blank frames, but copies of the first frame (though how many?)
You could look at those video/audio only outputs sure. They might give an indication but I wouldn't rely on it as an accurate indication of what's going on. Would pay to manually inspect the problematic parts.
There is a parameter called ann_stride in the config.yaml that indicates how many frames to skip when generating annotations. This effectively controls the frame skip.
I would recommend just padding out the beginning by say 3s worth of frames and audio (conservative estimate). For each video frame, the audio feature extraction needs enough previous audio data to calculate the feature. https://github.com/tuanchien/asd/blob/master/ava_asd/generator.py#L205
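A minimal sketch of that padding, assuming the clip is a list of frames plus a 1-D PCM array. The function name and the fps/sample-rate defaults are placeholders, not values from the repo's config:

```python
import numpy as np

FPS = 10           # assumed video frame rate
SR = 16000         # assumed audio sampling rate (Hz)
PAD_SECONDS = 3    # the conservative estimate suggested above

def pad_start(frames, pcm, fps=FPS, sr=SR, seconds=PAD_SECONDS):
    """Pad the start of a clip by repeating the first frame and
    prepending silence, so the earliest video frames have enough
    preceding audio for MFCC extraction."""
    n_pad = fps * seconds
    pad_frames = [frames[0]] * n_pad  # or blank frames, as discussed
    silence = np.zeros(sr * seconds, dtype=pcm.dtype)
    return pad_frames + list(frames), np.concatenate([silence, pcm])
```

As noted above, predictions over the padded region will not be meaningful; the point is only to give the real frames enough audio history.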
Alright, thanks for the answers. I was able to launch the model on several videos, though I needed to preprocess them using your extract code. The question is: how can I avoid that? I need to launch it in a near real-time situation, where we don't have the whole video. As I understand it, the model takes 5 frames resized to 100x100 as input (I checked that using cv2.imwrite on each frame in the input), but I don't get what the audio input is. It has dimensions (13, 20) (not counting the 16, which is the batch size, and the trailing 1). How does it correlate with the 5 frames? As I see in your code, 20 is the sequence length of the audio input (like 5 for the video frames). Does this mean 4 audio frames per video frame? Sorry if I'm bothering you too much =) From your last words about "audio feature extraction needs enough previous audio data to calculate the feature", it seems I'm right that each video frame needs several audio frames. But the question about the 13 is still there. Is it some kind of resize?
Never mind, I've checked the MFCC extraction code and your paper again. The 13 comes from librosa.feature.mfcc. As I understand it, it is a preprocessing step applied to the raw audio input. Now I need to preprocess my audio input the same way each time I want to launch the model.
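To make the frame/MFCC alignment concrete: with 4 MFCC columns per video frame and a 5-frame window (numbers taken from this discussion, not verified against the repo's config), the audio slice for one window would be:

```python
import numpy as np

n_mfcc, cols_per_frame, num_frames = 13, 4, 5
mfccs = np.zeros((n_mfcc, 1000))  # dummy (n_mfcc, time) MFCC matrix

start = 0  # MFCC column aligned with the window's first video frame
audio_window = mfccs[:, start:start + num_frames * cols_per_frame]
# -> shape (13, 20): 13 coefficients per step, 4 steps per video frame.
```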
Well, I've hit some problems while trying to emulate real-time usage of your model. It would be really great if you could look at this code and tell me if I'm wrong somewhere:
```python
video_in = cv2.VideoCapture("/mnt/fileserver/shared/datasets/AVA_v2.2/my_data/breaking_news.mp4")
csv_in = open("/mnt/fileserver/shared/datasets/AVA_v2.2/my_data/breaking_news_faces.mp4.csv", "r")
pcm, sr = librosa.load("/mnt/fileserver/shared/datasets/AVA_v2.2/my_data/breaking_news.mp4.wav", sr=None)
csv_in.readline()  # skip the CSV header

fps = video_in.get(cv2.CAP_PROP_FPS)
spf = 1.0 / fps             # seconds per video frame
num_of_frames = 5           # video frames per model input window
audio_win_size = spf / 4.0  # 4 MFCC columns per video frame
audio_stride = 0.001
sample_stride = int(sr * audio_stride)
window = int(sr * audio_win_size)
nmfcc = 13
apply_mean = True
apply_stddev = True

mfccs = librosa.feature.mfcc(pcm, sr, n_mfcc=nmfcc, n_fft=window, hop_length=sample_stride)
normalised_mfccs, mean, stddev = normalise_mfccs(mfccs, apply_mean=apply_mean, apply_stddev=apply_stddev)

continue_flag = True
count = 0
while continue_flag:
    frames = []
    video_frames = []
    bbs = []
    for _ in range(num_of_frames):
        ret, frame = video_in.read()
        if not ret:
            continue_flag = False
            break
        frames.append(frame)
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        line = get_line_with_id(csv_in, 0)
        x1 = int(line[2])
        y1 = int(line[3])
        x2 = x1 + int(line[4])
        y2 = y1 + int(line[5])
        crop = frame[y1:y2, x1:x2]
        crop = cv2.resize(crop, (100, 100))
        video_frames.append(crop)
    if not continue_flag:
        break
    video_frames = np.asarray(video_frames)
    video_frames = np.expand_dims(video_frames, [0, 4])
    audio_frames = normalised_mfccs[:, count:count + (num_of_frames * 4)]
    audio_frames = np.expand_dims(audio_frames, [0, 3])
    count += num_of_frames * 4
    _audio_pred, y_video_pred, y_main_pred = model.predict([audio_frames, video_frames], verbose=1)
```
The function get_line_with_id gives me the next line with speaker id 0, so I'm sending only one speaker to the model each time. I'm trying to get 5 frames and the corresponding audio (4 audio frames for each video frame). The problem is that the model always returns that the speaker is speaking. I'm trying to find where I'm wrong, but it would be great if you could assist. Thanks in advance.
To start with, I suggest double-checking that all the parameters match what your model expects (e.g., the fps) and that the functions you call return what you expect. You may want to visualise the video frames and audio array to see if they match what your annotation says.
Your while loop is also effectively skipping 5 frames each iteration. Make sure that is the behaviour you want.
Good luck debugging.
Hello again. I was on holiday, so I've only just returned to this task.
The problem in that code was that my video_frames were int32 in [0, 255], not float32 in [0, 1]. After fixing that, I could get some results. But I still have some questions about the audio frames, specifically about two parameters in your config file: mfcc_window_size and stride.
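For anyone hitting the same issue, the fix amounts to something like this (the helper name is mine; it assumes pixel values stored as integers in [0, 255]):

```python
import numpy as np

def normalise_frames(video_frames):
    """Convert integer pixels in [0, 255] to float32 in [0, 1],
    matching the range the model was trained on."""
    return np.asarray(video_frames, dtype=np.float32) / 255.0
```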
See https://github.com/tuanchien/asd/blob/190c1c6d155b16a27717596d6350598e5cd4ffac/extract.py#L137
The window size corresponds to n_fft and the stride to the hop_length of librosa.feature.mfcc. The relevant documentation is at https://librosa.org/doc/latest/generated/librosa.feature.mfcc.html?highlight=mfcc#librosa.feature.mfcc
Well, unfortunately there is no information on n_fft and hop_length on the page in your comment. Some info can be found here: http://man.hubwiz.com/docset/LibROSA.docset/Contents/Resources/Documents/generated/librosa.core.stft.html But it is still not clear.
So it seems my guess on how to get the exact value of mfcc_window_size is wrong? My main question was why you set those particular numbers for stride and mfcc_win_size, and how you calculated them. Oh well, I guess I'll just try out some ideas with different values.
The parameters came from https://arxiv.org/abs/1906.10555 You can experiment with changing those values.
It seems the parameters depend on which sampling rate you use. The MFCC window size is the time window multiplied by the sampling rate. Here is another paper's implementation for reference:
https://github.com/Rudrabha/Wav2Lip/blob/deeec76ee8dba10cad6ef133e068659faf707f1e/hparams.py#L43
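In other words, the sample-based librosa parameters fall out of time-based settings like this. The 25 ms / 10 ms values are common speech-processing defaults used here for illustration, not this repo's config:

```python
SR = 16000            # assumed audio sampling rate (Hz)
win_seconds = 0.025   # MFCC window length in seconds
hop_seconds = 0.010   # stride between windows in seconds

n_fft = int(win_seconds * SR)       # window size in samples
hop_length = int(hop_seconds * SR)  # hop size in samples
```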
Thank you, @eddiecong. That clarifies my question.
Never mind, @DaddyWesker. I am working on the same task as you, writing inference code for this model in real time. I will share my inference code when I'm finished; hopefully we can create a PR with an inference function.
@eddiecong Well, I've more or less finished the inference I needed. My code currently is:
I've currently set the same "window" value for both n_fft=window and hop_length=window, and I'm getting exactly the number of features I need for the video.
Hello!
Thanks for sharing your model! Can you tell me, is it possible to run inference on my own mp4 file? I guess I need to extract frames and audio from it as you do with the downloaded data. Is that right, or is there another way?