sucv / ABAW3

We achieved the 2nd and 3rd places in ABAW3 and ABAW5, respectively.

what is the logic behind "align_word_embedding()"? #16

Open · sbelharbi opened 2 months ago

sbelharbi commented 2 months ago

Hi, can you please give some insight into this function?

The code is not documented, which makes it difficult to understand what is being done, especially without the data from the earlier parts of the pipeline, which failed on my end. From abaw5_preprocessing/base/speech.py:

import os

import numpy as np
import pandas as pd
from tqdm import tqdm


def align_word_embedding(input_path, fps, annotated_idx):
    if os.path.isfile(input_path):
        # Convert annotated frame indices to timestamps in milliseconds.
        annotated_time_stamp = 1 / fps * np.array(annotated_idx) * 1000
        df = pd.read_csv(input_path, header=None, sep=";", skiprows=1)
        aligned_embedding = np.zeros((len(annotated_idx), 768),
                                     dtype=np.float32)

        for idx, stamp in tqdm(enumerate(annotated_time_stamp),
                               total=len(annotated_time_stamp)):
            embedding = np.zeros((1, 768), dtype=np.float32)
            diff = np.sum(np.asarray((stamp - df.values[:, :2] > 0), dtype=int),
                          axis=1)
            idx_nearest = np.where(diff == 1)[0]
            if len(idx_nearest) > 0:
                if len(idx_nearest) > 1:
                    idx_nearest = idx_nearest[0]
                embedding = df.values[idx_nearest, 4:]

            aligned_embedding[idx] = embedding
    else:
        # No transcript file; fall back to all-zero embeddings.
        aligned_embedding = np.zeros((len(annotated_idx), 768),
                                     dtype=np.float32)

    return aligned_embedding

What is this difference computing:

diff = np.sum(np.asarray((stamp - df.values[:, :2] > 0), dtype=int),
              axis=1)

Does it sum timestamps with the BERT features? The first two columns could be something else, since the words were included as the first element. Also, the resulting array is modified before being stored.

Given a BERT feature embedding of shape (n, 768), where n is the number of tokens, how do you align the embedding to each frame? It seems you use idx_nearest to pick the index of a token and retrieve its corresponding features, but it is not clear what diff is.

thanks

sucv commented 1 month ago

Hi, the code populates the extracted word-level embeddings according to their timestamps and a specified time interval.

Suppose you have N embeddings for N words, and each word was spoken at time step t_i. Your goal is to obtain a T x F feature matrix, so that for every interval of t seconds (e.g., every 0.01 s) there is an F-dimensional feature vector.

Since N << T, we need to duplicate the embedding of word i to populate all the feature vectors in between t_{i-1} and t_i. The code does just this.
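
Concretely, stamp - df.values[:, :2] > 0 compares the frame timestamp against each word's start and end times (the first two CSV columns). Summing over axis 1 therefore gives 2 for a word that already ended, 0 for a word that has not started yet, and 1 for the word whose interval contains the stamp, so np.where(diff == 1) picks out the word being spoken at that frame. A toy illustration (made-up numbers, not real data):

import numpy as np

stamp = 150.0                        # frame timestamp in ms
word_times = np.array([[0, 100],     # word ended before the frame -> 2
                       [120, 400],   # word contains the frame     -> 1
                       [500, 900]])  # word not started yet        -> 0
diff = np.sum((stamp - word_times > 0).astype(int), axis=1)
print(diff)                    # [2 1 0]
print(np.where(diff == 1)[0])  # [1], the index of the word being spoken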

Once you get the big picture, you may implement your own version. My code is terrible and is restricted by my pipeline.
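
If it helps, here is a rough, self-contained sketch of the same idea (not my actual pipeline code; it assumes the CSV layout implied by the indexing above: two timestamp columns in milliseconds, two metadata columns such as the word itself, then the 768 embedding values):

import numpy as np
import pandas as pd

def align_word_embedding_sketch(csv_path, fps, annotated_idx, emb_dim=768):
    # Timestamp (in ms) of each annotated frame index.
    stamps = np.asarray(annotated_idx) / fps * 1000.0

    # Assumed layout: col 0 = word start (ms), col 1 = word end (ms),
    # cols 2-3 = metadata, cols 4 onward = the embedding dimensions.
    df = pd.read_csv(csv_path, header=None, sep=";", skiprows=1)
    starts = df.values[:, 0].astype(np.float64)
    ends = df.values[:, 1].astype(np.float64)
    embeddings = df.values[:, 4:4 + emb_dim].astype(np.float32)

    # Frames with no overlapping word keep an all-zero embedding.
    aligned = np.zeros((len(stamps), emb_dim), dtype=np.float32)
    for i, t in enumerate(stamps):
        # start < t <= end, i.e., the frame falls inside the word's
        # spoken interval (the same condition diff == 1 encodes).
        inside = np.where((t > starts) & (t <= ends))[0]
        if inside.size > 0:
            aligned[i] = embeddings[inside[0]]  # first match wins
    return aligned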