sucv / ABAW3

We achieved the 2nd and 3rd places in ABAW3 and ABAW5, respectively.

what is the logic behind "align_word_embedding()"? #16

Open · sbelharbi opened 2 months ago

sbelharbi commented 2 months ago

Hi, can you please give some insight into this function?

The code is not documented, which makes it difficult to understand what is being done, especially without the data from the earlier parts of the pipeline, which failed on my end. From abaw5_preprocessing/base/speech.py:

import os

import numpy as np
import pandas as pd
from tqdm import tqdm


def align_word_embedding(input_path, fps, annotated_idx):
    if os.path.isfile(input_path):
        # Convert annotated frame indices to timestamps in milliseconds.
        annotated_time_stamp = 1 / fps * np.array(annotated_idx) * 1000
        df = pd.read_csv(input_path, header=None, sep=";", skiprows=1)
        aligned_embedding = np.zeros((len(annotated_idx), 768),
                                     dtype=np.float32)

        for idx, stamp in tqdm(enumerate(annotated_time_stamp),
                               total=len(annotated_time_stamp)):
            embedding = np.zeros((1, 768), dtype=np.float32)
            diff = np.sum(np.asarray((stamp - df.values[:, :2] > 0), dtype=int),
                          axis=1)
            idx_nearest = np.where(diff == 1)[0]
            if len(idx_nearest) > 0:
                if len(idx_nearest) > 1:
                    idx_nearest = idx_nearest[0]
                embedding = df.values[idx_nearest, 4:]

            aligned_embedding[idx] = embedding
    else:
        # No transcript file; fall back to all-zero embeddings.
        aligned_embedding = np.zeros((len(annotated_idx), 768),
                                     dtype=np.float32)

    return aligned_embedding

What is this difference computing:

diff = np.sum(np.asarray((stamp - df.values[:, :2] > 0), dtype=int),
              axis=1)

Does it sum timestamps with the BERT features? The first two columns could be something else, since the words were included as the first element. Also, the resulting array is modified before being stored.

Given a BERT feature embedding of shape (n, 768), where n is the number of tokens, how do you align the embedding to each frame? It seems you use idx_nearest to pick the index of a token and retrieve its corresponding features, but it is not clear what diff is.

thanks

sucv commented 1 month ago

Hi, the code populates the extracted word-level embeddings according to their timestamps and a specified time interval.

Suppose you have N embeddings for N words, and each word was spoken at time step t_i. Your goal is to obtain a T x F feature matrix, so that for every interval of t seconds (e.g., every 0.01 s) there is an F-dimensional feature vector.

Since N << T, we need to duplicate the embedding of word i to populate all the feature vectors in between t_{i-1} and t_i. The code does just this.
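
Concretely, stamp - df.values[:, :2] > 0 compares the frame timestamp against each word's start and end times (the first two CSV columns). Summing over axis 1 therefore gives 2 for a word that already ended, 0 for a word that has not started yet, and 1 for the word whose interval contains the stamp, so np.where(diff == 1) picks out the word being spoken at that frame. A toy illustration (made-up numbers, not real data):

import numpy as np

stamp = 150.0                        # frame timestamp in ms
word_times = np.array([[0, 100],     # word ended before the frame -> 2
                       [120, 400],   # word contains the frame     -> 1
                       [500, 900]])  # word not started yet        -> 0
diff = np.sum((stamp - word_times > 0).astype(int), axis=1)
print(diff)                    # [2 1 0]
print(np.where(diff == 1)[0])  # [1], the index of the word being spoken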

Once you get the big picture, you may implement your own version. My code is terrible and is restricted by my pipeline.
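
If it helps, here is a rough, self-contained sketch of the same idea (not my actual pipeline code; it assumes the CSV layout implied by the indexing above: two timestamp columns in milliseconds, two metadata columns such as the word itself, then the 768 embedding values):

import numpy as np
import pandas as pd

def align_word_embedding_sketch(csv_path, fps, annotated_idx, emb_dim=768):
    # Timestamp (in ms) of each annotated frame index.
    stamps = np.asarray(annotated_idx) / fps * 1000.0

    # Assumed layout: col 0 = word start (ms), col 1 = word end (ms),
    # cols 2-3 = metadata, cols 4 onward = the embedding dimensions.
    df = pd.read_csv(csv_path, header=None, sep=";", skiprows=1)
    starts = df.values[:, 0].astype(np.float64)
    ends = df.values[:, 1].astype(np.float64)
    embeddings = df.values[:, 4:4 + emb_dim].astype(np.float32)

    # Frames with no overlapping word keep an all-zero embedding.
    aligned = np.zeros((len(stamps), emb_dim), dtype=np.float32)
    for i, t in enumerate(stamps):
        # start < t <= end, i.e., the frame falls inside the word's
        # spoken interval (the same condition diff == 1 encodes).
        inside = np.where((t > starts) & (t <= ends))[0]
        if inside.size > 0:
            aligned[i] = embeddings[inside[0]]  # first match wins
    return aligned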