pliang279 / MultiBench

[NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning
MIT License
462 stars 68 forks

Process mosei_senti_data.pkl to match the text id in mosei.hdf5 #39

Open ZhuoZHI-UCL opened 1 month ago

ZhuoZHI-UCL commented 1 month ago

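If you are using mosei_senti_data.pkl and want to recover the raw text by matching ids against mosei.hdf5, consider using the following script to process the data. It rewrites each id in the test split to the form key[n], where key is the first element of the original id entry and n counts how many times that key has already appeared.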


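import pickle

import numpy as np
from tqdm import tqdm

# Load the preprocessed pickle; the context manager closes the file afterwards.
with open('data/mosei_senti_data.pkl', 'rb') as f:
    file1 = pickle.load(f)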

data = file1['test']['id']

# Keep only the first element of each id entry and append a running count
# per key, giving ids such as key[0], key[1], ...
modified_data = []
counters = {}
for element in tqdm(data, desc="Processing elements"):
    key = element[0]
    if key not in counters:
        counters[key] = 0
    modified_data.append(f"{key}[{counters[key]}]")
    counters[key] += 1

file1['test']['id'] = np.array(modified_data)

with open('data/mosei_new.pkl', 'wb') as f:
    pickle.dump(file1, f)

print('all done!')
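
For anyone who wants to check the renaming logic before touching the real files, here is a minimal sketch of the same loop as a standalone function, run on toy data. The function name `index_segment_ids`, the toy ids, and the bytes-decoding step are my own assumptions for illustration and are not part of the MultiBench code; whether your pickle stores ids as bytes or as str is something to verify on your copy. Note that the index is simply the order of appearance within the list, so it only lines up with mosei.hdf5 if the segments of each key appear in the same order and none are missing.

```python
from collections import defaultdict


def index_segment_ids(ids):
    """Turn id entries like [key, ...] into strings 'key[0]', 'key[1]', ...

    The counter is kept per key and follows the order of appearance in `ids`.
    """
    counters = defaultdict(int)
    indexed = []
    for element in ids:
        key = element[0]
        # Assumption: ids may be stored as bytes; decode so the f-string
        # does not produce something like "b'vidA'[0]".
        if isinstance(key, bytes):
            key = key.decode("utf-8")
        indexed.append(f"{key}[{counters[key]}]")
        counters[key] += 1
    return indexed


# Hypothetical toy ids: first element is the key, the rest is ignored.
toy_ids = [
    ["vidA", "0.0", "3.2"],
    ["vidA", "3.2", "7.9"],
    ["vidB", "1.0", "4.5"],
    ["vidA", "7.9", "9.1"],
]
print(index_segment_ids(toy_ids))
# → ['vidA[0]', 'vidA[1]', 'vidB[0]', 'vidA[2]']
```

If your pickle also has other splits that need the same treatment, the function could be applied to each of them in turn, with a fresh counter per split as above.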