mira-space / MiraData

Official repo for paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions"
https://mira-space.github.io/
GNU General Public License v3.0
349 stars 9 forks

Clarification needed on discrepancy between Figure 3 of the paper and the actual dataset clip durations. #13

Open hongluzhou opened 1 month ago

hongluzhou commented 1 month ago

Thank you for sharing the code and data! If I understand Figure 3 (from Section 3.2) correctly, it shows that there are over 50k clips with a duration longer than 180 seconds. However, when I checked 'miradata_v1_330k.csv', it seems there are only 35k clips exceeding 180 seconds. I'm confused by the discrepancy. Am I misunderstanding Figure 3?

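import pandas as pd

# Load the released metadata file
df = pd.read_csv('miradata_v1_330k.csv', encoding='utf-8')
print(len(df))
# 330313 will be printed

# Keep only clips strictly longer than 180 seconds
filtered_df = df[df['seconds'] > 180]
print(len(filtered_df))
# 35548 will be printed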
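
To compare against Figure 3 bucket by bucket rather than at a single threshold, the same check can be extended to count clips per duration range. This is only a minimal sketch: the helper name `count_by_duration` and the bin edges are hypothetical and not taken from the paper, and the tiny sample frame is made up so the snippet runs without the CSV. It assumes, as in the check above, that the `seconds` column holds each clip's duration.

```python
import pandas as pd


def count_by_duration(df, edges, column="seconds"):
    """Count rows per duration bucket.

    Buckets are right-closed, i.e. (lo, hi], so the last bucket
    (180, inf] matches the `df['seconds'] > 180` filter.
    """
    buckets = pd.cut(df[column], bins=edges)
    # Sort by bucket order instead of by count
    return buckets.value_counts().sort_index()


# Hypothetical bin edges in seconds, chosen only for illustration
EDGES = [0, 60, 120, 180, float("inf")]

# Made-up sample data; replace with the real frame, e.g.
# df = pd.read_csv('miradata_v1_330k.csv', encoding='utf-8')
sample = pd.DataFrame({"seconds": [12.0, 75.5, 180.0, 240.3, 59.9]})

counts = count_by_duration(sample, EDGES)
print(counts)
```

Running `count_by_duration(df, EDGES)` on the real CSV with the bin edges actually used in Figure 3 (which I could not determine from the paper) should show which duration ranges account for the gap.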