mlfoundations / datacomp

DataComp: In search of the next generation of multimodal datasets
http://datacomp.ai/

Average caption length for CommonPool #81

Closed BIGBALLON closed 2 months ago

BIGBALLON commented 3 months ago

Is there any dataset analysis for CommonPool (small / medium / large / xlarge), especially the average caption length?

sagadre commented 2 months ago

Hi @BIGBALLON, check out Appendix I of the paper for statistics: https://arxiv.org/abs/2304.14108. The average caption length for the small pool is 19.60 tokens. Hope this helps!
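
For anyone who wants to compute the same kind of statistic on their own subset of captions, here is a minimal sketch. It is not the code used for the paper: the function names `whitespace_tokenize` and `average_caption_length` are hypothetical, and splitting on whitespace is an assumption, since this thread does not say which tokenizer produced the 19.60 figure. Numbers from this sketch may therefore differ from the ones in Appendix I.

```python
from typing import Callable, Iterable, List


def whitespace_tokenize(caption: str) -> List[str]:
    """Split a caption on whitespace (assumed tokenizer, for illustration only)."""
    return caption.split()


def average_caption_length(
    captions: Iterable[str],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> float:
    """Return the mean number of tokens per caption."""
    total_tokens = 0
    num_captions = 0
    # Stream over the captions so large pools need not fit in memory.
    for caption in captions:
        total_tokens += len(tokenize(caption))
        num_captions += 1
    if num_captions == 0:
        raise ValueError("no captions provided")
    return total_tokens / num_captions


if __name__ == "__main__":
    # Made-up captions, not taken from CommonPool.
    sample = [
        "a photo of a cat",
        "red bicycle leaning against a wall",
        "sunset",
    ]
    print(average_caption_length(sample))
```

To use a different tokenizer, pass any callable that maps a string to a list of tokens as the `tokenize` argument.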