unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
https://unum-cloud.github.io/uform/
Apache License 2.0

Releasing training dataset #5

skull8888888 closed 1 year ago

skull8888888 commented 1 year ago

First of all, great work, and congrats on the release. I was wondering whether you are planning to release the cleaned-up 4M dataset?

kimihailv commented 1 year ago

Hello. Thank you for your interest. We aren't going to release our training dataset. However, we can reveal some details:

1) Our dataset consists of publicly available datasets: COCO, Visual Genome, SBU (~800k samples in total) + CC12M (3.2M)
2) We use CLIP scores for filtering, and we also use some filtering ideas from this paper: https://arxiv.org/pdf/2207.07635.pdf
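
For readers unfamiliar with CLIP-score filtering, here is a minimal sketch of the general idea, not the actual UForm pipeline. It assumes image and text embeddings were already computed by some CLIP-style model, and keeps only the pairs whose cosine similarity clears a cutoff. The function names, the toy 2-D embeddings, and the 0.3 threshold are all hypothetical; the thread does not state which threshold or model was used.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_clip_score(pairs, threshold=0.3):
    """Keep (caption, score) for pairs whose image/text similarity >= threshold.

    `pairs` holds (image_embedding, text_embedding, caption) tuples.
    The default threshold is an illustrative assumption, not a value from the thread.
    """
    kept = []
    for image_emb, text_emb, caption in pairs:
        score = cosine_similarity(image_emb, text_emb)
        if score >= threshold:
            kept.append((caption, score))
    return kept


# Toy embeddings standing in for real CLIP outputs (assumption for illustration).
pairs = [
    ([1.0, 0.0], [0.9, 0.1], "a dog on a beach"),   # caption matches the image
    ([1.0, 0.0], [0.0, 1.0], "click here to buy"),  # caption unrelated to the image
]
kept = filter_by_clip_score(pairs)
print([caption for caption, _ in kept])
```

In practice the embeddings would come from a pretrained CLIP checkpoint, and the cutoff would be tuned per source dataset, since noisier web-scraped captions tend to need stricter filtering than curated ones.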

PoetCoderJun commented 1 year ago

> Hello. Thank you for your interest. We aren't going to release our training dataset. However, we can reveal some details:
>
>   1. Our dataset consists of publicly available datasets: COCO, Visual Genome, SBU (~800k samples in total) + CC12M (3.2M)
>   2. We use CLIP scores for filtering, and we also use some filtering ideas from this paper: https://arxiv.org/pdf/2207.07635.pdf

Will there be a paper released later?