I have collected the download addresses for all the training data and posted them here for others to download conveniently.

Anymake commented 1 year ago

I am reproducing the model on V100 GPU. If anyone is doing the same, I hope we can get in touch and exchange ideas. My WeChat: Anymake_ren

1. Flickr30k: http://shannon.cs.illinois.edu/DenotationGraph/data/index.html

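2. Visual Genome (VG): the dataset consists of four main parts. Region Descriptions: each image is divided into regions, and each region has a corresponding natural-language description. Region Graphs: the objects, attributes, and relationships in each region are extracted to form a local scene graph. Scene Graph: all the region graphs of an image are merged into one global scene graph. QA: each image has multiple question-answer pairs of two types, region-based and freeform. The former are asked about a region description and relate directly to that local region; the latter are asked about the whole image. https://homes.cs.washington.edu/~ranjay/visualgenome/api.html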

3. LLaVA-CC3M-Pretrain-595K: https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K/tree/main

4. LLaVA-Instruct-150K (the images are from COCO 2014): https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/tree/main

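5. CLEVR: a synthetic dataset of visual scenes built from simple geometric shapes. Its questions always require a long chain of reasoning. To allow detailed evaluation of reasoning ability, all questions fall into five categories: querying attribute, comparing attributes, existence, counting, and integer comparison. All questions are programmatically generated. The human-annotated subset of the dataset is CLEVR-Humans. https://cs.stanford.edu/people/jcjohns/clevr/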

6. GQA (the images are about 20 GB): https://cs.stanford.edu/people/dorarad/gqa/download.html

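7. Visual7W: Grounded Question Answering in Images. Visual7W is a dataset for image content understanding. It supports Visual Question Answering (VQA) through textual descriptions of image regions and the relations between them, so it contains not only the images but also question-answer pairs tied to the content of image regions. Visual7W is a subset of Visual Genome and contains 47,300 COCO images, 327,939 QA pairs, 1,311,756 human-generated multiple choices, and 561,459 object groundings covering 36,579 categories. The questions are built around What, Where, How, When, Who, Why, and Which. They are multiple-choice, with four candidate answers each. http://ai.stanford.edu/~yukez/visual7w/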

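8. VCR (Visual Commonsense Reasoning): a large-scale dataset for visual commonsense reasoning. It poses challenging questions about images, and a model must complete two subtasks: answer the question correctly and provide a rationale that justifies the answer. The dataset contains 212K questions for training, 26K for validation, and 25K for testing. The answers and rationales come from more than 110K unique movie scenes. https://visualcommonsense.com/download/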

9. VQAv2 dataset: https://visualqa.org/download.html

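10. VQA-E (Visual Question Answering with Explanation): a VQA dataset with explanations, for which a model must predict the answer and also generate an explanation. It is derived automatically from VQA v2 by synthesizing a textual explanation for each image-question-answer triple, which makes the answering process easier to understand and trace. COCO images: training images [83K/13GB], validation images [41K/6GB]. https://github.com/liqing-ustc/VQA-E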

11. VQA-X (2018), from "Multimodal Explanations: Justifying Decisions and Pointing to the Evidence": a dataset with both textual explanations and visual grounding. The images are from COCO 2014.
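
For anyone scripting the downloads, here is a minimal Python sketch that collects the links above in one place. It is only an illustration, not part of the Shikra repo. Most of the links are landing pages where the actual files still have to be picked by hand, so only the two Hugging Face repos (items 3 and 4) get a direct-file helper. The helper names `hf_file_url` and `download` are made up for this example, and the filename `llava_instruct_150k.json` is an assumption that should be checked against the repo's file listing.

```python
import shutil
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlopen

# Landing pages copied from the list above (pages, not direct file links).
DATASET_PAGES = {
    "Flickr30k": "http://shannon.cs.illinois.edu/DenotationGraph/data/index.html",
    "Visual Genome": "https://homes.cs.washington.edu/~ranjay/visualgenome/api.html",
    "CLEVR": "https://cs.stanford.edu/people/jcjohns/clevr/",
    "GQA": "https://cs.stanford.edu/people/dorarad/gqa/download.html",
    "Visual7W": "http://ai.stanford.edu/~yukez/visual7w/",
    "VCR": "https://visualcommonsense.com/download/",
    "VQAv2": "https://visualqa.org/download.html",
    "VQA-E": "https://github.com/liqing-ustc/VQA-E",
}

# Hugging Face dataset repos from items 3 and 4.
HF_REPOS = {
    "LLaVA-CC3M-Pretrain-595K": "liuhaotian/LLaVA-CC3M-Pretrain-595K",
    "LLaVA-Instruct-150K": "liuhaotian/LLaVA-Instruct-150K",
}


def hf_file_url(repo_id, filename, revision="main"):
    """Build the direct-download URL for one file in a Hugging Face dataset repo."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{quote(filename)}"


def download(url, dest_dir="data"):
    """Stream `url` into `dest_dir`; skip the download if the file already exists."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / url.rsplit("/", 1)[-1]
    if dest.exists():
        return dest
    # Write to a temporary name first so an interrupted run is not mistaken
    # for a finished file on the next attempt.
    tmp = dest.with_name(dest.name + ".part")
    with urlopen(url) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out)
    tmp.replace(dest)
    return dest


if __name__ == "__main__":
    # Assumed filename; verify it in the repo's "Files" tab before running.
    url = hf_file_url(HF_REPOS["LLaVA-Instruct-150K"], "llava_instruct_150k.json")
    print(download(url))
```

VQA-X is left out of the dictionary because no link was given for it above. For the large image archives, a resumable tool such as `wget -c` may be a better fit than this simple streaming copy.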

GaoXiaoshan commented 1 year ago

Adding a COCO 2014 download link hosted in China: https://developer.aliyun.com/article/797577?accounttraceid=0c07a70a5c3b40df97d3692b1fb519d7ckem

GaoXiaoshan commented 1 year ago

Visual7W dataset: https://pan.baidu.com/s/1kVNUTrL (Baidu Netdisk password: 6wge)

weisili2016 commented 4 months ago

tks

Edisonhimself commented 4 months ago

Visual7W dataset: https://pan.baidu.com/s/1kVNUTrL (Baidu Netdisk password: 6wge)

Which large language model does Shikra use as its LLM? Could you let me know?

zxrys commented 4 weeks ago

Thank you so much!!!
