lucidrains / DALLE-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch
MIT License
5.57k stars 643 forks

Ideas for datasets to use? (WIT just came out) #110

Closed afiaka87 closed 3 years ago

afiaka87 commented 3 years ago

Hey all,

I'm compiling a list of the various datasets we'll need and how to download them:

Keep in mind that not all of these datasets ship with captions. Many of them do, however, ship with a class descriptor of some type. I've only done mild testing with this, but you can usually generate labels by doing something like "an image of {class_name}". I'm not sure what the best way to go about that would be, though.
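As a quick illustration of the class-descriptor idea, a minimal sketch might look like this (the example labels are hypothetical placeholders, not from any particular dataset):

```python
# Minimal sketch: wrap bare class descriptors in a caption template.
# The example labels are hypothetical; any dataset's class names work.
def caption_from_class(class_name):
    """Turn a class descriptor into a pseudo-caption."""
    return "an image of {}".format(class_name.replace("_", " "))

labels = ["golden_retriever", "fire_truck"]
print([caption_from_class(c) for c in labels])
# ['an image of golden retriever', 'an image of fire truck']
```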

https://github.com/lucidrains/DALLE-pytorch/discussions/109

As it stands, this is turning out to be humongous. I just added the new Wikipedia dataset (11 million images).

Does anyone know of other captioned datasets we could use?

robvanvolt commented 3 years ago

The yfcc100m dataset can be added to the list: https://pypi.org/project/yfcc100m/

I'm trying to download it at the moment; the above package makes it quite convenient!

python -m yfcc100m.download <meta-folder> -o <zip-folder> --filter 'lambda x: "0" in str(x["photoid"])[0] and len(x["description"]) >= 50'

This can be run after the metadata file (65 GB) has been downloaded.

{"photoid": 8140811444, "uid": "63066451@N04", "unickname": "Nikon+D600", "datetaken": "2011-05-22 08:08:01.0", "dateuploaded": "1351663754", "capturedevice": "NIKON+CORPORATION+NIKON+D3000", "title": "%E8%A5%BF%E6%B9%96%E6%B6%8C%E9%87%91%E5%B9%BF%E5%9C%BA%E4%B8%80%E8%A7%92", "description": "", "usertags": "", "machinetags": "", "longitude": null, "latitude": null, "accuracy": null, "pageurl": "http://www.flickr.com/photos/63066451@N04/8140811444/", "downloadurl": "http://farm9.staticflickr.com/8055/8140811444_a0d79bfa72.jpg", "licensename": "Attribution-NonCommercial-NoDerivs License", "licenseurl": "http://creativecommons.org/licenses/by-nc-nd/2.0/", "serverid": 8055, "farmid": 9, "secret": "a0d79bfa72", "secretoriginal": "3586eb413d", "ext": "jpg", "marker": 0, "key": "0afa8cdfaae4062b525eecfcb6fd"}

These are the metadata fields. I don't know what the best filter is; maybe x["description"] != "" would be enough. The photoid filter is just there so the download can be split into 10 chunks (photoid starting with 0-9) and doesn't have to be done all at once.
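The predicate inside that --filter string can be sketched on its own, applied to the sample record above (field names are taken from that record; the behavior shown is an assumption about how the package calls the filter):

```python
# Sketch of the predicate passed via --filter: keep records whose
# photoid starts with "0" (one of ten chunks) and whose description
# has at least 50 characters. Field names follow the sample metadata
# record in this thread.
def keep(x):
    return "0" in str(x["photoid"])[0] and len(x["description"]) >= 50

sample = {"photoid": 8140811444, "description": ""}
print(keep(sample))  # False: photoid starts with "8" and description is empty
```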

Also, the usertags and the title are sometimes quite helpful; you could generate captions like "This picture shows {}" or similar.

Yashbonde suggested such a caption generator:

templates_labels = [
    "a picture of {}",
    "a photo that has {}",
    "photo consisting of {}",
    "a low resolution photo of {}",
    "small photo of {}",
    "high resolution picture of {}",
    "low resolution picture of {}",
    "high res photo that has {}",
    "low res photo of {}",
    "{} in a photo",
    "{} in a picture",
    "rendered picture of {}",
    "jpeg photo of {}",
    "a cool photo of {}",
    "{} rendered in a picture",
]

templates_maybe = [
    *[x + " and maybe containing {}" for x in templates_labels],
    *[x + " and possibly containing {}" for x in templates_labels],
    *[x + " and {} but not sure" for x in templates_labels],
    *[x + " also roughly {}" for x in templates_labels],
]

templates_indoor = [
    "indoor picture of {}",
    "picture inside of {}",
    "picture of {} from inside",
]

templates_food = [
    "picture of {}, a food item",
    "photo of food {}",
    "nice photo of food {}",
    "picture of food item {}",
    "picture of dish {}",
    "picture of {}, a food dish",
    "gourmet food {}",
]

templates_svhn = [
    "a picture of house number '{}'",
    "number '{}' written in front of a house",
    "street house number '{}' written on a door",
    "a photo with number '{}' written in it",
    "number '{}' written on a door",
    "photograph of number '{}'",
]

https://github.com/yashbonde/dall-e-baby/blob/master/generate_captions.py

afiaka87 commented 3 years ago

Thanks @robvanvolt, adding it.
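As an aside, the caption templates quoted above can be exercised like this (a minimal sketch using a small subset of the lists; the make_caption helper is hypothetical, not part of the original generator):

```python
import random

# Small subset of the template lists quoted above.
templates_labels = [
    "a picture of {}",
    "a photo that has {}",
    "photo consisting of {}",
]
templates_maybe = [x + " and maybe containing {}" for x in templates_labels]

def make_caption(label, extra=None):
    """Fill a randomly chosen template with the class label(s)."""
    if extra is None:
        return random.choice(templates_labels).format(label)
    return random.choice(templates_maybe).format(label, extra)

print(make_caption("a dog"))            # e.g. "a photo that has a dog"
print(make_caption("a dog", "a ball"))  # e.g. "a picture of a dog and maybe containing a ball"
```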
afiaka87 commented 3 years ago

> Yashbonde suggested such a caption generator: [...]
> https://github.com/yashbonde/dall-e-baby/blob/master/generate_captions.py

@robvanvolt @lucidrains This in particular is very useful. Is Yashbonde interested in helping with the project? I reached out to him on GitHub issues but haven't gotten a response.
If anyone knows him, let him know that we could use his help with the datasets.

robvanvolt commented 3 years ago

@afiaka87 I had a quick request a few weeks ago and got an answer within hours - I can try again and ask if he would like to participate!

TheodoreGalanos commented 3 years ago

The fashion dataset might be nice for perfecting our mannequins: http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html

I'm also trying to curate a building-layouts dataset, but it's taking a while. If I manage something, I'll share it in the list.

yashbonde commented 3 years ago

Hello,

@afiaka87 saw your comment on my repo - glad you reached out. Here are some relevant points about the dataset I had prepared:

- The number of unique sentences I generated was small; we would ideally need a wide variety.
- When trained only on the COCO + Flickr8k + OpenGenome datasets, the images generated from scratch had only a few textures and no complete patterns.
  (I'll pull the code and files from the server and push them to my repo.)
- You can get the YFCC images from this GDrive link (66 GB): https://drive.google.com/uc?export=download&id=1TXHt-0oLig4MbAHMRuBezdVjclYZzLFc
- When trained with the caption generator, it would generate images that had only one colour (only one token is predicted).
- I did not explore using a paraphrasing engine to get more captions.
- The Wikipedia dataset is huge and can be a great starting place.

I'll help out with some parts - just let me know. @robvanvolt, I agree, let's build something kick-ass! ☄️🪐

robvanvolt commented 3 years ago

Awesome @yashbonde!

If we work together to create a MEGA public dataset (similar to the one used by OpenAI, or even better), consisting of many curated big datasets (yfcc100m, Open Images) and a few specialized ones (house numbers, fashion, scenery, botanics, space, animals, ...), we might at some point have a really good basis for training, and can figure out how to increase batch size / effectiveness (is 128x128 px enough? a higher compression rate? increase the dataset even more by translating captions? keep only 1:1-aspect-ratio images for better training?).

Maybe Google might even lend us the currently non-public JFT-300M dataset ( https://paperswithcode.com/dataset/jft-300m ) in the near future...
x)

afiaka87 commented 3 years ago

> @afiaka87 saw your comment on my repo - glad you reached out. Here are some relevant points about the dataset I had prepared: [...]

Fantastic, thanks! We'd definitely love to use your code for generating captions as a start!

sorrge commented 3 years ago

What's in the yfcc_images.tar.gz file?
Can it be the entire YFCC100M?

jp-krow commented 3 years ago

Would it be possible to modify the already-existing DALLE-pytorch to add support for multiple people running the same training run?

Like, allowing multiple people to contribute to the speed of the run by opening a server we can communicate with?

yashbonde commented 3 years ago

So @afiaka87 @robvanvolt, what are the next steps?

afiaka87 commented 3 years ago

@yashbonde

Edit: I made a few comments on useful code here: https://github.com/yashbonde/dall-e-baby/issues/5

Well, since my discovery that anything less than a large dataset is largely a waste of (my own personal) vast.ai compute, I'm still working on downscaling and re-uploading a lot of this.

I think a good next step would be a DataLoaders folder in the root of our repository here, containing the correct DataLoader for each dataset.

@lucidrains Would that be a good idea in your opinion?
Or should we make a separate repository for training efforts?

afiaka87 commented 3 years ago

> Would it be possible to modify the already-existing DALLE-pytorch to add support for multiple people running the same training run? Like, allowing multiple people to contribute to the speed of the run by opening a server we can communicate with?

I *think* what you're getting at is essentially distributed mesh training. As far as I know, this is an unsolved problem, ha.

More realistically, I think there *might* be a way to let multiple people train this? It couldn't really be just random folks on the internet - it would have to be a select few people downloading chunks of the dataset and training on them separately. I imagine, however, that an effort like dalle-mesh is needed to really train this thing the way OpenAI actually did.

The GPT-Neo folks are switching from TPUs to GPUs. They could probably give us some insight/help in the coming months once those efforts are fleshed out.

afiaka87 commented 3 years ago

> So @afiaka87 @robvanvolt, what are the next steps?

@yashbonde

I've made a few comments on your dall-e-baby repository.
https://github.com/yashbonde/dall-e-baby/issues/5

jp-krow commented 3 years ago

> > Would it be possible to modify the already-existing DALLE-pytorch to add support for multiple people running the same training run? [...]
>
> I *think* what you're getting at is essentially distributed mesh training. [...] The GPT-Neo folks are switching from TPUs to GPUs.
> They could probably give us some insight/help in the coming months once those efforts are fleshed out.

I definitely agree that random people can't just join in at their own leisure; there will obviously be a team / group that accepts honest people.

Speaking of which, have you thought of making a dedicated team / group for your project?

robvanvolt commented 3 years ago

> > > Would it be possible to modify the already-existing DALLE-pytorch to add support for multiple people running the same training run? [...]
> >
> > I *think* what you're getting at is essentially distributed mesh training. [...] The GPT-Neo folks are switching from TPUs to GPUs.
> > They could probably give us some insight/help in the coming months once those efforts are fleshed out.
>
> I definitely agree that random people can't just join in at their own leisure; there will obviously be a team / group that accepts honest people.
>
> Speaking of which, have you thought of making a dedicated team / group for your project?

We'll try, as soon as the dataset is ready, to form a group of 'trusted' people who collaborate / train together - of course, many issues remain and still have to be resolved before then. :)

afiaka87 commented 3 years ago

While much of this conversation may be more appropriate on the Discord, I'll just say that my meager attempts at compiling a dataset are starting to look somewhat pale in comparison to the efforts of others who have reached out on there.

It looks like we'll be able to obtain a dataset of the same scale as the OpenAI team's without issue.
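As a closing note on the DataLoaders-folder idea suggested earlier in the thread, one per-dataset loader might look roughly like this (the class name, directory layout, and captions.json format are all hypothetical; a real version would subclass torch.utils.data.Dataset and decode the images):

```python
import json
from pathlib import Path

# Sketch of a per-dataset loader. Assumes a hypothetical layout: a
# folder of images plus a captions.json mapping file name -> caption.
class CaptionedImageFolder:
    def __init__(self, root):
        self.root = Path(root)
        self.captions = json.loads((self.root / "captions.json").read_text())
        self.files = sorted(self.captions)

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Return the image path and its caption; a real Dataset would
        # load and transform the image tensor here.
        name = self.files[idx]
        return self.root / name, self.captions[name]
```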