The yfcc100m dataset can be added to the list: https://pypi.org/project/yfcc100m/
I'm trying to download it at the moment, and the above package makes it quite convenient!
python -m yfcc100m.download <meta-folder> -o <zip-folder> --filter 'lambda x: "0" in str(x["photoid"])[0] and len(x["description"]) >= 50'
This can be used after the metadata file (65 GB) has been downloaded.
{"photoid": 8140811444, "uid": "63066451@N04", "unickname": "Nikon+D600", "datetaken": "2011-05-22 08:08:01.0", "dateuploaded": "1351663754", "capturedevice": "NIKON+CORPORATION+NIKON+D3000", "title": "%E8%A5%BF%E6%B9%96%E6%B6%8C%E9%87%91%E5%B9%BF%E5%9C%BA%E4%B8%80%E8%A7%92", "description": "", "usertags": "", "machinetags": "", "longitude": null, "latitude": null, "accuracy": null, "pageurl": "http://www.flickr.com/photos/63066451@N04/8140811444/", "downloadurl": "http://farm9.staticflickr.com/8055/8140811444_a0d79bfa72.jpg", "licensename": "Attribution-NonCommercial-NoDerivs License", "licenseurl": "http://creativecommons.org/licenses/by-nc-nd/2.0/", "serverid": 8055, "farmid": 9, "secret": "a0d79bfa72", "secretoriginal": "3586eb413d", "ext": "jpg", "marker": 0, "key": "0afa8cdfaae4062b525eecfcb6fd"}
These are the available metadata fields. I don't know what the best filter is; maybe x["description"] != "" would be enough. The filter on the photoid just makes sure that the download can be split into 10 chunks (photoid starting with 0-9) and does not have to be done all at once.
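For illustration, a rough (untested) sketch of how that 10-chunk split could be scripted, reusing the exact CLI invocation shown above; the folder names are placeholders:

# Hypothetical helper: run the yfcc100m downloader once per leading photoid digit,
# so the download is split into 10 independent chunks.
import subprocess

META_FOLDER = "meta"   # placeholder for <meta-folder>
ZIP_FOLDER = "zips"    # placeholder for <zip-folder>

for digit in "0123456789":
    # Keep only photos whose id starts with this digit and that have a long enough description.
    filter_expr = (
        f'lambda x: str(x["photoid"])[0] == "{digit}" '
        'and len(x["description"]) >= 50'
    )
    subprocess.run(
        ["python", "-m", "yfcc100m.download", META_FOLDER,
         "-o", ZIP_FOLDER, "--filter", filter_expr],
        check=True,
    )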
Also, sometimes the usertags and the title are quite helpful, and you could use them to generate captions like "This picture shows ...".
Yashbonde suggested a caption generator like this:
templates_labels = [
"a picture of {}",
"a photo that has {}",
"photo consisting of {}",
"a low resolution photo of {}",
"small photo of {}",
"high resolution picture of {}",
"low resolution picture of {}",
"high res photo that has {}",
"low res photo of {}",
"{} in a photo",
"{} in a picture",
"rendered picture of {}",
"jpeg photo of {}",
"a cool photo of {}",
"{} rendered in a picture",
]
templates_maybe = [
*[x + " and maybe containing {}" for x in templates_labels],
*[x + " and possibly containing {}" for x in templates_labels],
*[x + " and {} but not sure" for x in templates_labels],
*[x + " also roughly {}" for x in templates_labels],
]
templates_indoor = [
"indoor picture of {}",
"picture inside of {}",
"picture of {} from inside",
]
templates_food = [
"picture of {}, a food item",
"photo of food {}",
"nice photo of food {}",
"picture of food item {}",
"picture of dish {}",
"picture of {}, a food dish",
"gourmet food {}",
]
templates_svhn = [
"a picture of house number '{}'",
"number '{}' written in front of a house",
"street house number '{}' written on a door",
"a photo with number '{}' written in it",
"number '{}' written on a door",
"photograph of number '{}'"
]
https://github.com/yashbonde/dall-e-baby/blob/master/generate_captions.py
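To make the intended usage concrete, here is a minimal sketch (the function name and example labels are made up for illustration) of how these templates could be combined with a list of labels or usertags:

# Minimal usage sketch: pick a random template and fill it with one or two labels.
# Relies on the templates_labels / templates_maybe lists defined above.
import random

def make_caption(labels):
    if len(labels) >= 2:
        # templates_maybe entries contain two {} placeholders
        return random.choice(templates_maybe).format(labels[0], labels[1])
    return random.choice(templates_labels).format(labels[0])

print(make_caption(["a dog", "a frisbee"]))  # e.g. "a photo that has a dog and maybe containing a frisbee"
print(make_caption(["a red bicycle"]))       # e.g. "low res photo of a red bicycle"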
Thanks @robvanvolt, adding it.
@robvanvolt @lucidrains This in particular is very useful. Is Yashbonde interested in helping with the project? I reached out to him on GitHub issues but haven't gotten a response. If anyone knows him, let him know that we could use his help with the datasets.
@afiaka87 I made a quick request a few weeks ago and got an answer within hours - I can try again and ask if he would like to participate!
The fashion dataset might be nice to perfect our mannequins: http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html
I'm also trying to curate a building layouts dataset, but it's taking a while. If I manage something, I'll share it in the list.
Hello,
@afiaka87 saw your comment on my repo, glad you reached out. Following are some relevant points about the dataset I had prepared:
- The number of unique sentences I generated was small, and we would ideally need a wide variety.
- When trained only on the COCO + Flickr8k + OpenGenome dataset, the images generated from scratch only had a few textures and not any complete pattern. (I'll pull code and files from the server and push them to my repo.)
- You can get the YFCC images from this GDrive; it's 66 GB.
- When trained with the caption generator, it would generate images that had only one colour (only one token is predicted).
- I did not explore using a paraphrasing engine to get more captions.
- The Wikipedia dataset is huge and can be a great starting place.
I'll help out with some parts, just let me know. @robvanvolt, I agree, let's build something kick-ass! ☄️🪐
Awesome @yashbonde!
If we work together to create a MEGA public dataset (similar to the one used by OpenAI, or even better) consisting of many curated big datasets (yfcc100m, Open Images) and a few specialized ones (house numbers, fashion, scenery, botanics, space, animals, ...), we might at some point have a really good basis for training, and we can then figure out how to increase batch size / training effectiveness (is 128x128 px maybe enough? a higher compression rate? enlarge the dataset even more by translating captions? keep only 1:1-aspect-ratio images for better training?).
Maybe Google might even lend us the currently non-public JFT-300M dataset ( https://paperswithcode.com/dataset/jft-300m ) in the near future... x)
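For what it's worth, a small sketch of what the 1:1 aspect-ratio filtering and 128x128 downscaling mentioned above could look like (the 1.25 ratio threshold is an arbitrary assumption):

# Sketch: keep only roughly square images and downscale them to 128x128.
from pathlib import Path
from PIL import Image

def keep_and_downscale(path, out_dir, size=128, max_aspect_ratio=1.25):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    if max(w, h) / min(w, h) > max_aspect_ratio:
        return False  # skip images that are far from a 1:1 aspect ratio
    img.resize((size, size), Image.LANCZOS).save(Path(out_dir) / Path(path).name)
    return True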
Fantastic, thanks! We'd definitely love to use your code for generating captions as a start!
What's in the yfcc_images.tar.gz file? Can it be the entire YFCC100M?
Would it be possible to modify the already existing DALLE-pytorch to add support for multiple people running the same run?
Like, allowing multiple people to contribute to the speed of the training run by opening a server we can communicate with?
So @afiaka87 @robvanvolt, what are the next steps?
@yashbonde
edit: made a few comments on useful code here: https://github.com/yashbonde/dall-e-baby/issues/5
Well, since discovering that anything less than a large dataset is largely a waste of (my own personal) vast.ai compute, I'm still working on downscaling and re-uploading a lot of this.
I think a good next step would be a DataLoaders folder in the root of our directory here containing the correct DataLoader for each dataset.
@lucidrains Would that be a good idea in your opinion? or should we make a separate repository for training efforts?
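To sketch what one entry in such a DataLoaders folder could look like (the image/.txt naming scheme here is just an assumption for illustration, not an existing convention in the repo):

# Hypothetical generic dataset: pairs each image with a caption stored in a .txt
# file of the same name next to it.
from pathlib import Path
from PIL import Image
import torchvision.transforms as T
from torch.utils.data import Dataset, DataLoader

class ImageCaptionDataset(Dataset):
    def __init__(self, root, image_size=256):
        self.paths = sorted(Path(root).glob("*.jpg"))
        self.transform = T.Compose([
            T.Resize(image_size),
            T.CenterCrop(image_size),
            T.ToTensor(),
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        image = self.transform(Image.open(path).convert("RGB"))
        caption = path.with_suffix(".txt").read_text().strip()
        return image, caption

# loader = DataLoader(ImageCaptionDataset("data/yfcc100m"), batch_size=8, shuffle=True)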
Would it be possible to modify the already existing DALLE-pytorch to add support for multiple people running the same run?
Like, allowing multiple people to contribute to the speed of the training run by opening a server we can communicate with?
I think what you're getting at is essentially distributed mesh training. As far as I know, this is an unsolved problem, ha.
More realistically, I think there might be a way to let multiple people train this? It couldn't really be just random folks on the internet; it would have to be a select few people downloading chunks of the dataset and training on them separately. I imagine, however, that an effort like dalle-mesh is needed to really train this thing the way OpenAI actually did.
The GPTNeo folks are switching from TPU to GPUs. They could probably give us some insight/help in the coming months once those efforts are fleshed out.
I definitely agree with the part that random people can't just join in at their own leisure; there will obviously have to be a team / group that accepts honest people.
Speaking of which, have you thought of making a dedicated team / group for your project?
We'll try, as soon as the dataset is ready, to form a group of 'trusted' people who collaborate / train together - of course, many issues remain and still have to be resolved before then. :)
While much of this conversation may be more appropriate on the Discord, I'll just say that my meager attempts at compiling a dataset are starting to look rather pale in comparison to the efforts of others who have reached out there.
It looks like we'll be able to obtain a dataset of the same scale as the OpenAI team without issue.
Hey all,
I'm compiling a list of the various datasets we'll need and how to download them:
Keep in mind that not all of these datasets ship with captions. However, many of them do ship with a class descriptor of some type. I've only done mild testing with this, but usually you can just generate labels by doing something like "an image of {class_name}". Not sure what the best way to go about that would be, though.
https://github.com/lucidrains/DALLE-pytorch/discussions/109
As it stands, this is turning out to be humongous. I just added the new Wikipedia dataset (11 million images).
Does anyone know of other captioned datasets we could use?