bjornamr opened this issue 8 years ago
Hi @bjornamr!
I can only comment on the technical side of the question (distributing the dataset via torrent). 20 TB (the total size of all images at the original resolution) is much larger than what is considered practical for this technology, as it would require a special hardware setup. I also suspect that checksumming all of the data would take hours at the very least.
It seems that only cloud storage of some kind (S3, GCS, whatever) could serve as a basis for that, but I might be missing some obvious idea.
@gkrasin Hi, I would suggest downscaling the pictures to, for example, 640x480, or even smaller, 256x256. This would reduce the total to about 1 TB, or even around 500 GB, which would be possible to share via torrent. This is, of course, only the case if you have a pretty decent upload speed.
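(For reference, a minimal sketch of this kind of downscaling pass using Pillow; the 640x480 bound comes from the discussion, while the directory names are assumptions, not part of any official tooling.)

```python
from pathlib import Path
from PIL import Image

SRC = Path("images_full")   # hypothetical directory of original-resolution JPEGs
DST = Path("images_640")    # hypothetical output directory
DST.mkdir(exist_ok=True)

for src_path in SRC.glob("*.jpg"):
    with Image.open(src_path) as img:
        # Shrink in place so the image fits inside 640x480,
        # preserving the aspect ratio (thumbnail never upscales).
        img.thumbnail((640, 480))
        img.save(DST / src_path.name, "JPEG", quality=90)
```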
Internally, we use thumbnails of two sizes:
It's our belief that having thumbnails smaller than 640x480 hurts the ability to train an image-level classifier. It hurts even more for other tasks (object detection / localization / segmentation / colorization / etc.).
So, yeah, scaling the thumbnails down to 640x480 will get you into 1 TB territory. That fits on a single machine, but is still on the edge of being practical for distribution via torrent. Anyway, let me know if you ever put this together.
@bjornamr I created a torrent file. All images are resized to 420 px on the small side.
@N01Z3 nice!
By the way, what do these numbers mean?
Train
all: 9011219
downloaded: 8798643
labeled: 8646180
post-download clean: 8591564
Validation
all: 167056
downloaded: 160957
post-download clean: 159847
Does it mean that you have deleted all images without labels? (If that's the case, I would highly recommend putting them back, as the current labels are by no means final, and the images with missing labels are obvious targets for improvement.)
Also, what additional cleaning did you do?
@gkrasin
All: the number of URLs.
Downloaded: the images I was able to download.
Labeled: all images with labels.
Post-download clean: the number of images left after removing all 0 KB files and the white "access error" images.
I didn't delete the unlabeled images, but I didn't include them in the torrent. I'm going to post a URL to the archives; it's about 8 GB.
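(As a rough illustration of what such a "post-download clean" might look like, not N01Z3's actual script: a sketch that drops zero-byte files and anything Pillow cannot decode. Detecting the "unavailable" placeholder images would additionally require comparing against the placeholder's known checksum, which is not reproduced here; the directory name is an assumption.)

```python
from pathlib import Path
from PIL import Image

IMAGE_DIR = Path("train_420")  # hypothetical directory of downloaded thumbnails

removed = 0
for path in IMAGE_DIR.glob("*.jpg"):
    if path.stat().st_size == 0:
        path.unlink()          # 0 KB downloads
        removed += 1
        continue
    try:
        with Image.open(path) as img:
            img.verify()       # cheap integrity check, no full decode
    except Exception:
        path.unlink()          # truncated or non-image responses
        removed += 1

print(f"removed {removed} broken files")
```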
Thank you for the clarification. The "post-download clean" seems reasonable.
I didn't delete the unlabeled images, but I didn't include them in the torrent.
Sorry for the confusion; by "deleting" I meant "not including".
@gkrasin Why not split up the dataset into smaller parts, say 200 parts of 100 GB each, and create torrents of each? That way, it's easier to checksum, and torrents do support files of 100 GB+.
Also, users can add all the torrents in their client and download one by one.
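(A sketch of how files might be greedily grouped into roughly 100 GB shards before building per-shard archives and torrents; the size limit and paths are assumptions taken from the suggestion above, not an existing tool.)

```python
from pathlib import Path

IMAGE_DIR = Path("train_420")   # hypothetical image directory
SHARD_LIMIT = 100 * 1024 ** 3   # ~100 GB per shard, as suggested above

shards, current, current_size = [], [], 0
for path in sorted(IMAGE_DIR.glob("*.jpg")):
    size = path.stat().st_size
    if current and current_size + size > SHARD_LIMIT:
        shards.append(current)
        current, current_size = [], 0
    current.append(path)
    current_size += size
if current:
    shards.append(current)

# One file list per shard; each list can then be fed to `tar -T <list>`
# and the resulting archive turned into its own torrent.
for i, shard in enumerate(shards):
    Path(f"shard_{i:03d}.txt").write_text("\n".join(str(p) for p in shard))
```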
@gokul-uf I guess that's a question for the torrent creators, not for me. I have no opinion on that.
@gokul-uf I'm not associated with the people who made this dataset, but I downloaded most of the images (for my purposes, 420 px on the small side is enough), and I shared the dataset via torrent just because I can. IMO, in the case of a torrent, one big archive is an acceptable solution.
@N01Z3 Arthur, thank you for your effort and for sharing it with everyone.
@N01Z3 Going to try to download it now. Ty :)
I had an idea related to this: after locally creating 65k subdirs for the dataset (0000 - ffff), I noticed that the ImageIDs are pretty evenly distributed (about 150 images under each). It might be reasonable to distribute the dataset as 65,536 .tar files (uncompressed, since JPEG compression is already better than gzip), binning the images by the first 4 characters of the ImageID.
Then there could be a downloader script (similar to youtube8m's download.py) that gets each of these tar files (~300 MB each) and checks its checksum. Also, rsync isn't bad at handling deltas in 300 MB files if the set of actual images changes in the future, so the update process could be automated as well.
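(A sketch of the binning scheme described above, grouping images into uncompressed tar files by the first 4 characters of the ImageID, taken here from the filename; the directory names are assumptions.)

```python
import tarfile
from collections import defaultdict
from pathlib import Path

IMAGE_DIR = Path("train_420")   # hypothetical flat directory of <ImageID>.jpg files
TAR_DIR = Path("tars")
TAR_DIR.mkdir(exist_ok=True)

# Bin images by the first 4 hex characters of the ImageID.
bins = defaultdict(list)
for path in IMAGE_DIR.glob("*.jpg"):
    bins[path.stem[:4]].append(path)

# One uncompressed tar per bin (the JPEGs are already compressed,
# so gzip on top would buy very little).
for prefix, paths in bins.items():
    with tarfile.open(TAR_DIR / f"{prefix}.tar", "w") as tar:
        for path in paths:
            tar.add(path, arcname=path.name)
```

A companion downloader could then fetch each ~300 MB tar and compare its SHA-256 (or MD5) against a published manifest before unpacking.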
It is quite hard to download the dataset at the moment. It would be nice if we could get a torrent up and running. I would be happy to seed after I have downloaded.
I have a quad-core i5 setup, which makes it hard to thread the download enough to make it fast. I have tried, and I hit the limit of the machine's CPU rather than the bandwidth. I would be happy to get this dataset down to use it for my master's thesis.
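(For what it's worth, if the bottleneck is CPU, typically the JPEG decode/resize rather than the network, a process pool tends to scale better than Python threads. A minimal sketch, assuming a hypothetical two-column ImageID,URL CSV and the requests/Pillow libraries; it is not the project's official downloader.)

```python
import csv
import io
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import requests
from PIL import Image

OUT_DIR = Path("train_420")     # hypothetical output directory
OUT_DIR.mkdir(exist_ok=True)

def fetch_and_resize(row):
    image_id, url = row
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        img = Image.open(io.BytesIO(resp.content)).convert("RGB")
        img.thumbnail((640, 480))   # CPU-heavy part, done inside the worker
        img.save(OUT_DIR / f"{image_id}.jpg", "JPEG")
        return True
    except Exception:
        return False

if __name__ == "__main__":
    with open("images.csv") as f:   # hypothetical ImageID,URL listing
        rows = list(csv.reader(f))
    # One process per core sidesteps the GIL for the decode/resize work.
    with ProcessPoolExecutor() as pool:
        ok = sum(pool.map(fetch_and_resize, rows, chunksize=64))
    print(f"downloaded {ok}/{len(rows)} images")
```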