Open efiop opened 2 years ago
It seems to me that we could try obtaining ImageNet for this use case. It's a de-facto standard dataset and can actually be used to fulfill both needs. The whole dataset contains around 14M images, and the most commonly used subset is around 1.3M samples. The license can be found here: https://image-net.org/download.php Seems to me that benchmarking would fall into the research category. I haven't requested access yet due to the 6th point of the license.
Yeah, a bit hesitant to use a third party dataset like that. We could generate it ourselves, I suppose. Ideally with something that would make verifying integrity easy (this is not necessarily useful for benchmarks, but in other tests).
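Since easy integrity verification was mentioned, here is a minimal sketch of one way that could look: generate files deterministically from a seed and keep a checksum manifest alongside them. All names (`generate_dataset`, `verify_dataset`, the file layout) are illustrative assumptions, not anything we have in the codebase.

```python
import hashlib
import random
from pathlib import Path

def generate_dataset(root, num_files, seed=0):
    """Generate `num_files` small binary files and return a checksum manifest.

    A fixed seed makes the dataset reproducible, so the manifest can be
    committed once and reused to verify integrity in tests.
    """
    rng = random.Random(seed)
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for i in range(num_files):
        data = rng.randbytes(1024)  # 1 KiB of pseudo-random content per file
        path = root / f"{i:06d}.bin"
        path.write_bytes(data)
        manifest[path.name] = hashlib.md5(data).hexdigest()
    return manifest

def verify_dataset(root, manifest):
    """Recompute each file's checksum and compare against the manifest."""
    root = Path(root)
    return all(
        hashlib.md5((root / name).read_bytes()).hexdigest() == digest
        for name, digest in manifest.items()
    )
```

Running the generator twice with the same seed yields byte-identical files, so any corruption (or accidental modification by a benchmark) is caught by `verify_dataset`.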
It is probably better to just use https://pypi.org/project/Faker/ to generate the biggest dataset and then have small/tiny/etc options based on it, as we do now.
EDIT: on closer inspection, it requires us to set certain parameters, which means we need to know what we are doing 😄 So maybe a real one is more reasonable, if we can settle the license stuff. At least using MNIST actually tells our users something, as they've probably used it at some point, so they have a pretty good understanding of how long it usually takes to do stuff with it.
https://storage.googleapis.com/openimages/web/download.html ?
I'm missing some context, but why not just generate X images with random pixels? Like:

```python
import numpy
from PIL import Image

NUM_IMAGES = 1_000_000  # range() needs an int, not 1e6 (a float)

for i in range(NUM_IMAGES):
    # Random RGB noise scaled to the 0-255 byte range
    array = numpy.random.rand(100, 100, 3) * 255
    img = Image.fromarray(array.astype(numpy.uint8))
    img.save(f"dataset/{i}.jpg")
```
For the record: added mnist (70K dataset), but some other bigger buzzwordy dataset would be nice in the future.
It would also be nice to have a dataset with big individual files.
E.g. a 1M-file dataset, and a 10M-file one (maybe more as well?), would be great to have.
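One practical note for datasets that size: many filesystems degrade badly with millions of entries in a single directory, so generating the files into hash-prefixed shards may be worth it. A hypothetical sketch (the `shard_path`/`generate_sharded` names and two-character shard scheme are assumptions, not an agreed design):

```python
import hashlib
from pathlib import Path

def shard_path(root, index):
    """Map a file index to a stable two-level path, e.g. 'a3/0000123.bin'.

    Hashing the index spreads files evenly across up to 256 shard
    directories, keeping each directory's entry count manageable.
    """
    shard = hashlib.md5(str(index).encode()).hexdigest()[:2]
    return Path(root) / shard / f"{index:07d}.bin"

def generate_sharded(root, num_files):
    """Generate `num_files` tiny placeholder files in a sharded layout."""
    for i in range(num_files):
        path = shard_path(root, i)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(i.to_bytes(4, "big"))  # tiny placeholder payload
```

The layout is deterministic, so the same index always resolves to the same path regardless of when the dataset was generated.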