iterative / dvc-bench

Benchmarks for DVC
http://bench.dvc.org/
Apache License 2.0

add bigger data sizes #306

Open efiop opened 2 years ago

efiop commented 2 years ago

E.g. a 1M-file dataset, and a 10M-file one (maybe more as well?), would be great to have.

pared commented 2 years ago

It seems to me that we could try obtaining ImageNet for this use case. It's a de facto standard dataset and can actually be used to fulfill both needs. The whole dataset contains around 14M images, and the most used subset is around 1.3M samples. The license can be found here: https://image-net.org/download.php. It seems to me that benchmarking would fall into the research category. I haven't yet requested access due to the 6th point of the license.

efiop commented 2 years ago

Yeah, I'm a bit hesitant to use a third-party dataset like that. We could generate it ourselves, I suppose, ideally with something that would make verifying integrity easy (this is not necessarily useful for benchmarks, but it is in other tests).
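
A minimal sketch of what "easy to verify" generation could look like, not a settled design: derive each file's content deterministically from its index, so any file can be regenerated (or its checksum recomputed) later to check integrity. The file count, naming, and output path here are assumptions.

import hashlib
from pathlib import Path

NUM_FILES = 1_000_000            # assumed target size
OUT_DIR = Path("data/big")       # hypothetical output location
OUT_DIR.mkdir(parents=True, exist_ok=True)

for i in range(NUM_FILES):
    # Content is a pure function of the index, so file i can always be
    # regenerated and compared against what is on disk.
    content = hashlib.sha256(str(i).encode()).hexdigest().encode()
    (OUT_DIR / f"{i:07d}.txt").write_bytes(content)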

efiop commented 2 years ago

It is probably better to just use https://pypi.org/project/Faker/ to generate the biggest dataset and then have small/tiny/etc options based on it, as we do now.

EDIT: on closer inspection, it requires us to set certain parameters, which means we need to know what we are doing 😄 So maybe a real dataset is more reasonable, if we can settle the license stuff. At least using MNIST actually tells our users something, as they've probably used it at some point, so they have a pretty good understanding of how long it usually takes to do stuff with it.
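
For reference, a hedged sketch of the Faker idea above, assuming plain text files are acceptable; seeding keeps the output reproducible, and the small/tiny variants could then be subsets of the generated files. The file count, naming, and paths are made up for illustration.

from pathlib import Path
from faker import Faker

Faker.seed(0)                          # reproducible output across runs
fake = Faker()

NUM_FILES = 1_000_000                  # "biggest" dataset; smaller ones are subsets
OUT_DIR = Path("data/faker-large")     # hypothetical layout
OUT_DIR.mkdir(parents=True, exist_ok=True)

for i in range(NUM_FILES):
    # Each file gets a short chunk of fake prose.
    (OUT_DIR / f"{i:07d}.txt").write_text(fake.text(max_nb_chars=500))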

daavoo commented 2 years ago

https://storage.googleapis.com/openimages/web/download.html ?

I'm missing some context, but why not just generate X images with random pixels? Like:

import os

import numpy as np
from PIL import Image

NUM_IMAGES = 1_000_000                     # 1e6 is a float; range() needs an int
os.makedirs("dataset", exist_ok=True)      # target directory must exist before saving
for i in range(NUM_IMAGES):
    # 100x100 RGB image with random pixel values
    array = np.random.rand(100, 100, 3) * 255
    img = Image.fromarray(array.astype(np.uint8))
    img.save(f"dataset/{i}.jpg")

efiop commented 2 years ago

For the record: added MNIST (a 70K-image dataset), but some other, bigger buzzwordy dataset would be nice in the future.

daavoo commented 2 years ago

It would also be nice to have a dataset with big individual files.
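
One possible sketch for that, streaming random bytes in chunks so memory stays flat while producing a few large files; the sizes, chunking, and paths here are arbitrary assumptions, not an agreed spec.

import os
from pathlib import Path

OUT_DIR = Path("data/large-files")   # hypothetical location
OUT_DIR.mkdir(parents=True, exist_ok=True)

FILE_SIZE = 5 * 1024 ** 3            # e.g. 5 GiB per file
CHUNK = 64 * 1024 ** 2               # write in 64 MiB chunks

for i in range(3):                   # a handful of big files
    with open(OUT_DIR / f"big_{i}.bin", "wb") as f:
        written = 0
        while written < FILE_SIZE:
            n = min(CHUNK, FILE_SIZE - written)
            f.write(os.urandom(n))
            written += n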