iterative/dataset-registry

Dataset registry DVC project

Add image directory version of Fashion-MNIST dataset #18

Closed: iesahin closed 3 years ago

iesahin commented 3 years ago

The structure of the Fashion-MNIST dataset is identical to that of MNIST.

We can use the same structure as in #17.
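
For reference, the layout I have in mind (assuming the same per-label directory structure as the MNIST version in #17; file names are illustrative):

fashion-mnist/images/
├── train/
│   ├── 0/            # one subdirectory per class label, 0-9
│   │   ├── 00001.png
│   │   └── ...
│   └── ...
└── test/
    └── ...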

iesahin commented 3 years ago

I created the image directory: the 70,000 individual PNG images take up 273 MB. A .tgz file that contains the images is 35 MB.

Are we OK with the increase in size, @shcheklein? If so, I'll go ahead and submit PRs for this and #17.

I'll check the other formats, but the increase is mostly related to the format overhead.

shcheklein commented 3 years ago

Sounds a bit too much for the Get Started repo. Is there a way to use a subset of it? And maybe show the performance on the full dataset at the end?

iesahin commented 3 years ago

Maybe I can include the "zipped directory" version, and create a stage to unzip this to data/. Using a subset seems to defeat the overall purpose of replacing the dataset.

shcheklein commented 3 years ago

Good point. So both solutions look suboptimal: a gz archive is unusual, hides details, and complicates the code, while the image dataset is too large.

Have you tried minifying the PNGs with tinypng or something similar, by chance?

iesahin commented 3 years ago

It's not that the individual files are large; they are around 350-400 bytes each. Even a BMP would take less than 2 KB per file. (Each image contains 784 bytes of pixel data plus format overhead.)

But each file takes up at least 4 KB on ext4, even if it contains a single byte.
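
A quick way to see this on disk (a sketch, assuming GNU coreutils; the data/fashion-mnist path is illustrative):

# Sum of file lengths vs. blocks actually allocated on disk:
du -sh --apparent-size data/fashion-mnist
du -sh data/fashion-mnist

# Per-file view: %s = size in bytes, %b = number of 512 B blocks allocated
stat -c '%s bytes, %b blocks' data/fashion-mnist/train/0/00001.png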

By tar.gz I mean a zipped version of these PNG files, not the .gz of the original IDX3 file. A fashion-mnist.tar.gz can be expanded into an image directory in a single stage, and the user can still see the images.
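
The stage itself could be as simple as this (a sketch using the dvc stage add syntax; all file names are illustrative):

# Define a stage that expands the archive into data/ so the user
# ends up with plain image files:
dvc stage add -n extract \
    -d fashion-mnist.tar.gz \
    -o data/fashion-mnist \
    tar -xzf fashion-mnist.tar.gz -C data/
dvc repro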

Also, the download overhead will probably add 5-10x more time for 70,000 individual files. I can test a dvc pull, but I doubt it will finish in under 5 minutes for 70,000 files.

[Screenshot: Screenshot_20210617-203956_Termux.jpg]

iesahin commented 3 years ago

For example, I converted the images to JPEG instead of PNG, and although the individual files are a bit larger, the resulting directory size from du -hs is 276 MB again.

[Screenshot: Screenshot_20210617-204709_Termux.jpg]

iesahin commented 3 years ago

And these are the results for BMP. The individual files are 1862 bytes each but the resulting directory size is the same.

[Screenshot: Screenshot_20210617-205454_Termux.jpg]

iesahin commented 3 years ago

And these are the sizes for tar (without gzip), which uses 512-byte blocks:

179,220,480 fashion-mnist-bmp.tar
106,158,080 fashion-mnist-jpg.tar
 90,040,320 fashion-mnist-png.tar
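
These numbers are consistent with tar's accounting: each member gets a 512-byte header, and its payload is padded up to a multiple of 512 bytes. A quick sanity check for the BMP case:

# 1862 B payload -> 4 data blocks of 512 B + 1 header block = 2560 B per file
echo $(( 70000 * (512 + 4 * 512) ))   # 179200000, within ~20 KB of the size above
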
shcheklein commented 3 years ago

I would still start even with 200 MB+ rather than making it an archive artificially. My thought process is: how often would DS/ML teams tar their datasets? It's not convenient. I know there are some specific formats for TF, etc. Would it be even better to start with them? But then we would still have to complicate the example.

iesahin commented 3 years ago

OK, I'll test with individual images and see how it fares.

iesahin commented 3 years ago

I'm dvc pushing the dataset to s3://dvc-public/remote/dataset-registry-test and it seems it will take 20-25 minutes. My upload speed is around 38 Mbps to the next hop (while uploading the set), so I assume the download would take 15-20 minutes as well.

I'll update this with the download speeds. 😄


Update:

It seems it's much worse than I expected: the download takes longer than the upload, around 30 minutes.

[Screenshot: Screen Shot 2021-06-22 at 13 25 03]

My local speeds (to the closest server) are something like:

[Screenshot: Screen Shot 2021-06-22 at 13 22 57]

I can run this test on AWS, Google Cloud, or Katacoda, but I doubt it will matter much. I'll post the exact time after the download finishes.

WDYT @shcheklein


Update:

dvc pull takes around 42 minutes. Even the checkout process took around 40 seconds.

iesahin commented 3 years ago

I've pushed it to my clone.

You can test the download with

git clone git@github.com:iesahin/dataset-registry
cd dataset-registry
dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry-test
dvc pull fashion-mnist/images

@shcheklein

iesahin commented 3 years ago

I also tested this on my VPS, and getting the dataset over HTTPS seems to take about 1 hour.

shcheklein commented 3 years ago

Could you run it with -j 100 or -j 10?

It feels like this is related to some known performance issues; we need to confirm and address them. As far as I understand, the download should take on the order of minutes.
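
For reference, the flag goes directly on the pull, e.g. with the same target as in the earlier snippet:

dvc pull -j 100 fashion-mnist/images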

iesahin commented 3 years ago

With -j 100, it seems to take about 20 minutes. @shcheklein

iesahin commented 3 years ago

I did not want to pollute the dataset-registry remote, so I'm using a separate one for testing. I can push the files to the remote and merge these changes for easier testing. I can also put up a zipped version, and we can use any of these. The core team can use the registry for the test.

shcheklein commented 3 years ago

It's fine to "pollute" the data registry; there should be no harm in that. It's even good to have it: we can use it as a test case and optimize it for our needs.

It's annoying that we need to use -j to speed it up, though, and 20 minutes for 200 MB is also quite suboptimal. Let's do this and create a ticket on the DVC repo to look into it.

For now, are there any other, simpler datasets that we could use for the Get Started purposes?

iesahin commented 3 years ago

MNIST and Fashion-MNIST are very small datasets, to the point of being toy datasets. As a comparison, the VGG Face dataset (2015) contained around 2,600,000 images, and even that is small by today's standards.

DVC may be improved, but there is an inherent latency when you make multiple requests. Each download is only about 500 bytes, but the required TCP handshake, key exchange, etc. take time, and we set up that connection 70,000 times.
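
A rough back-of-envelope, assuming ~100 ms per connection setup (TCP plus TLS handshakes; the exact figure depends on the distance to the region):

# 70,000 files x ~0.1 s setup = ~7,000 s (~2 hours) if fully serial
# with 10 jobs in flight      =   ~700 s (~12 minutes)
# with 100 jobs in flight     =    ~70 s (~1 minute), ignoring bandwidth and S3 limits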

iesahin commented 3 years ago

I think I first need to browse the code a bit before asking for improvements. There may be easier ways to improve the networking performance, like reusing the same Session in requests; this may be orthogonal to multiple jobs. (Currently, I don't even know if DVC uses requests for HTTPS 😃)
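
The effect is easy to demonstrate from the shell: curl, for instance, reuses one connection for all URLs given in a single invocation, which is the same amortization a shared Session would provide (the URL below is hypothetical):

# One TCP/TLS handshake for 100 sequential downloads over the same connection:
curl -sO 'https://example.com/fashion-mnist/img[0-99].png'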

iesahin commented 3 years ago

Yes, it looks like DVC creates a new Session object for each download. But it also looks like requests has no way to use HTTP pipelining. So let's keep this brief and continue in a core ticket.