Closed · iesahin closed this issue 3 years ago
I created the image directory and 70000 individual PNG images take up 273MB. A .tgz file that contains the images is 35MB.
Are we OK with the increase in size? @shcheklein If so, I'll go ahead and submit PRs for this and #17.
I'll check the other formats, but the increase is mostly due to per-file format overhead.
Sounds a bit too much for the get started repo. Is there a way to use a subset of it? And maybe at the end show the performance on the large dataset?
Maybe I can include the "zipped directory" version and create a stage to unzip it to `data/`. Using a subset seems to defeat the overall purpose of replacing the dataset.
Good point. So both solutions look suboptimal: gz is not very usual, hides details, and complicates the code... and the images dataset is too large.
Have you tried to minify pngs with tinypng or something by chance?
It's not that the individual files are large; they are around 350-400 bytes each. Even a BMP would take less than 2 KB per file. (Each image contains 784 bytes of data plus format overhead.)
But each file takes at least 4 KB on ext4, even if the file is 1 byte.
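That allocation overhead is easy to see with `os.stat` (a small sketch; `st_blocks` is always reported in 512-byte units, independent of the filesystem's actual block size):

```python
import os
import tempfile

def sizes(path):
    """Return (apparent size, space actually allocated on disk)."""
    st = os.stat(path)
    # st_blocks counts 512-byte units the filesystem has allocated;
    # ext4 hands out whole 4 KiB blocks, so even a 1-byte file
    # occupies 4096 bytes on disk
    return st.st_size, st.st_blocks * 512

# A 1-byte file still occupies a whole filesystem block:
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x")
    f.flush()
    os.fsync(f.fileno())  # force allocation before stat-ing
print(sizes(f.name))  # e.g. (1, 4096) on ext4
os.unlink(f.name)
```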
By `tar.gz` I mean the zipped version of these PNG files, not the `.gz` file of the original `.IDX3` file. A `fashion-mnist.tar.gz` file can be expanded to an image directory in a single stage, and the user can still see the images.
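For reference, that "expand in a single stage" idea could look something like this in `dvc.yaml` (a sketch only; the stage name and paths are assumptions, not something from this thread):

```yaml
stages:
  extract-images:
    # data/ must be recreated because DVC removes outs before running
    cmd: mkdir -p data && tar -xzf fashion-mnist.tar.gz -C data
    deps:
      - fashion-mnist.tar.gz
    outs:
      - data
```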
Also, the download overhead will probably take 5-10x more time for 70000 individual files. I can test a `dvc pull`, but I doubt it will finish in under 5 minutes for 70000 files.
For example, I converted the images to JPEG instead of PNG, and although the individual files are a bit larger, the resulting directory size reported by `du -hs` is 276 MB again.
And these are the results for BMP: the individual files are 1862 bytes each, but the resulting directory size is the same.
And these are the sizes for `tar` (without gzip), which uses 512-byte blocks:

```
179,220,480 fashion-mnist-bmp.tar
106,158,080 fashion-mnist-jpg.tar
 90,040,320 fashion-mnist-png.tar
```
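The 512-byte block accounting can be sketched with a rough model (my own hypothetical helper, assuming one ustar header block per file, data padded to 512 bytes, and GNU tar's default 10 KiB record padding):

```python
import math

BLOCK = 512  # tar stores each member as a 512 B header + 512 B-padded data

def tar_size_estimate(n_files, avg_file_size, record_size=10240):
    """Rough uncompressed-tar size: header block + padded data per file,
    two zero blocks as end-of-archive, padded to the record size."""
    per_file = BLOCK + math.ceil(avg_file_size / BLOCK) * BLOCK
    total = n_files * per_file + 2 * BLOCK
    return math.ceil(total / record_size) * record_size

print(tar_size_estimate(1, 1))          # 10240 (a 1-byte file -> 10 KiB tar)
print(tar_size_estimate(70000, 400))    # ~72 MB
```

For 70000 files of ~400 bytes this predicts roughly 72 MB, in the same ballpark as the observed 90 MB for the PNG tar; the gap is plausibly the spread of actual PNG sizes and directory-entry headers.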
I would still start even with 200 MB+ rather than making it an archive artificially. My thought process is: how often would DS/ML teams tar their datasets? It's not convenient. I know there are some specific formats for TF, etc. Would it be even better to start with them? But then we would still have to complicate the example.
OK, I'll test with individual images and see how it fares.
I'm `dvc push`ing the dataset to `s3://dvc-public/remote/dataset-registry-test` and it seems it will take 20-25 minutes. My upload speed to the next hop is around 38 Mbps (while uploading the set), so I assume the download would take 15-20 minutes as well.
I'll update this with the download speeds. 😄
Update:
It seems it's much worse than I expected: the download takes longer than the upload, around ~30 minutes.
My local speeds (to the closest server) are something like:
I can do this test in AWS, Google Cloud, or Katacoda, but I doubt it will matter much. I'll post the exact time after the download finishes.
WDYT @shcheklein
Update:
`dvc pull` takes around 42 minutes. Even the checkout process took around 40 seconds.
I've pushed it to my clone.
You can test the download with:

```
git clone git@github.com:iesahin/dataset-registry
cd dataset-registry
dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry-test
dvc pull fashion-mnist/images
```
@shcheklein
I also tested this on my VPS, and getting the dataset over HTTPS seems to take about 1 hour.
Could you run it with `-j 100` or `-j 10`?
It feels like it's related to some known performance issues; I need to confirm and address that. As far as I understand, the download should take on the order of minutes.
With `-j 100` it seems to take about 20 minutes. @shcheklein
I did not want to pollute the `dataset-registry` remote, so I'm using another one for testing. I can push the files to the remote and merge these changes for easier testing. I can also put up a zipped version and we can use either of these. The core team can use the registry for the test.
It's fine to "pollute" the data registry; there should be no harm in that. It's even good to have it: we can use it as a test and optimize it to work for our needs.
It's annoying that we need to use `-j` to speed it up, though, and 20 minutes for 200 MB is also quite suboptimal. Let's do this and create a ticket on the DVC repo to look into the `-j` defaults (detect if we are breaking something and decrease automatically?).

For now, are there any other/simpler datasets that we could use for the get started purposes?
MNIST and Fashion-MNIST are very small datasets, to the level of being toy datasets. As a comparison, the VGG Face dataset (2015) contained around 2,600,000 images, and even that is small by today's standards.
DVC may be improved, but there is an inherent latency when you make many requests. Each download is about 500 bytes, but the required TCP handshake, key exchange, etc. take time, and we make that connection 70,000 times.
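Back-of-the-envelope, the per-request overhead dominates the payload time entirely. A toy model (the overhead figure below is an assumption for illustration, not a measurement):

```python
def transfer_minutes(n_files, per_request_s, jobs):
    # Toy model: total time is dominated by per-request overhead
    # (TCP + TLS handshake, request/response round trip), spread
    # evenly across `jobs` parallel workers; the ~500 B payload
    # transfer time is ignored as negligible.
    return n_files * per_request_s / jobs / 60

# With an assumed 0.25 s of overhead per request:
print(transfer_minutes(70000, 0.25, 4))    # ~73 min with 4 jobs
print(transfer_minutes(70000, 0.25, 100))  # ~2.9 min with -j 100
```

The real per-request cost is evidently higher than this assumption, since `-j 100` still took ~20 minutes, but the model shows why the file count, not the 200 MB total, drives the transfer time.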
I think I first need to browse the code a bit before asking for improvements. There may be easier ways to improve the networking performance by reusing the same `Session` in `requests`; this may be orthogonal to multiple jobs. (Currently, I don't even know if DVC uses `requests` for HTTPS 😃)
Yes, it looks like DVC creates a new `Session` object for each download. But it also looks like `requests` has no way to use HTTP pipelining. So let's keep this brief and continue in a core ticket.
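The connection-reuse idea, independent of any particular library, can be sketched with the stdlib (hypothetical host and paths; this is an illustration of HTTP keep-alive, not DVC's actual downloader):

```python
import http.client

def fetch_all(host, paths):
    """Download many small files over ONE persistent HTTPS connection.

    A single TCP + TLS handshake is paid once and the connection is
    reused for every request (HTTP keep-alive), instead of paying the
    handshake cost for each of the 70,000 tiny files.
    """
    conn = http.client.HTTPSConnection(host)
    bodies = []
    for path in paths:
        conn.request("GET", path)
        resp = conn.getresponse()
        # the response must be fully read before the connection
        # can be reused for the next request
        bodies.append(resp.read())
    conn.close()
    return bodies
```

A reused `requests.Session` achieves the same effect through urllib3's connection pooling; pipelining (multiple in-flight requests on one connection) is a separate feature that `requests` indeed does not offer.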
The structure of the Fashion-MNIST dataset is identical to MNIST.
We can use the same structure in #17.