iterative / dvc

šŸ¦‰ Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

directory downloads and uploads are slow #6222

Open iesahin opened 3 years ago

iesahin commented 3 years ago

Bug Report

Description

We have added a directory containing 70,000 small images to the Dataset Registry. There is also a tar.gz version of the dataset, which downloads quickly:

time dvc get https://github.com/iterative/dataset-registry mnist/images.tar.gz
dvc get https://github.com/iterative/dataset-registry mnist/images.tar.gz  3.41s user 1.36s system 45% cpu 10.411 total

When I issue:

dvc get https://github.com/iterative/dataset-registry mnist/images

[screenshot: dvc get progress showing ~16 hour ETA]

I get an ETA of ~16 hours for 70,000 downloads on my VPS.

This is reduced to ~3 hours on my faster local machine.

[screenshot: dvc get progress on the faster local machine]

I didn't wait for these to finish, so the real times may differ, but you get the idea.

With -j 10 it doesn't differ much:

[screenshot: dvc get -j 10 progress]

dvc pull is better; it takes about 20-25 minutes.

[screenshot: dvc pull progress]

(At this point, while I was writing, a new version was released, so the rest of the report uses 2.4.1 😄)

dvc pull -j 100 seems to reduce the ETA to 10 minutes.

[screenshot: dvc pull -j 100 progress]

(I waited for dvc pull -j 100 to finish and it took ~15 minutes.)

I also had this issue while uploading the data in iterative/dataset-registry#18 and we have a discussion there.

Reproduce

git clone https://github.com/iterative/dataset-registry
cd dataset-registry
dvc pull mnist/images.dvc

or

dvc get https://github.com/iterative/dataset-registry mnist/images

Expected

We will use this dataset (and the similar fashion-mnist) in example repositories, so we would like the whole directory to download in an acceptable time (<2 minutes).

Environment information

Output of dvc doctor:

Parts of this report were produced with 2.3.0, but currently:

$ dvc doctor
DVC version: 2.4.1 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-5.4.0-74-generic-x86_64-with-glibc2.29
Supports: azure, gdrive, gs, hdfs, webhdfs, http, https, s3, ssh, oss

Discussion

DVC uses new requests.Session objects for its connections, and this requires a new HTTP(S) connection for each file. Although the files are small, establishing a new connection for each one takes time.

There is a mechanism in HTTP/1.1 (pipelining) to send multiple requests over the same connection, but requests doesn't support it.

Note that increasing the number of jobs doesn't make much difference, because servers usually limit the number of connections per IP. Even if you have 100 threads/processes downloading, probably only a small number (~4-8) of them can be connected at a time. (I was banned from AWS once while testing the commands with a large -j.)
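For illustration, a rough micro-benchmark of that per-connection cost (a sketch only; the URL is a placeholder, not the actual remote): a fresh requests.Session per file pays a new TCP/TLS handshake each time, while a shared Session reuses a pooled keep-alive connection.

import time
import requests

URL = "https://example.com/"  # placeholder standing in for one small file
N = 20
start = time.perf_counter()
for _ in range(N):
    requests.get(URL)  # new Session, new TCP/TLS connection every time
print(f"fresh connection per file: {time.perf_counter() - start:.2f}s")
start = time.perf_counter()
with requests.Session() as session:
    for _ in range(N):
        session.get(URL)  # pooled keep-alive connection is reused
print(f"reused connection: {time.perf_counter() - start:.2f}s")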

There may be 2 solutions for this:

  • Support HTTP/1.1 pipelining (or otherwise reuse connections), so that many small files don't each pay for a new connection.

  • DVC can consider directories as implicit tar archives. Instead of a directory containing many files, it works with a single tar file per directory in the cache and expands them in checkout.

pmrowla commented 3 years ago

DVC push/pull performance for many small files is a known limitation. But a part of the issue is probably also specific to the HTTP remote type, it's possible you would get better performance pulling from the registry using S3 rather than HTTP (due to remote fs implementation differences)

For the purpose of the example projects, it may be better to just handle compressing/extracting images as an archive within the example pipelines.


  • DVC can consider directories as implicit tar archives. Instead of a directory containing many files, it works with a single tar file per directory in the cache and expands them in checkout.

This does not really work as a general solution for DVC, as we would lose the ability to de-duplicate files between directory versions.
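For concreteness, a minimal sketch of what the implicit-tar idea could look like (hypothetical helpers, not DVC's API): the cache holds one tar object per directory version and expands it on checkout, which is exactly where de-duplication between versions is lost.

import tarfile
from pathlib import Path

def pack(directory: Path, cache_tar: Path) -> None:
    # One cache object per directory version.
    with tarfile.open(cache_tar, "w") as tar:
        tar.add(directory, arcname=".")

def checkout(cache_tar: Path, directory: Path) -> None:
    # Expand the single object back into a working directory; identical
    # files shared between two versions are stored twice, so dedup is lost.
    directory.mkdir(parents=True, exist_ok=True)
    with tarfile.open(cache_tar) as tar:
        tar.extractall(directory)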

pmrowla commented 3 years ago

This is also an issue that can be addressed by future cache structure changes (https://github.com/iterative/dvc/issues/829). Ideally with those kinds of cache changes we will be pushing and pulling a small # of large packed objects rather than a large # of small loose objects (so the end result is similar to the difference you see now between downloading a single tar vs 70k individual images)

iesahin commented 3 years ago

Thank you @pmrowla

But a part of the issue is probably also specific to the HTTP remote type, it's possible you would get better performance pulling from the registry using S3 rather than HTTP (due to remote fs implementation differences)

Sure. I researched a bit but couldn't find a definite answer: does boto3 reuse HTTP connections? In my tests, dvc pull with an S3 URL is faster, but not that much faster.

This does not really work as a general solution for DVC, as we would lose the ability to de-duplicate files between directory versions.

Having an option to track immutable data directories as single tar files might have some benefit. I agree, though, that this benefit may not justify the required architecture changes.

This issue of multiple small objects in a directory needs to be resolved for #829 as well. Otherwise the cache itself will be blobs of large objects, and that will probably bring some other problems.

shcheklein commented 3 years ago

@pmrowla what about @iesahin's suggestions:

  • use more threads by default - why don't we do this?

  • check if we indeed recreate connections (we don't use a connection pool?)

  • HTTP pipelining - not that familiar with this one

WDYT?


From my experience with S3, small objects, and DVC: it can be fast (s5cmd) even on small objects; boto3 seems to be quite suboptimal.

I'm not saying we should not migrate to the new cache structure, but it feels like we could significantly improve things even before that. And that migration might take a long time.


For the purpose of the example projects, it may be better to just handle compressing/extracting images as an archive within the example pipelines.

My take on this - by doing this, we are pretty much saying in the Get Started that we don't handle even small datasets. Compressing all the images looks quite artificial, and people will ask why we are doing it. It also complicates the document.

pmrowla commented 3 years ago

use more threads by default - why don't we do this?

When we used the old "# of cores x 2" method (without a hard cap on the default) for setting the number of threads, it caused problems for users running on clusters with very high core counts.

check if we indeed recreate connections (we don't use a connection pool?)

We only use our own connection pooling for specific remote types (HDFS and SSH), since pyarrow and paramiko don't provide their own.

DVC uses new requests.Session objects for its connections, and this requires a new HTTP(S) connection for each file. Although the files are small, establishing a new connection for each one takes time.

DVC does not create new session objects for each file being downloaded. See https://github.com/iterative/dvc/blob/4e792ae61c5927ab2e5f6a6914d985d43aa705b4/dvc/fs/http.py#L72: _session is a cached property and is only created once per filesystem instance.
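As a minimal sketch of that pattern (illustrative, not DVC's actual class): the session is built lazily on first access and cached on the instance, so every transfer reuses it.

from functools import cached_property
import requests

class HTTPFileSystemSketch:
    @cached_property
    def _session(self):
        # Created once on first access, then cached on the instance.
        return requests.Session()

    def download(self, url):
        # Every call reuses the same pooled Session.
        return self._session.get(url).content

fs = HTTPFileSystemSketch()
assert fs._session is fs._session  # one Session per filesystem instance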

HTTP pipelining - not that familiar with this one

Regarding HTTP pipelining, do the S3 web servers even support it properly? HTTP/1.1 pipelining isn't supported in most modern browsers due to the general lack of adoption and/or incorrect implementations in servers.

This is very old: https://stackoverflow.com/questions/7752802/does-s3-support-http-pipelining but I can't find anything newer, which seems to indicate that the answer is probably still "no".


In general, my understanding is that with the move to using async instead of Python thread pools (and aiobotocore), we should be able to start getting throughputs that are much faster than native aws-cli/boto. @isidentical should have some more insight on that.

isidentical commented 3 years ago

In general, my understanding is that with the move to using async instead of Python thread pools (and aiobotocore), we should be able to start getting throughputs that are much faster than native aws-cli/boto. @isidentical should have some more insight on that.

For the other async operations, yes, that is true. But the problem with a lot of small files vs. a single big file is that the relative significance of the processing stage (e.g., parsing the API response for each object) compared to the actual data transport is much higher for small files, and those steps are blocking for async. One possible (and kind of tricky) route would be creating a process pool and attaching its own event loop to each process (I believe there are still some places in DVC and fsspec itself that are not compatible with this); that would scale these blocking operations to the number of cores and, when used with async, should maximize throughput. (BTW, we still use regular boto for push/pull.)
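A hedged sketch of that route (hypothetical names; real transfer code would use aiobotocore/aiohttp instead of the sleep placeholder): a process pool where each worker runs its own event loop, so the blocking per-object processing scales with cores.

import asyncio
from concurrent.futures import ProcessPoolExecutor

async def _download_batch(urls):
    # Placeholder for real async transfers; per-object processing would
    # otherwise block a single shared event loop.
    await asyncio.gather(*(asyncio.sleep(0) for _ in urls))
    return len(urls)

def _worker(urls):
    # asyncio.run() gives each process its own event loop.
    return asyncio.run(_download_batch(urls))

def parallel_download(urls, n_procs=4):
    # Split the file list into one batch per process.
    batches = [urls[i::n_procs] for i in range(n_procs)]
    with ProcessPoolExecutor(max_workers=n_procs) as pool:
        return sum(pool.map(_worker, batches))

if __name__ == "__main__":
    print(parallel_download([f"file-{i}" for i in range(1000)]))  # -> 1000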

skshetry commented 3 years ago

Regarding pipelining, as @pmrowla said, it's not very well supported, and users will run into issues with proxies as well:

buggy proxies, marginal improvements, HOL blocking ...

It's considered a design mistake and has been superseded by multiplexing in HTTP/2, though we won't see much benefit unless we go async + HTTP/2, which we should (HTTPFileSystem uses aiohttp, which does not support HTTP/2).

iesahin commented 3 years ago

Is there any data regarding the adoption rate of HTTP/2 vs HTTP/1.1 pipelining?

pmrowla commented 3 years ago

S3 does not directly support HTTP/2 unless you are serving it behind CloudFront (or your own proxy server).


Also, just to be clear, HTTP/1.1 pipelining is not the same thing as reusing HTTP connections via the Keep-Alive header. requests.Session instances (and DVC) already use connection pools and Keep-Alive for the connections in those pools by default.
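For reference, this is how that pooling is configured in requests (standard requests API; the URL is a placeholder): an HTTPAdapter holds a urllib3 pool of keep-alive connections per host, and the Session reuses them.

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# pool_connections/pool_maxsize bound how many keep-alive connections
# the underlying urllib3 pool keeps per host.
session.mount("https://", HTTPAdapter(pool_connections=10, pool_maxsize=10))
for _ in range(3):
    # Same host, so these reuse a pooled connection instead of reconnecting.
    print(session.get("https://example.com/").status_code)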

dberenbaum commented 3 years ago

Are we planning to migrate http to fsspec and/or optimize it?

It's considered a design mistake and has been superseded by multiplexing in HTTP/2, though we won't see much benefit unless we go async + HTTP/2, which we should (HTTPFileSystem uses aiohttp, which does not support HTTP/2).

Do you mean fsspec HTTPFileSystem? I don't see it in the current dvc implementation unless I'm missing something.

It seems like we should be able to fairly easily migrate to fsspec/aiohttp at this point? Can we see if it's an improvement before worrying about pipelining or multiplexing?

iesahin commented 3 years ago

I can test HTTP pipelining with S3 using a script, if you'd like. If I can achieve a faster download with a single thread, we can talk about implementing it in the core.

skshetry commented 3 years ago

Are we planning to migrate http to fsspec and/or optimize it?

I think moving to fsspec's HTTPFileSystem is the plan. Right now, I think it's not possible to upload with it, and there are some performance concerns that we could contribute fixes for.

It seems like we should be able to fairly easily migrate to fsspec/aiohttp at this point? Can we see if it's an improvement before worrying about pipelining or multiplexing?

I don't think it'll make much difference unless we go async.

dberenbaum commented 3 years ago

We discussed this in planning, and there are a few different items to address in this issue:

  1. Difference between get and pull - this will hopefully be addressed soon.

  2. Optimization of http - for now, it seems like performance of http is not too different from s3, so probably not worth pursuing this now.

  3. Optimization of download operations for directories -

I did a dirty performance comparison:

dvc pull -j 10 took 22 minutes for ~70k files. s5cmd --numworkers=10 cp "s3://dvc-public/remote/dataset-registry/*" download took 21 minutes for ~140k files.

So my rough estimate is that dvc pull is about 2x slower than s5cmd here. It's worth looking into why and hopefully optimizing further, but for now we shouldn't expect to get this under ~10 minutes, at best.

  4. Ensure the setup for the Get Started project only takes a couple of minutes - @iesahin we can discuss this separately in dvc.org.

shcheklein commented 3 years ago

Excellent summary, thanks guys! Thanks, Dave!

shcheklein commented 3 years ago

One question though - why don't we increase the number of workers by default? I think we should be a bit more aggressive.

Also, one more experiment - what happens with -j 100 and --numworkers=100? Just to see how it scales. Also, how much of your network bandwidth do they take?

efiop commented 3 years ago

One question though - why don't we increase the number of workers by default? I think we should be a bit more aggressive.

It is too dangerous to just set it to something bigger without a proper throttling mechanism, or we will start getting timeout errors or "too many files" errors. Throttling should be handled by each filesystem's put/get methods, and we are currently working on migrating towards those, with amazing changes from @isidentical in fsspec and from @pmrowla in odb transfer.
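As a sketch of what per-filesystem throttling could look like (an assumption about the approach, not the actual implementation): an asyncio.Semaphore caps in-flight transfers no matter how many jobs are requested.

import asyncio

async def _get_file(i, sem):
    async with sem:  # at most `limit` transfers are in flight at once
        await asyncio.sleep(0.01)  # stand-in for a real async download
        return i

async def get_many(n, limit=16):
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(*(_get_file(i, sem) for i in range(n)))

print(len(asyncio.run(get_many(100))))  # -> 100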

iesahin commented 3 years ago

I don't know the case for S3 (probably the limit is much higher, or it's only limited by bandwidth), but the HTTP/1.1 protocol doesn't allow more than 2 connections per client to a server.

Most servers set this to 4 or a bit higher, but I think the limitation is not the number of jobs or workers but the server's limits. IMO DVC should be conservative by default; a first experience with dvc pull in a modest environment shouldn't choke the machine.

skshetry commented 3 years ago

I don't know the case for S3 (probably the limit is much higher, or it's only limited by bandwidth), but the HTTP/1.1 protocol doesn't allow more than 2 connections per client to a server.

I think that's obsoleted by rfc7230.

Previous revisions of HTTP gave a specific number of connections as a ceiling, but this was found to be impractical for many applications. As a result, this specification does not mandate a particular maximum number of connections but, instead, encourages clients to be conservative when opening multiple connections.

shcheklein commented 3 years ago

Yep, it would be really great to have some dynamic workload balancing/throttling/etc. But even without it we could use max(15, current_value), WDYT?

dberenbaum commented 3 years ago

Also, one more experiment - what happens with -j 100 and --numworkers=100? Just to see how it scales. Also, how much of your network bandwidth do they take?

TL;DR: It looks like the multiple processes launched by s5cmd do improve performance, whereas the additional threads in dvc don't seem to help much.

s5cmd --numworkers 100 finishes in about 5 minutes for me on multiple tries. Here's a random snapshot of cpu and network activity while files were downloading:

[screenshots: s5cmd CPU and network activity]

dvc pull -j 100 varied more with multiple attempts but took up to 20 minutes. Here are cpu and network activity snapshots:

[screenshots: dvc pull CPU and network activity]

iesahin commented 3 years ago

If there is an easy way to measure the total throughput in dvc fetch, an adaptive algorithm could be used to set the number of jobs (see the sketch below).

It is environment dependent, but I doubt the number of jobs would go above 10 in most cases.
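A toy sketch of that adaptive idea (hypothetical helper; measure_throughput would probe a short transfer window): keep doubling the job count while measured throughput still improves meaningfully.

def adapt_jobs(measure_throughput, start=4, max_jobs=32):
    # measure_throughput(jobs) -> observed bytes/sec for a short probe.
    jobs, best = start, measure_throughput(start)
    while jobs < max_jobs:
        candidate = jobs * 2
        rate = measure_throughput(candidate)
        if rate < best * 1.1:  # <10% gain: extra jobs no longer help
            break
        jobs, best = candidate, rate
    return jobs

# Toy model where throughput saturates at ~8 concurrent jobs.
print(adapt_jobs(lambda j: min(j, 8) * 1_000_000))  # -> 8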

iesahin commented 3 years ago

I think that's obsoleted by rfc7230.

Ah, thank you @skshetry. It's still a low number, though.

iesahin commented 3 years ago

My comparison between dvc pull and s5cmd is similar to Dave's, but shows more like a 3x difference.

These are with 70,000 identical files.

The default --numworkers for s5cmd is 256; in the experiment above it was set to 10.

dberenbaum commented 2 years ago

Removing the p1 label here because this is clearly not going to be closed in the next sprint or so, which should be the goal for a p1. However, it's still a high priority that we will continue working to improve.

series8217 commented 1 year ago

Are there plans to address this soon? We have run into scalability issues with DVC due to the poor performance on data sets with a large number of small files.

skshetry commented 1 year ago

@series8217, what remote do you use? What is the size of datasets (no. of files, total size, etc)? Thanks.

series8217 commented 1 year ago

@series8217, what remote do you use? What is the size of datasets (no. of files, total size, etc)? Thanks.

s3 remote with a local cache on a reflink-supported filesystem; ~3 million tracked files, 3 TB total. When tracked across 1000 .dvc files, dvc pull takes 3 hours even with the working copy, local cache, and remote already consistent (i.e., nothing to pull beyond comparing checksums). If I use fewer .dvc files to represent the same 3 million tracked files, it gets faster: roughly half the time with 20 .dvc files, even though they're tracking the same total number of files.

I suspect the checksumming step always takes the same amount of time, but the s3 download of the .dvc files is where the extra hour and a half comes from. Given that the .dvc files are only a few hundred bytes each, this should take only a few minutes.

matthieu-avivacredito commented 1 year ago

  • DVC can consider directories as implicit tar archives. Instead of a directory containing many files, it works with a single tar file per directory in the cache and expands them in checkout.

This does not really work as a general solution for DVC, as we would lose the ability to de-duplicate files between directory versions.

Could it be given as an option? In the same way that you can specify "cache"/"push"/... on output entries, could there be an option "granular: false" (the default would be "granular: true", which corresponds to the current behavior)? With the non-granular option, the directory would be considered an implicit tar; you would invalidate/download/upload all of it at once.

Using DVC, there are some cases where we do like the granularity of DVC for directories. But other cases where, normally, either all files have changed or none have changed (e.g. the output of our segmentation networks on our benchmark dataset). We handle this by asking DVC to track archives of our non-granular directories, and directly the directory for our granular directories. We did get a big performance boost, but it adds a small friction to navigate in the directory. Such an option would be great in our use case.