iesahin opened this issue 3 years ago
DVC push/pull performance for many small files is a known limitation. But a part of the issue is probably also specific to the HTTP remote type, it's possible you would get better performance pulling from the registry using S3 rather than HTTP (due to remote fs implementation differences)
For the purpose of the example projects, it may be better to just handle compressing/extracting images as an archive within the example pipelines.
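For illustration, a minimal sketch of what such archive handling could look like inside an example pipeline stage, using only the Python standard library (the paths are hypothetical, not the registry's actual layout):

```python
import tarfile
from pathlib import Path


def pack_images(src_dir: str, archive_path: str) -> None:
    """Pack a directory of small images into one tar.gz so the remote
    stores a single object instead of tens of thousands of files."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=Path(src_dir).name)


def unpack_images(archive_path: str, dest_dir: str) -> None:
    """Extract the archive in a later pipeline stage before training."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)


if __name__ == "__main__":
    # Hypothetical paths; in a real pipeline these would be stage deps/outs.
    pack_images("data/images", "data/images.tar.gz")
```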
- DVC can consider directories as implicit `tar` archives. Instead of a directory containing many files, it works with a single tar file per directory in the cache and expands them in `checkout`.
This does not really work as a general solution for DVC, as we would lose the ability to de-duplicate files between directory versions.
This is also an issue that can be addressed by future cache structure changes (https://github.com/iterative/dvc/issues/829). Ideally with those kinds of cache changes we will be pushing and pulling a small # of large packed objects rather than a large # of small loose objects (so the end result is similar to the difference you see now between downloading a single tar vs 70k individual images)
Thank you @pmrowla
But a part of the issue is probably also specific to the HTTP remote type, it's possible you would get better performance pulling from the registry using S3 rather than HTTP (due to remote fs implementation differences)
Sure. I researched a bit but couldn't find a definite answer: does boto3 reuse HTTP connections? In my tests, `dvc pull` with an S3 URL is faster, but not by much.
This does not really work as a general solution for DVC, as we would lose the ability to de-duplicate files between directory versions.
Having an option to track immutable data directories as single tar files might have some benefit. I agree, though, that this benefit may not justify the required architecture changes.
This multiple-small-objects-in-a-directory issue needs to be resolved for #829 as well. Otherwise the cache itself will just be blobs of large objects, and that will probably bring some other problems.
@pmrowla what about @iesahin's suggestions:
WDYT?
My experience with S3, small objects, and DVC: it seems it can be fast (s5cmd) even with small objects; boto3 seems to be quite suboptimal.
Not saying that we should not be migrating to the new cache structure, but it feels that we could significantly improve it even before that. And that migration might take a long time.
For the purpose of the example projects, it may be better to just handle compressing/extracting images as an archive within the example pipelines.
My take on this - by doing this, we are pretty much saying in the Get Started that we don't handle even small datasets. Compressing all images looks quite artificial and people will be asking why we are doing this. It also complicates the document.
use more threads by default - why don't we do this?
When we used the old "# of cores x 2" method (without a hard cap on the default) for setting the number of threads, it caused problems for users running on clusters with very high core counts.
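For illustration, a hedged sketch of the difference between an uncapped cores-based default and a capped one (the cap value here is purely illustrative, not DVC's actual constant):

```python
import os


def default_jobs_uncapped() -> int:
    # Old-style heuristic: 2 x number of cores, which explodes on
    # cluster nodes with a very high core count.
    return (os.cpu_count() or 1) * 2


def default_jobs_capped(cap: int = 10) -> int:
    # Capped variant: still scales with cores but never exceeds a hard
    # limit, so cluster nodes don't open hundreds of connections.
    return min((os.cpu_count() or 1) * 2, cap)
```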
check if we indeed recreate connections (we don't use connections pool?)
We only use our own connection pooling for specific remote types (HDFS and SSH), since pyarrow and paramiko don't provide their own.
DVC uses new `requests.Session` objects for its connections, and this requires a new HTTP(S) connection for each file. Although the files are small, establishing a new connection for each file takes time.
DVC does not create new session objects for each file being downloaded.
https://github.com/iterative/dvc/blob/4e792ae61c5927ab2e5f6a6914d985d43aa705b4/dvc/fs/http.py#L72
`_session` is a cached property, and is only created once per filesystem instance.
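A simplified sketch of that pattern, assuming a plain requests-based session (the real code in dvc/fs/http.py differs in detail):

```python
from functools import cached_property

import requests


class HTTPFileSystemSketch:
    """Illustrative only; the real implementation lives in dvc/fs/http.py."""

    @cached_property
    def _session(self) -> requests.Session:
        # Created once per filesystem instance; later accesses return
        # the same Session (and its underlying connection pool).
        return requests.Session()

    def download(self, url: str, path: str) -> None:
        with self._session.get(url, stream=True) as resp:
            resp.raise_for_status()
            with open(path, "wb") as fobj:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    fobj.write(chunk)
```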
HTTP pipelining - not that familiar with this one
Regarding HTTP pipelining, do the S3 webservers even support it properly? HTTP/1.1 pipelining isn't supported in most modern browsers due to the general lack of adoption and/or incorrect implementations in servers.
This is very old: https://stackoverflow.com/questions/7752802/does-s3-support-http-pipelining but I can't find anything newer, which seems to indicate that the answer is probably still "no"
In general, my understanding is that w/the move to using async instead of python thread pools (and aiobotocore) we should be able to start getting throughputs that are much faster than native aws-cli/boto, @isidentical should have some more insight on that
In general, my understanding is that w/the move to using async instead of python thread pools (and aiobotocore) we should be able to start getting throughputs that are much faster than native aws-cli/boto, @isidentical should have some more insight on that
For the other async operations, yes that is true. But the problem with a lot of small files vs a single big file is that, the relative significance of the processing stage (e.g parsing the API response for each object) to the actual data transport is much higher on small files compared to a single big file, and those steps are blocking for async. One possible (and kind of tricky) route would be (which I believe there are still some places in DVC and fsspec itself that are not compatible) creating a process-pool and attaching each process it's own event loop, which would scale these blocking operations to the number of cores and when used with async should maximize the throughput. (btw we still use regular boto for the push/pull).
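A rough sketch of that route, assuming plain aiohttp downloads and hypothetical helper names; the real integration with fsspec would be considerably more involved:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp


async def _fetch_one(session: aiohttp.ClientSession, url: str) -> bytes:
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.read()


async def _fetch_batch(urls: list[str]) -> int:
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(_fetch_one(session, u) for u in urls))
    return sum(len(r) for r in results)


def _worker(urls: list[str]) -> int:
    # Each worker process runs its own event loop, so the blocking
    # per-object processing is spread over the available cores.
    return asyncio.run(_fetch_batch(urls))


def download_all(urls: list[str], processes: int = 4) -> int:
    chunks = [urls[i::processes] for i in range(processes)]
    with ProcessPoolExecutor(max_workers=processes) as pool:
        return sum(pool.map(_worker, chunks))
```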
Regarding pipelining, as @pmrowla said, it's not very well supported, and users will run into issues with proxy as well.
buggy proxies, marginal improvements, HOL blocking ...
It's considered a design mistake and has been superseded by multiplexing in HTTP/2. Though we won't see much benefit from this unless we go async + HTTP/2, which we should (`HTTPFileSystem` uses `aiohttp`, which does not support HTTP/2).
Is there any data regarding the adoption rate of HTTP/2 vs HTTP/1.1 pipelining?
S3 does not directly support HTTP/2 unless you are serving it behind CloudFront (or your own proxy server).
Also, just to be clear, HTTP/1.1 pipelining is not the same thing as re-using HTTP connections via the Keep-Alive header. `requests.Session` instances (and DVC) already use connection pools and Keep-Alive for connections in those pools by default.
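As a minimal illustration of that default behavior (the pool sizes and URLs below are arbitrary examples):

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Connections in these pools are kept alive and reused across requests
# to the same host; this is orthogonal to HTTP/1.1 pipelining.
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10)
session.mount("https://", adapter)

for url in ["https://example.com/a", "https://example.com/b"]:
    resp = session.get(url)  # reuses an existing connection when possible
    resp.raise_for_status()
```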
Are we planning to migrate http to fsspec and/or optimize it?
It's considered a design mistake and has been superseded by multiplexing in HTTP/2. Though we won't see much benefit from this unless we go async + HTTP/2, which we should (`HTTPFileSystem` uses `aiohttp`, which does not support HTTP/2).
Do you mean fsspec's `HTTPFileSystem`? I don't see it in the current dvc implementation unless I'm missing something.
It seems like we should be able to fairly easily migrate to fsspec/`aiohttp` at this point? Can we see if it's an improvement before worrying about pipelining or multiplexing yet?
I can test HTTP pipelining with S3 with a script, if you'd like. If I can come up with faster download with a single thread, we can talk about implementing it in the core.
Are we planning to migrate http to fsspec and/or optimize it?
I think moving to fsspec's `HTTPFileSystem` is the plan. Right now, I think it's not possible to upload with it, and there are some performance concerns that we could contribute fixes for.
It seems like we should be able to fairly easily migrate to fsspec/`aiohttp` at this point? Can we see if it's an improvement before worrying about pipelining or multiplexing yet?
I don't think it'll make much difference unless we go async.
We discussed this in planning, and there are a few different items to address in this issue:

- Difference between `get` and `pull` - this will hopefully be addressed soon.
- Optimization of `http` - for now, it seems like performance of `http` is not too different from `s3`, so probably not worth pursuing this now.
- Optimization of download operations for directories - I did a dirty performance comparison:
  - `dvc pull -j 10` took 22 minutes for ~70k files.
  - `s5cmd --numworkers=10 cp "s3://dvc-public/remote/dataset-registry/*" download` took 21 minutes for ~140k files.

So my rough estimate is that `dvc pull` is about 2x slower than `s5cmd` here. It's worth looking into why and hopefully optimizing further, but we shouldn't expect to get this under ~10 minutes for now, at best.
Excellent summary, thanks guys! thanks Dave!
One question though - why don't we increase number of workers by default? I think we should be a bit more aggressive.
Also, one more experiment - what happens with `-j 100` and `--numworkers=100`? Just to see how it scales. Also, how much of your network bandwidth do they take?
One question though - why don't we increase number of workers by default? I think we should be a bit more aggressive.
It is too dangerous to just set it to something bigger without having a proper throttling mechanism, or we will start getting timeout errors or "too many files" errors. Throttling should be handled by each filesystem's `put`/`get` methods, and we are currently working on migrating towards those with amazing changes from @isidentical in fsspec and from @pmrowla in odb transfer.
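For a rough idea of what per-filesystem throttling can look like, here is a hedged sketch using an asyncio semaphore to cap concurrent transfers; the function names are hypothetical and this is not the actual fsspec/DVC code:

```python
import asyncio


async def _get_file(url: str, dest: str) -> None:
    # Placeholder for the real transfer; the actual logic lives in the
    # filesystem implementation (e.g. an async HTTP or S3 client).
    await asyncio.sleep(0)


async def get_files(pairs: list[tuple[str, str]], batch_size: int = 16) -> None:
    # Cap concurrency so thousands of small files don't open thousands
    # of sockets / file descriptors at once and trigger timeouts.
    sem = asyncio.Semaphore(batch_size)

    async def _bounded(url: str, dest: str) -> None:
        async with sem:
            await _get_file(url, dest)

    await asyncio.gather(*(_bounded(u, d) for u, d in pairs))
```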
I don't know the case for S3 (probably the limit is much higher, or it's only limited by bandwidth), but the HTTP/1.1 protocol doesn't allow more than 2 connections per client to a server.
Most servers set this to 4 or a bit higher, but I think the limitation is not the number of jobs or workers, but the server limits. IMO DVC should be conservative by default; a first experience with `dvc pull` in a modest environment shouldn't choke the machine.
I don't know the case for S3, probably the limit is much higher or it's only limited by the bandwidth, but HTTP/1.1 protocol doesn't allow more than 2 connections per client to a server.
I think that's obsoleted by rfc7230.
Previous revisions of HTTP gave a specific number of connections as a ceiling, but this was found to be impractical for many applications. As a result, this specification does not mandate a particular maximum number of connections but, instead, encourages clients to be conservative when opening multiple connections.
Yep, it would be really great to have some dynamic workload balancing/throttling/etc. But even w/o it we could use `max(15, current_value)`, wdyt?
Also, one more experiment - what happens with `-j 100` and `--numworkers=100`? Just to see how it scales. Also, how much of your network bandwidth do they take?
TLDR: It looks like the multiple processes launched by s5cmd do improve performance, whereas the additional threads in dvc don't seem to be helping much.
`s5cmd --numworkers 100` finishes in about 5 minutes for me on multiple tries. Here's a random snapshot of CPU and network activity while files were downloading:
`dvc pull -j 100` varied more across multiple attempts but took up to 20 minutes. Here are CPU and network activity snapshots:
If there is an easy way to measure the total throughput in `dvc fetch`, an adaptive algorithm could be used to set the number of jobs, starting from `jobs=4`. It is environment dependent, but I doubt the number of jobs would go above 10 in most cases.
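A hedged sketch of what such an adaptive loop could look like; the `measure_throughput` callback and the thresholds are hypothetical, since DVC does not currently expose such a hook:

```python
def adapt_jobs(measure_throughput, jobs: int = 4, max_jobs: int = 10) -> int:
    """Grow the number of jobs only while it keeps paying off.

    ``measure_throughput(jobs)`` is a hypothetical callback that downloads
    a small batch with the given parallelism and returns bytes/second.
    """
    best = measure_throughput(jobs)
    while jobs < max_jobs:
        candidate = jobs + 2
        throughput = measure_throughput(candidate)
        if throughput < best * 1.1:  # less than ~10% gain: stop growing
            break
        best, jobs = throughput, candidate
    return jobs
```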
I think that's obsoleted by rfc7230.
Ah, thank you @skshetry. It's still a low number though.
My comparison between `dvc pull` and `s5cmd` is similar to Dave's, but shows more like a 3x difference:

- `s5cmd cp 's3://dvc-public/remote/get-started-experiments/*'` takes 11:35
- `dvc pull -j 100` takes 33:19

These are with 70,000 identical files.

The default `--numworkers` for `s5cmd` is 256. In the above experiment it's set to 10.
Removing the p1 label here because this is clearly not going to be closed in the next sprint or so, which should be the goal for a p1. However, it's still a high priority that we will continue to be working to improve.
Are there plans to address this soon? We have run into scalability issues with DVC due to the poor performance on data sets with a large number of small files.
@series8217, what remote do you use? What is the size of datasets (no. of files, total size, etc)? Thanks.
@series8217, what remote do you use? What is the size of datasets (no. of files, total size, etc)? Thanks.
s3 remote with a local cache on a reflink-supported filesystem. ~3 million tracked files, 3TB total. When tracked across 1000 .dvc files, `dvc pull` with the working copy, local cache, and remote already consistent (i.e. no files to pull other than comparing checksums) takes 3 hours. If I use fewer .dvc files to represent the same 3 million tracked files, it gets faster... roughly half the time with 20 .dvc files, even though they're tracking the same total number of files.
I suspect the checksumming step always takes the same amount of time, but the s3 download of the .dvc files is where the extra hour and a half comes from. Given that the .dvc files are only a few hundred bytes each, this should take only a few minutes.
- DVC can consider directories as implicit `tar` archives. Instead of a directory containing many files, it works with a single tar file per directory in the cache and expands them in `checkout`.

This does not really work as a general solution for DVC, as we would lose the ability to de-duplicate files between directory versions.
Could it be given as an option? In the same way that you can specify "cache"/"push"/... on output entries, could there be an option "granular: false" (the default would be "granular: true", which corresponds to the current behavior)? With the non-granular option, the directory would be considered as an implicit tar, and you would invalidate/download/upload all of it at once.
Using DVC, there are some cases where we do like the granularity of DVC for directories. But in other cases, normally, either all files have changed or none have (e.g. the output of our segmentation networks on our benchmark dataset). We handle this by asking DVC to track archives of our non-granular directories, and the directory itself for our granular directories. We did get a big performance boost, but it adds a bit of friction when navigating the directory. Such an option would be great for our use case.
Bug Report

`get`/`fetch`/`pull`/`push`... so I didn't put a tag.

Description
We have added a directory containing 70,000 small images to the Dataset Registry. There is also a `tar.gz` version of the dataset which is downloaded quickly.

When I issue:
I get an ETA of ~16 hours for 70,000 downloads on my VPS.
This is reduced to ~3 hours on my faster local machine.
I didn't wait to finish these, so the real times may be different but you get the idea.
For `-j 10` it doesn't differ much: `dvc pull` is better; it takes about 20-25 minutes.

(At this point, while writing, a new version was released and the rest of the report is with 2.4.1.)

`dvc pull -j 100` seems to reduce the ETA to 10 minutes. (I waited for `dvc pull -j 100` to finish and it took ~15 minutes.)

I also had this issue while uploading the data in iterative/dataset-registry#18 and we have a discussion there.
Reproduce
or
Expected

We will use this dataset (and `fashion-mnist`, which is similar to this) in example repositories, and we would like to have some acceptable time (<2 minutes) for the whole directory to download.

Environment information
Output of `dvc doctor`:

Some of this report is with 2.3.0 but currently:

Discussion
DVC uses new `requests.Session` objects for its connections, and this requires a new HTTP(S) connection for each file. Although the files are small, establishing a new connection for each file takes time. There is a mechanism in HTTP/1.1 to reuse the same connection, but `requests` doesn't support it.

Note that increasing the number of jobs doesn't make much difference, because servers usually limit the number of connections per IP. Even if you have 100 threads/processes to download, probably only a small number (~4-8) of these can be connected at a time. (I was banned from AWS once while testing the commands with a large `-j`.)

There may be 2 solutions for this:
- DVC can consider directories as implicit `tar` archives. Instead of a directory containing many files, it works with a single tar file per directory in the cache and expands them in `checkout`. `tar` and `gzip` are supported in the Python standard library. This probably requires the whole `Repo` class to be updated though.
- Instead of `requests`, DVC can use a custom solution or another library like `dugong` that supports HTTP pipelining. I didn't test any HTTP pipelining solution in Python, so I can't vouch for any of them, but this may be better for all asynchronous operations using HTTP(S).