@dberenbaum could you maybe share the profile from `dvc push --viztracer --viztracer-depth 8`? It should still generate a profile when manually interrupted.
I have been testing with a larger number of files, and for me it takes quite some time to build the data index (https://github.com/iterative/dvc/blob/dd2d2dce198b9ee48ad27059932c4cb630f3bd0c/dvc/repo/worktree.py#L112) when the folder added as a worktree remote has a significant number of files.
It's slow because we have to walk the worktree remote so we can do the diff to see what needs to be pushed and what needs to be deleted in the remote. We probably need to display some kind of message while it is building the index, but we can't use a regular progressbar because we don't know how many files there will be in the remote.
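As a rough illustration (not DVC's actual code path), building that index amounts to walking the whole remote prefix with fsspec and recording size/ETag/version info for every object, which is why there is no known total to drive a progress bar while it runs. The bucket name and prefix below are placeholders:

```python
# Hypothetical sketch: walking a worktree remote to build an index for diffing.
# Bucket/prefix are placeholders; this is not DVC's actual implementation.
from s3fs import S3FileSystem

fs = S3FileSystem(version_aware=True)  # versioned listing, as discussed below

remote_index = {}
# find() has to enumerate every object under the prefix before the total is
# known, so there is nothing to base a percentage progress bar on meanwhile.
for path, info in fs.find("my-bucket/worktree-prefix", detail=True).items():
    remote_index[path] = {
        "size": info.get("size"),
        "etag": info.get("ETag"),
        "version_id": info.get("VersionId"),
    }
```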
Compare the video above to this one, which pushes a dataset ~3x bigger:
https://user-images.githubusercontent.com/2308172/213544969-d65a7701-314e-46f5-a6aa-c9c8cf989f45.mov
In the first video, I've checked that the "hanging" happens for ~5 minutes, while it lasts almost no time in the second video.
The only difference I can tell is that there are more previous versions of objects in the first video. Why should that matter here if we are only checking the current versions?
It's because in S3 you have to list all object versions if you want any versioning information at all.
🤔 Why isn't it a problem for `version_aware` remotes? Don't we need version info there also?
And do we need the versions when pushing to a worktree remote? If we just want to check the current version, can we use ETags?
Edit: There also might be workarounds like https://stackoverflow.com/a/72281706
@daavoo For some reason I'm getting an error trying to view this, but here's the JSON file:
> 🤔 Why isn't it a problem for `version_aware` remotes? Don't we need version info there also?

For `version_aware` we only need to check versions that we know about (because we think we pushed them already). For `worktree` we also have to check for files that exist in the remote but not in our repo (and then delete them). For `version_aware` we can just ignore files we don't know about at all.

The `version_aware` check will be faster in cases where using individual `exists` calls completes faster than listing the entire remote's worth of existing objects. In `worktree` we can't use `exists`, since that doesn't help us with deleting files that are in the remote but not in the DVC workspace.
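A minimal sketch of the difference, with hypothetical `remote_fs`, `known_versions`, and `workspace_files` names standing in for DVC's real data structures:

```python
# Hypothetical sketch of the two checking strategies described above.
# `remote_fs` is an fsspec filesystem; `known_versions` maps remote paths we
# believe we already pushed to their version IDs; `workspace_files` is the set
# of paths currently tracked in the DVC workspace.

def version_aware_check(remote_fs, known_versions):
    # Only probe versions we already know about; unknown remote files are ignored.
    # (s3fs-style "?versionId=" path suffix, assuming a version-aware filesystem.)
    return {
        path: remote_fs.exists(f"{path}?versionId={version_id}")
        for path, version_id in known_versions.items()
    }

def worktree_check(remote_fs, prefix, workspace_files):
    # Must list the whole remote to discover files that exist remotely but not
    # locally, because those are the ones that need to be deleted.
    remote_files = set(remote_fs.find(prefix))
    to_delete = remote_files - workspace_files
    to_push = workspace_files - remote_files
    return to_push, to_delete
```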
> @daavoo For some reason I'm getting an error trying to view this, but here's the JSON file:

I think it's because it's too big. On my computer Firefox breaks, but Chrome manages to render it.
> @daavoo For some reason I'm getting an error trying to view this, but here's the JSON file:

Wow, I have never seen this 😅 The profile is completely blank inside `checkout` for 200s before reaching the `sum`:

Anyhow, it looks like you ran it before the changes made in #8842. Could you rerun with the latest version?
> Edit: There also might be workarounds like https://stackoverflow.com/a/72281706

`head-object` only works for individual files, and it's what we already do for querying an individual object.
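For reference, querying a single object this way (via boto3's `head_object`; the bucket and key below are placeholders) already returns both the ETag and, on versioned buckets, the current version ID:

```python
# Placeholder bucket/key; head_object only ever describes one object at a time.
import boto3

s3 = boto3.client("s3")
resp = s3.head_object(Bucket="my-bucket", Key="worktree-prefix/cats-dogs/1.jpg")
print(resp["ETag"])           # ETag of the current version
print(resp.get("VersionId"))  # current version ID (absent if versioning is off)
```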
> If we just want to check the current version, can we use ETags?

This could work, but will probably require some adjustment in dvc-data, because we will essentially need to mix non-versioned filesystem calls with versioned ones.

The issue is that when listing a prefix (which is the only way we can find files we don't know about), you can either do `list_objects_v2`, which does not return any version information at all, or `list_object_versions`, which returns all object versions. The current fsspec s3fs implementation always uses `list_object_versions` when doing any directory listing, since it is the only way for s3fs to make sure it gets version information for everything. So to get the "ETags only" listing from S3, we need a non-versioned s3fs instance (to use `list_objects_v2`).

Also, this is only a problem on S3. The azure and gcs APIs are designed properly so that you can list only the latest versions of a prefix and get a response that includes version IDs for those latest versions. (Using the ETag method won't improve performance vs using IDs on azure or gcs.)
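To make the tradeoff concrete, here is a small boto3 sketch (bucket and prefix are placeholders, pagination omitted) showing what each S3 listing call returns:

```python
# Placeholder bucket/prefix; illustrates the two S3 listing APIs discussed above.
import boto3

s3 = boto3.client("s3")

# Current versions only: a single pass over the prefix, but no VersionId at all.
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="worktree-prefix/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["ETag"])  # ETag is available, VersionId is not

# Version information: every version of every object comes back, which is what
# makes this call slow on buckets with a lot of version history.
resp = s3.list_object_versions(Bucket="my-bucket", Prefix="worktree-prefix/")
for ver in resp.get("Versions", []):
    if ver["IsLatest"]:
        print(ver["Key"], ver["ETag"], ver["VersionId"])
```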
> ~~For `version_aware` we only need to check versions that we know about (because we think we pushed them already). For `worktree` we have to also check for files that exist in the remote but not in our repo (and then delete them). For `version_aware` we can just ignore files we don't know about at all.~~

~~Don't we only need to get the current versions for that? Why do we need the version IDs of those files to know what to delete?~~

Edit: addressed above by @pmrowla
I've narrowed down the problem a little: it only happens on the initial push, when there is no existing cloud metadata.
> Could you rerun with the latest version?

On the latest version, `dvc push` seems to wrongly think everything is up to date:

```
$ dvc push -r worktree cats-dogs
Everything is up to date.

$ cat cats-dogs.dvc
outs:
- md5: 22e3f61e52c0ba45334d973244efc155.dir
  size: 64128504
  nfiles: 2800
  path: cats-dogs

$ dvc config -l
core.analytics=false
remote.worktree.url=s3://dave-sandbox-versioning/test/worktree
remote.worktree.worktree=true

$ aws s3 ls --recursive s3://dave-sandbox-versioning/test/worktree/
# returns nothing
```
> On the latest version, `dvc push` seems to wrongly think everything is up to date:
https://github.com/iterative/dvc/pull/8842 seems to have fixed the "hanging" issue and made worktree and version_aware remotes perform similarly.
However, it also slowed down the overall operation, spending a lot of time on "Updating meta for new files":
https://user-images.githubusercontent.com/2308172/213741985-804ee1a6-c815-4309-8144-116dd0408e70.mov
I'm assuming it's related to listing all object versions, but I'm not clear why it's so much worse in the new version.
Here are results that I think you can reproduce more or less by:

```
git clone https://github.com/dberenbaum/cloud-versioning-test
dvc pull
dvc push -r versioned    # or: dvc push -r worktree
```
@dberenbaum, can you try running `viztracer` with `--log_async`, please? See https://viztracer.readthedocs.io/en/latest/concurrency.html#asyncio.

Anyway, it seems like it's calling `fs.info()` after the checkout calls for all the files to update metadata, which is taking time.
> However, it also slowed down the overall operation, spending a lot of time on "Updating meta for new files":

@dberenbaum is your actual real-world time performance worse than it was before?
The "updating meta" behavior is the same as it was before, the only difference is that gets a separate progressbar now. Previously it was included in the overall push
and showed up as an individual upload progressbar sitting at 100% before the "total push" bar would actually increment for a completed file upload
edit: actually I see the issue, the info calls were previously batched and are done sequentially now, will send a PR with fixes
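As a rough illustration of the regression (not the actual dvc-data code), the difference is essentially sequential metadata lookups versus batched ones; `fs` and `paths` below are placeholders:

```python
# Hypothetical sketch contrasting sequential vs batched fs.info() calls.
# `fs` is an fsspec filesystem and `paths` the files whose meta needs updating.
from concurrent.futures import ThreadPoolExecutor

def update_meta_sequential(fs, paths):
    # One round trip per file: slow when there are thousands of files.
    return {path: fs.info(path) for path in paths}

def update_meta_batched(fs, paths, jobs=16):
    # Overlapping the round trips brings wall-clock time back down.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(zip(paths, pool.map(fs.info, paths)))
```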
The updating meta issue should be fixed with `dvc-data==0.35.1` (https://github.com/iterative/dvc/pull/8857).
Thanks everyone for your help solving this so quickly! Looks good now, closing.
## Bug Report

### Description

When I push to a worktree remote, I often get stuck in this state for minutes (I have not left it hanging long enough to see if it eventually completes):
### Reproduce

I'm not yet sure how to create a minimal example that reproduces it, but it happens often when testing. Here are the steps I have taken to reproduce it:

1. `dvc get git@github.com:iterative/dataset-registry.git use-cases/cats-dogs` to get some data.
2. `dvc add cats-dogs` to track the data.
3. `dvc push` to that worktree remote.

And here's a video of it:
https://user-images.githubusercontent.com/2308172/213191889-d9ca22d0-608f-4b16-9460-360f21368d53.mov
### Output of `dvc doctor`: