@dberenbaum could you maybe share the profile from `dvc push --viztracer --viztracer-depth 8`? It should still generate a profile when manually interrupted.
I have been testing with a larger number of files, and for me it takes quite some time to build the data index (https://github.com/iterative/dvc/blob/dd2d2dce198b9ee48ad27059932c4cb630f3bd0c/dvc/repo/worktree.py#L112) when the folder added as a worktree remote has a significant number of files.
It's slow because we have to walk the worktree remote so we can do the diff to see what needs to be pushed and what needs to be deleted in the remote. We probably need to display some kind of message while it is building the index, but we can't use a regular progressbar because we don't know how many files there will be in the remote.
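As a rough illustration (not DVC's actual code path), building that index amounts to walking the whole remote prefix with fsspec and recording size/ETag/version info for every object, which is why there is no known total to drive a progress bar while it runs. The bucket name and prefix below are placeholders:

```python
# Hypothetical sketch: walking a worktree remote to build an index for diffing.
# Bucket/prefix are placeholders; this is not DVC's actual implementation.
from s3fs import S3FileSystem

fs = S3FileSystem(version_aware=True)  # versioned listing, as discussed below

remote_index = {}
# find() has to enumerate every object under the prefix before the total is
# known, so there is nothing to base a percentage progress bar on meanwhile.
for path, info in fs.find("my-bucket/worktree-prefix", detail=True).items():
    remote_index[path] = {
        "size": info.get("size"),
        "etag": info.get("ETag"),
        "version_id": info.get("VersionId"),
    }
```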
Compare the video above to this one, which pushes a dataset ~3x bigger:
https://user-images.githubusercontent.com/2308172/213544969-d65a7701-314e-46f5-a6aa-c9c8cf989f45.mov
In the first video, I've checked that the "hanging" happens for ~5 minutes, while it lasts almost no time in the second video.
The only difference I can tell is that there are more previous versions of objects in the first video. Why should that matter here if we are only checking the current versions?
It's because in S3 you have to list all object versions if you want any versioning information at all.
🤔 Why isn't it a problem for `version_aware` remotes? Don't we need version info there also?
And do we need the versions when pushing to a worktree remote? If we just want to check the current version, can we use ETags?
Edit: There also might be workarounds like https://stackoverflow.com/a/72281706
@daavoo For some reason I'm getting an error trying to view this, but here's the JSON file:
> 🤔 Why isn't it a problem for `version_aware` remotes? Don't we need version info there also?

For `version_aware` we only need to check versions that we know about (because we think we pushed them already). For `worktree` we also have to check for files that exist in the remote but not in our repo (and then delete them). For `version_aware` we can just ignore files we don't know about at all.

The `version_aware` check will be faster in cases where using individual `exists` calls completes faster than listing the entire remote's worth of existing objects. In `worktree` we can't use `exists`, since that doesn't help us with deleting files that are in the remote but not in the DVC workspace.
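A minimal sketch of the difference, with hypothetical `remote_fs`, `known_versions`, and `workspace_files` names standing in for DVC's real data structures:

```python
# Hypothetical sketch of the two checking strategies described above.
# `remote_fs` is an fsspec filesystem; `known_versions` maps remote paths we
# believe we already pushed to their version IDs; `workspace_files` is the set
# of paths currently tracked in the DVC workspace.

def version_aware_check(remote_fs, known_versions):
    # Only probe versions we already know about; unknown remote files are ignored.
    # (s3fs-style "?versionId=" path suffix, assuming a version-aware filesystem.)
    return {
        path: remote_fs.exists(f"{path}?versionId={version_id}")
        for path, version_id in known_versions.items()
    }

def worktree_check(remote_fs, prefix, workspace_files):
    # Must list the whole remote to discover files that exist remotely but not
    # locally, because those are the ones that need to be deleted.
    remote_files = set(remote_fs.find(prefix))
    to_delete = remote_files - workspace_files
    to_push = workspace_files - remote_files
    return to_push, to_delete
```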
> @daavoo For some reason I'm getting an error trying to view this, but here's the JSON file:

I think it's because it's too big. On my computer Firefox breaks, but Chrome manages to render it.
> @daavoo For some reason I'm getting an error trying to view this, but here's the JSON file:

Wow, I have never seen this 😅 The profile is completely blank inside `checkout` for 200s before reaching the `sum`:

Anyhow, it looks like you ran it before the changes made in #8842. Could you rerun with the latest version?
> Edit: There also might be workarounds like https://stackoverflow.com/a/72281706

`head-object` only works for individual files, and it's what we already do for querying an individual object.
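For reference, querying a single object this way (via boto3's `head_object`; the bucket and key below are placeholders) already returns both the ETag and, on versioned buckets, the current version ID:

```python
# Placeholder bucket/key; head_object only ever describes one object at a time.
import boto3

s3 = boto3.client("s3")
resp = s3.head_object(Bucket="my-bucket", Key="worktree-prefix/cats-dogs/1.jpg")
print(resp["ETag"])           # ETag of the current version
print(resp.get("VersionId"))  # current version ID (absent if versioning is off)
```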
> If we just want to check the current version, can we use ETags?

This could work, but will probably require some adjustment in dvc-data, because we will essentially need to mix non-versioned filesystem calls with versioned ones.

The issue is that when listing a prefix (which is the only way we can find files we don't know about), you can either do `list_objects_v2`, which does not return any version information at all, or `list_object_versions`, which returns all object versions. The current fsspec s3fs implementation always uses `list_object_versions` when doing any directory listing, since it is the only way for s3fs to make sure it gets version information for everything. So to get the "ETags only" listing from S3, we need a non-versioned s3fs instance (to use `list_objects_v2`).

Also, this is only a problem on S3. The azure and gcs APIs are designed properly so that you can list only the latest versions of a prefix and get a response that includes version IDs for those latest versions. (Using the ETag method won't improve performance vs using IDs on azure or gcs.)
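To make the tradeoff concrete, here is a small boto3 sketch (bucket and prefix are placeholders, pagination omitted) showing what each S3 listing call returns:

```python
# Placeholder bucket/prefix; illustrates the two S3 listing APIs discussed above.
import boto3

s3 = boto3.client("s3")

# Current versions only: a single pass over the prefix, but no VersionId at all.
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="worktree-prefix/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["ETag"])  # ETag is available, VersionId is not

# Version information: every version of every object comes back, which is what
# makes this call slow on buckets with a lot of version history.
resp = s3.list_object_versions(Bucket="my-bucket", Prefix="worktree-prefix/")
for ver in resp.get("Versions", []):
    if ver["IsLatest"]:
        print(ver["Key"], ver["ETag"], ver["VersionId"])
```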
> ~~For `version_aware` we only need to check versions that we know about (because we think we pushed them already). For `worktree` we have to also check for files that exist in the remote but not in our repo (and then delete them). For `version_aware` we can just ignore files we don't know about at all.~~

~~Don't we only need to get the current versions for that? Why do we need the version IDs of those files to know what to delete?~~

Edit: addressed above by @pmrowla
I've narrowed down the problem a little: it only happens on the initial push, when there is no existing cloud metadata.
> Could you rerun with the latest version?

On the latest version, `dvc push` seems to wrongly think everything is up to date:

```
$ dvc push -r worktree cats-dogs
Everything is up to date.

$ cat cats-dogs.dvc
outs:
- md5: 22e3f61e52c0ba45334d973244efc155.dir
  size: 64128504
  nfiles: 2800
  path: cats-dogs

$ dvc config -l
core.analytics=false
remote.worktree.url=s3://dave-sandbox-versioning/test/worktree
remote.worktree.worktree=true

$ aws s3 ls --recursive s3://dave-sandbox-versioning/test/worktree/
# returns nothing
```
> On the latest version, `dvc push` seems to wrongly think everything is up to date:
https://github.com/iterative/dvc/pull/8842 seems to have fixed the "hanging" issue and made worktree and version_aware remotes perform similarly.
However, it also slowed down the overall operation, spending a lot of time on "Updating meta for new files":
https://user-images.githubusercontent.com/2308172/213741985-804ee1a6-c815-4309-8144-116dd0408e70.mov
I'm assuming it's related to listing all object versions, but I'm not clear why it's so much worse in the new version.
Here are results that I think you can reproduce more or less by:

```
git clone https://github.com/dberenbaum/cloud-versioning-test
dvc pull
dvc push -r versioned    # or: dvc push -r worktree
```
@dberenbaum, can you try running `viztracer` with `--log_async`, please? See https://viztracer.readthedocs.io/en/latest/concurrency.html#asyncio.

Anyway, it seems like it's calling `fs.info()` after the checkout calls for all the files to update metadata, which is taking time.
> However, it also slowed down the overall operation, spending a lot of time on "Updating meta for new files":

@dberenbaum is your actual real-world time performance worse than it was before?
The "updating meta" behavior is the same as it was before, the only difference is that gets a separate progressbar now. Previously it was included in the overall push
and showed up as an individual upload progressbar sitting at 100% before the "total push" bar would actually increment for a completed file upload
edit: actually I see the issue, the info calls were previously batched and are done sequentially now, will send a PR with fixes
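As a rough illustration of the regression (not the actual dvc-data code), the difference is essentially sequential metadata lookups versus batched ones; `fs` and `paths` below are placeholders:

```python
# Hypothetical sketch contrasting sequential vs batched fs.info() calls.
# `fs` is an fsspec filesystem and `paths` the files whose meta needs updating.
from concurrent.futures import ThreadPoolExecutor

def update_meta_sequential(fs, paths):
    # One round trip per file: slow when there are thousands of files.
    return {path: fs.info(path) for path in paths}

def update_meta_batched(fs, paths, jobs=16):
    # Overlapping the round trips brings wall-clock time back down.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(zip(paths, pool.map(fs.info, paths)))
```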
The updating meta issue should be fixed with `dvc-data==0.35.1` (https://github.com/iterative/dvc/pull/8857).
Thanks everyone for your help solving this so quickly! Looks good now, closing.
## Bug Report

### Description

When I push to a worktree remote, I often get stuck in this state for minutes (I have not left it hanging long enough to see if it eventually completes):
### Reproduce

I'm not yet sure how to create a minimal example that reproduces it, but it happens often when testing. Here are the steps I have taken to reproduce it:

1. `dvc get git@github.com:iterative/dataset-registry.git use-cases/cats-dogs` to get some data.
2. `dvc add cats-dogs` to track the data.
3. `dvc push` to that worktree remote.

And here's a video of it:
https://user-images.githubusercontent.com/2308172/213191889-d9ca22d0-608f-4b16-9460-360f21368d53.mov
### Output of `dvc doctor`: