iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0
13.36k stars 1.16k forks source link

Reimport checkout #10388

Open dberenbaum opened 2 months ago

dberenbaum commented 2 months ago

Fixes https://github.com/iterative/dvc/issues/10255. This is a partial reversion of https://github.com/iterative/dvc/pull/9246. I'm not sure this is the cleanest or best way to do it, but it solves the problem.

It builds the hashfile object and tries to checkout from that, falling back to download if it fails.

Note that the build step is still slow for 2.x imports.

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 90.46%. Comparing base (dd2b515) to head (d510ad6). Report is 2 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #10388 +/- ## ========================================== - Coverage 90.72% 90.46% -0.27% ========================================== Files 501 501 Lines 38865 38885 +20 Branches 5618 5619 +1 ========================================== - Hits 35262 35176 -86 - Misses 2961 3038 +77 - Partials 642 671 +29 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

dberenbaum commented 2 months ago

Ping @skshetry

skshetry commented 2 months ago

I'll need to investigate. But we set cache.dir to point to the local cache, which should have made the DVCFileSystem use that cache.

https://github.com/iterative/dvc/blob/d510ad612f0d45b52110977b14f352a5dfe4c064/dvc/dependency/repo.py#L154

I am a bit hesitant of this PR as it goes against https://github.com/iterative/dvc/pull/9246, and adds another (complex) codepath.

dberenbaum commented 2 months ago

Yes, I looked into it, and the local cache dir is set, but it's not enough because it's doing download/copy and not checkout.

download() is defined here:

https://github.com/iterative/dvc/blob/7df8b0c09b6c7d527083cacafc1802fd53c47900/dvc/fs/__init__.py#L47-L76

It calls generic.copy() (https://github.com/iterative/dvc-objects/blob/ae8b95d7ea3bbcf46fc5a09eddc7c4f614023d7b/src/dvc_objects/fs/generic.py#L69), which runs fs.get_file() operations on paths and knows nothing about the cache.

I don't think there is a much simpler way to do it than this PR, and I tried to make it as narrow as possible. It uses #9246 to rely on the local cache, building the object and passing it to dvc-data's checkout. We don't have imported_file.dvc, so there is no way to do a higher-level checkout operation AFAICT.

skshetry commented 2 months ago

which runs fs.get_file() operations on paths and knows nothing about the cache.

fs is a DVCFileSystem, and knows about the cache.

dberenbaum commented 2 months ago

fs is a DVCFileSystem, and knows about the cache.

Yes, but it is doing a copy from that fs to localfs. It is not linking, and even if I try to enable linking, it fails because it is working across filesystems.

Honzys commented 3 weeks ago

Hello, is there any update with this merge request? It would help us a ton if it gets merged.

dberenbaum commented 3 weeks ago

@skshetry Is there a good reason to keep this blocked? I'm happy to revert if we find time to resolve in a cleaner way later.