iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

`dvc list`: handle local repos differently? #3590

Closed ahmed-shariff closed 1 year ago

ahmed-shariff commented 4 years ago

When I run the command dvc list . from any sub-directory of the project I get the following error:

ERROR: failed to list '.' - Failed to clone repo '.' to '/tmp/tmp2qmfnem7dvc-clone': Cmd('git') failed due to: exit code(128)
  cmdline: git clone --no-single-branch -v . /tmp/tmp2qmfnem7dvc-clone
  stderr: 'fatal: repository '.' does not exist
'

It works, though, when executed from the root directory of the project.

DVC: 0.91.1 (arch linux;pip)


UPDATED (@shcheklein):

repurposed it a bit - https://github.com/iterative/dvc/issues/3590#issuecomment-612558038

efiop commented 4 years ago

Hi @ahmed-shariff !

dvc list expects to receive a URL to a git repo, which . isn't when you are not in the git repo root. It's the same reason you can't git clone that directory. Theoretically, we could check whether the URL that you pass is a git repo subdir, but I'm not sure it is worth the effort, and it would probably also lead to misuse, where people would try to use it as ls in their subdirs.
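
For illustration, the "check if the URL is a git repo subdir" idea could look roughly like the sketch below. This is not DVC's implementation: `find_repo_root` is a hypothetical helper, and it assumes that a repo root is simply the nearest ancestor directory containing a `.git` entry.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: str) -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains a
    `.git` entry, or None if no such ancestor exists.

    Hypothetical helper for illustration only; not part of DVC.
    """
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        # `.git` is a directory in a regular clone and a plain file in
        # worktrees and submodules, so only existence is checked.
        if (candidate / ".git").exists():
            return candidate
    return None
```

Called with a sub-directory of a clone, this returns the clone's root, which is the directory that `git clone` (and therefore the cloning `dvc list` of that time) would actually accept.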

ahmed-shariff commented 4 years ago

I see. Thank you for the clarification.

shcheklein commented 4 years ago

To be honest I think it makes sense to handle this as a special case:

@iterative/engineering @casperdcl thoughts?

(reopening, since it's annoying to remember the special syntax when I deal with the local repo, and other users were caught by surprise)

casperdcl commented 4 years ago

@shcheklein I agree

jamessergeant commented 4 years ago

I'm using a local remote and I receive the same error when running dvc list <path-to-local-remote>. Is this expected behaviour?

For reference the "local remote" is actually a mounted network drive.

ERROR: failed to list '/mnt/dr_dvc/vision/dataset_registry' - Failed to clone repo '/mnt/dr_dvc/vision/dataset_registry' to '/tmp/tmp9bkq340cdvc-clone': Cmd('git') failed due to: exit code(128)
  cmdline: git clone --no-single-branch -v /mnt/dr_dvc/vision/dataset_registry /tmp/tmp9bkq340cdvc-clone
  stderr: 'fatal: repository '/mnt/dr_dvc/vision/dataset_registry' does not exist

jorgeorpinel commented 4 years ago

Hi @jamessergeant, dvc list expects the path or URL to the DVC repository itself, not to a remote storage location. In fact I believe it doesn't check remotes at all to produce the list. Whether the data exists in remote storage is not guaranteed by dvc list. You have to attempt dvc get or dvc import to find out.

jorgeorpinel commented 4 years ago

p.s. guys I'm updating the dvc list cmd ref in iterative/dvc.org/pull/1174

andrewcstewart commented 4 years ago

Not sure if this belongs on this issue as well, but this also applies when running dvc list inside of a dvc repo that was created with dvc init --subdir.

efiop commented 4 years ago

@andrewcstewart Not directly. dvc list support for subrepos will be implemented as a part of https://github.com/iterative/dvc/issues/3369

efiop commented 3 years ago

For the record: we are no longer cloning local repos, opening them directly instead. The only thing left is to make the CLI convenient for local use. E.g.

dvc list # should be same as dvc list .
cd subdir && dvc list # should be same as dvc list . subdir
dvc list dir # should be the same as dvc list . dir

This is a bit odd in terms of CLI argument semantics, as it will have to rely on some heuristics, but it should still be pretty convenient. An alternative might be to make the url explicit, similar to early dvc list implementations, e.g. dvc list path_in_repo --url url, but that might be an even harder pill to swallow. Both approaches will raise questions about dvc import/get too, but those are clearly unusual to use locally.
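
A minimal sketch of the cwd-aware behaviour described in this comment, assuming the repo root is already known. `resolve_local_target` is a hypothetical name, not a DVC function; it only shows how a path given relative to the current directory could be turned into the (url, path-in-repo) pair that dvc list takes today.

```python
import os
from typing import Optional, Tuple


def resolve_local_target(
    repo_root: str, cwd: str, path: Optional[str] = None
) -> Tuple[str, str]:
    """Translate a cwd-relative `path` into (url, path_in_repo).

    Hypothetical helper: `repo_root` plays the role of the `url` argument
    and the second element is the path relative to that root.
    """
    # Interpret the path the way `ls` would: relative to the cwd.
    target = os.path.normpath(os.path.join(cwd, path or "."))
    rel = os.path.relpath(target, repo_root)
    # Reject targets that escape the repository.
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError(f"{target!r} is outside of {repo_root!r}")
    return repo_root, rel
```

With a repo at `/repo`, calling it from `/repo/subdir` with no path yields `("/repo", "subdir")`, which corresponds to the `cd subdir && dvc list` case listed in the comment.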

efiop commented 3 years ago

Current CLI:

usage: dvc list [-h] [-q | -v] [-R] [--dvc-only] [--rev [<commit>]] url [path]      

and with the proposed heuristics it will be:

usage: dvc list [-h] [-q | -v] [-R] [--dvc-only] [--rev [<commit>]] [url] [path]

and if

a problem that we are creating here for future us - not being able to accept multiple targets (same problem as we have right now in list/get/import but now worse). An explicit --project/--url/etc for that would make it clearer. CC @dberenbaum @jorgeorpinel
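
To make the ambiguity concrete, the proposed usage line can be mimicked with argparse. This is an illustrative sketch and not DVC's actual parser; only the flags shown in the usage lines in this comment are modelled, and defaulting `url` to `.` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Toy parser mirroring `dvc list ... [url] [path]` (illustration only)."""
    parser = argparse.ArgumentParser(prog="dvc list")
    parser.add_argument("-R", dest="recursive", action="store_true")
    parser.add_argument("--dvc-only", action="store_true")
    parser.add_argument("--rev", nargs="?", metavar="<commit>")
    # Both positionals are optional; argparse fills them left to right.
    parser.add_argument("url", nargs="?", default=".")
    parser.add_argument("path", nargs="?")
    return parser


args = build_parser().parse_args(["dir"])
# A lone positional is bound to `url`, not `path`.
print(args.url, args.path)
```

Because a single positional always lands in `url`, something beyond the parser has to decide whether `dir` is a repo location or a path inside the current repo. The same left-to-right binding is why multiple targets cannot be expressed without an explicit flag.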

jorgeorpinel commented 3 years ago

I'm all for unifying list, get, import UI

> not being able to accept multiple targets (same problem as we have right now in list/get/import but now worse). An explicit --project/--url/etc for that would make it clearer

Not seeing a need to list multiple targets. Maybe get/import but what about using the import-url interface instead, where url includes location and path? That makes it easy to accept several ones.

BTW is this issue solved/outdated?

efiop commented 3 years ago

> Not seeing a need to list multiple targets. Maybe get/import but what about using the import-url interface instead, where url includes location and path? That makes it easy to accept several ones.

@jorgeorpinel Doesn't work with git urls.

> BTW is this issue solved/outdated?

Not the last part of it regarding handling local path as a target. Hence my questions.

dberenbaum commented 3 years ago

> Not seeing a need to list multiple targets.

That's my initial thought. Are we aware of a need for this? If this was a new command, I'd prefer --url, but I wouldn't push to change it if there's no need.

> Both approaches will raise questions about dvc import/get too, but those are clearly unusual to use locally.

By locally, you mean from inside the repo itself? I can't imagine dvc import . path being useful. Or are there other questions these changes raise about import/get?

efiop commented 3 years ago

> That's my initial thought. Are we aware of a need for this? If this was a new command, I'd prefer --url, but I wouldn't push to change it if there's no need.

@dberenbaum No requests or anything yet. Just looking into the possible future 🙂

> By locally, you mean from inside the repo itself? I can't imagine dvc import . path being useful. Or are there other questions these changes raise about import/get?

Yep, from within the project or from another local project.

Btw, another interesting confusion is that people tend to use gs:// or s3:// or another dvc remote as an argument instead of a git url. So maybe an explicit --url or, better, --project flag would clear up the confusion in all of the commands. Btw, that would even open up the possibility of unifying import/import-url (and get/get-url) into one command each (dvc import and dvc get), since we'd have an explicit flag to differentiate the use cases of otherwise very similar commands. Though there has been some arguing about it even back when it was introduced (wish we had rfcs from back then 😉).

Anyway, a quick, local and intuitive solution is to go with that [url] [path] solution I've suggested above. If everyone is okay with it, of course.
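
As a rough illustration of the gs:// / s3:// confusion mentioned in this comment, a command could detect that the argument looks like storage rather than a git repo and say so early. The helper below is hypothetical, not something DVC is known to do; it only covers the two schemes named here.

```python
from urllib.parse import urlparse

# Only the schemes named in the comment; other storage schemes are left out
# on purpose rather than guessed.
STORAGE_SCHEMES = {"gs", "s3"}


def looks_like_storage_url(url: str) -> bool:
    """Return True if `url` uses a cloud-storage scheme instead of pointing
    at a git repo. Hypothetical helper for illustration only."""
    return urlparse(url).scheme.lower() in STORAGE_SCHEMES
```

A caller could use this to print a hint such as "expected a git repo URL or local path" instead of surfacing a failed clone.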

jorgeorpinel commented 3 years ago

In short, make url optional, default to .? Sounds good, but ideally should apply to get/import* too (for UI consistency).

For the future, if --url helps unify get and import interfaces I'm all for it.

shcheklein commented 3 years ago

Getting back to this (as I'm playing more with it). It would significantly improve the usability of dvc list locally if we gave it ls semantics (recognizing the cwd automatically).

E.g. I was trying to see what outputs exist in the https://github.com/iterative/get-started-experiments/:

cd data
cd fashion-mnist
dvc list .

It returns root:

.dvcignore
.env
.gitignore
README.md
...
dvc.yaml
src

Trying:

dvc list . .

Also returns root.

dvc list . data/fashion-mnist/prepared

Fails:

ERROR: failed to list '.' - The path 'data/fashion-mnist/prepared' does not exist in the target repository '/Users/ivan/Projects/get-started-experiments' neither as a DVC output nor as a Git-tracked file.

dvc list -R . data/fashion-mnist

Also fails.

and so on ... to be honest, I'm lost as to how I can list them at this point ... looks like there are a few bugs + this behavior is inconsistent depending on the (path, cwd) pair

shcheklein commented 3 years ago

I think this will also be part of making list, diff, etc. stable enough to integrate properly with VS Code.

skshetry commented 3 years ago

> to be honest, I'm lost as to how I can list them at this point ... looks like there are a few bugs + this behavior is inconsistent depending on the (path, cwd) pair

The workaround for now is to run dvc list $(git rev-parse --show-toplevel) <path relative to git root>.

machalx commented 3 years ago

Hi, any update on dvc list for the already cloned repo? It still returns the project's root.

efiop commented 3 years ago

@machalx No updates so far, unfortunately 🙁

Toekan commented 2 years ago

Hi,

Thanks for your work on dvc.

I have a related question: I'm trying to use dvc get path/to/local_dvc_project name_of_file_to_download and I'm getting:

ERROR: failed to get 'name_of_file_to_download' from 'local_dvc_project' - Failed to clone repo 'local_dvc_project'.

Has there been any work on allowing dvc get to be used without git? I saw people wondering about use cases: I want to use dvc to download some files at runtime in a docker container, at which point I have no git credentials. Ideally I would just use boto3, but at the moment I'm not sure how to reconstruct the path to the file I want to download.

pared commented 2 years ago

@Toekan let's move the conversation to #7270

dberenbaum commented 2 years ago

> Hi, any update on dvc list for the already cloned repo? It still returns the project's root.

You should be able to list the contents of a relative path now, although the first argument will still be interpreted as the repo URL (as in dvc list . [relative_path]).

efiop commented 1 year ago

I'll close this for now.