iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

[DRAFT] Explore skipping graph checks #10425

Open Erotemic opened 1 month ago

Erotemic commented 1 month ago

A PR where I'm going to experiment with solutions to the issue discussed in https://github.com/iterative/dvc/discussions/10415

The basic idea is that running dvc add (and other commands) walks the entire repo, including files that are not indexed by DVC, to look for DVC files. This is done to handle granular modifications, but that use case is not always needed. It would be nice to be able to disable these checks so that the runtime of dvc add and similar operations is independent of how many other files are in the repo.
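To make the cost concrete, here is a minimal sketch of the kind of walk described above. It is not DVC's actual collection code; the function name find_dvc_files and the choice of which directories to skip are assumptions made for illustration. The point is only that the work grows with the total number of files and directories in the repo, not with the size of the target being added.

```python
import os
from typing import Iterator


def find_dvc_files(root: str) -> Iterator[str]:
    """Yield paths under root that look like DVC metadata files.

    Illustrative only: a hypothetical helper, not DVC's real implementation.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune internal directories in place so os.walk does not descend
        # into them (assumed skip list, chosen for this sketch).
        dirnames[:] = [d for d in dirnames if d not in {".git", ".dvc"}]
        for name in filenames:
            # Every file name in the tree is inspected, so the runtime is
            # proportional to the size of the whole repo.
            if name.endswith(".dvc") or name == "dvc.yaml":
                yield os.path.join(dirpath, name)
```

Calling list(find_dvc_files(".")) in a repo with many unrelated files touches every one of them, which is the overhead this PR tries to make optional.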

Not much in this PR yet, but I'm pushing it up as a DRAFT while I work on it.

Erotemic commented 1 month ago

@dberenbaum You mentioned I would want to skip repo.find_outs_by_path in dvc/repo/add.py under get_or_create_stage:

def get_or_create_stage(
    repo: "Repo",
    target: str,
    out: Optional[str] = None,
    to_remote: bool = False,
    force: bool = False,
) -> StageInfo:
    if out:
        target = resolve_output(target, out, force=force)
    path, wdir, out = resolve_paths(repo, target, always_local=to_remote and not out)

    try:
        (out_obj,) = repo.find_outs_by_path(target, strict=False)
        stage = out_obj.stage
        if not stage.is_data_source:
            raise DvcException(...omit...)
        return StageInfo(stage, output_exists=True)
    except OutputNotFoundError:
        stage = repo.stage.create(...omit...)
        return StageInfo(stage, output_exists=False)

I don't have a strong understanding of what a "Stage" is, so it's unclear what the correct mechanism for skipping that line is, since it defines the "stage" object used in the subsequent lines. I could just raise an OutputNotFoundError inside the try block before that line is reached, which would fall through to the except branch and call repo.stage.create. I'm not sure whether that is desirable.
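A rough sketch of that short-circuit, using stand-ins so it runs on its own. Everything here is hypothetical: skip_graph_checks is not an existing DVC option, OutputNotFoundError and StageInfo are local stand-ins for the DVC names, and the two callables take the place of repo.find_outs_by_path and repo.stage.create.

```python
from dataclasses import dataclass
from typing import Callable


class OutputNotFoundError(Exception):
    """Local stand-in for DVC's exception of the same name."""


@dataclass
class StageInfo:
    """Local stand-in for the StageInfo returned by get_or_create_stage."""

    stage: object
    output_exists: bool


def get_or_create_stage_sketch(
    find_outs_by_path: Callable[[str], list],  # stand-in for repo.find_outs_by_path
    create_stage: Callable[[str], object],  # stand-in for repo.stage.create
    target: str,
    skip_graph_checks: bool = False,  # hypothetical flag, not a DVC option
) -> StageInfo:
    try:
        if skip_graph_checks:
            # Raise inside the try block so the except branch below handles
            # it and the expensive lookup is never called.
            raise OutputNotFoundError(target)
        (out_obj,) = find_outs_by_path(target)
        return StageInfo(out_obj.stage, output_exists=True)
    except OutputNotFoundError:
        # Same fallback as the original code: create a fresh stage.
        return StageInfo(create_stage(target), output_exists=False)
```

One consequence visible in the sketch: with the flag set, a target that already has a stage is treated as new (output_exists=False), so whatever relies on reusing the existing stage would be bypassed.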

I found some docs describing stages. I think in my use case there are never any stages, as they seem to correspond to steps in some pipeline. In the simple case where you only ever use add, push, and pull, is it correct that there are no stages? Any guidance on the right way to short-circuit this lookup?