Open Erotemic opened 1 month ago
@dberenbaum You mentioned I would want to skip repo.find_outs_by_path in dvc/repo/add.py under get_or_create_stage:
    def get_or_create_stage(
        repo: "Repo",
        target: str,
        out: Optional[str] = None,
        to_remote: bool = False,
        force: bool = False,
    ) -> StageInfo:
        if out:
            target = resolve_output(target, out, force=force)
        path, wdir, out = resolve_paths(repo, target, always_local=to_remote and not out)
        try:
            (out_obj,) = repo.find_outs_by_path(target, strict=False)
            stage = out_obj.stage
            if not stage.is_data_source:
                raise DvcException(...omit...)
            return StageInfo(stage, output_exists=True)
        except OutputNotFoundError:
            stage = repo.stage.create(...omit...)
            return StageInfo(stage, output_exists=False)
I don't have a strong understanding of what a "Stage" is, so it's unclear what the correct mechanism for skipping that line is, as it defines the "stage" object used in the subsequent lines. I could just raise an OutputNotFoundError exception before calling that line, which would then fall through to repo.stage.create. I'm not sure if that is desirable or not.
I found some docs describing stages. I think in my use case there are never any stages, as they seem to correspond to steps in some pipeline. In the simple case where you only ever use add, push, and pull, is it correct that there are no stages? Any guidance on the right way to short-circuit this?
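To make the short-circuit idea concrete, here is a minimal, self-contained sketch of the control flow only. It does not use DVC itself: OutputNotFoundError and StageInfo are stand-ins for the real classes, find_stage and create_stage are hypothetical callables standing in for repo.find_outs_by_path and repo.stage.create, and skip_lookup is a hypothetical flag that is not an existing DVC option. Rather than raising the exception by hand, the sketch simply bypasses the lookup and goes straight to the "create" branch.

```python
from dataclasses import dataclass
from typing import Callable, List


class OutputNotFoundError(Exception):
    """Stand-in for DVC's exception of the same name."""


@dataclass
class StageInfo:
    """Stand-in for the StageInfo returned by get_or_create_stage."""
    stage: str
    output_exists: bool


def get_or_create_stage_sketch(
    target: str,
    find_stage: Callable[[str], str],    # stand-in for repo.find_outs_by_path
    create_stage: Callable[[str], str],  # stand-in for repo.stage.create
    skip_lookup: bool = False,           # hypothetical flag, not a DVC option
) -> StageInfo:
    if not skip_lookup:
        try:
            # Potentially expensive: the real lookup may walk the whole repo.
            return StageInfo(find_stage(target), output_exists=True)
        except OutputNotFoundError:
            pass  # fall through and create a new stage
    return StageInfo(create_stage(target), output_exists=False)


# Usage: record whether the (expensive) lookup was ever invoked.
lookup_calls: List[str] = []


def slow_find(target: str) -> str:
    lookup_calls.append(target)
    raise OutputNotFoundError(target)


def create(target: str) -> str:
    return f"{target}.dvc"


fast = get_or_create_stage_sketch("data.csv", slow_find, create, skip_lookup=True)
calls_after_fast = list(lookup_calls)
slow = get_or_create_stage_sketch("data.csv", slow_find, create)
```

One caveat with this shape: when the lookup is skipped, the function always reports output_exists=False, even if the target is already tracked. Whether that is acceptable is exactly the open question above.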
A PR where I'm going to experiment with solutions to the issue discussed in https://github.com/iterative/dvc/discussions/10415
The basic idea is that running dvc add (and other commands) walks the entire non-DVC-indexed repo to look for DVC files. This is to handle granular modifications, but handling that use case may not always be necessary, and it would be nice to be able to disable these checks. That would allow the runtime of dvc add and other similar operations to be independent of the other files indexed in the repo.

Not much in this PR yet, but I'm pushing it up as a DRAFT while I work on it.
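As a rough illustration of why a full walk scales with repo size, the sketch below scans a directory tree for files ending in .dvc and counts how many files it had to visit. This is not DVC's actual implementation; find_dvc_files and demo are hypothetical names, and the temporary directory layout is made up for the example.

```python
import os
import tempfile
from typing import List, Tuple


def find_dvc_files(root: str) -> Tuple[List[str], int]:
    """Return (relative paths of *.dvc files, number of files visited)."""
    found: List[str] = []
    visited = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            visited += 1  # every file is touched, tracked by DVC or not
            if name.endswith(".dvc"):
                full = os.path.join(dirpath, name)
                found.append(os.path.relpath(full, root))
    return sorted(found), visited


def demo() -> Tuple[List[str], int]:
    """Build a toy repo: 50 untracked-looking files plus one .dvc file."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "raw"))
        for i in range(50):
            open(os.path.join(root, "raw", f"img_{i}.png"), "w").close()
        open(os.path.join(root, "raw.dvc"), "w").close()
        return find_dvc_files(root)
```

In this toy layout the walk visits every file to find a single .dvc file, so the cost grows with the total number of files rather than with the number of targets being added.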