skshetry closed this 2 months ago
> DVC uses move-then-checkout logic. It moves the file from the workspace to the cache and then checks it out again, rather than just using copy.

Wasn't this intended to enforce the cache link type? I guess it would make sense in the case of copy, but what about the other link types?
For other links, the one I suggested was to change copy behaviour to be move + link that works atomically.
@efiop also suggested using hardlinks instead.
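For illustration, the hardlink idea could look roughly like the sketch below. This is not DVC's code: `cache_file`, `file_md5`, and the `<cache>/<first two hex chars>/<rest>` layout are hypothetical names and assumptions made up for the example. The point is only that the workspace file is never removed, so a failure midway cannot lose data, and the final `os.replace` makes the cache entry appear atomically.

```python
import hashlib
import os
import shutil
import uuid


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_file(src, cache_dir):
    """Put src into cache_dir without ever removing src (hypothetical helper)."""
    checksum = file_md5(src)
    dest_dir = os.path.join(cache_dir, checksum[:2])  # assumed layout
    dest = os.path.join(dest_dir, checksum[2:])
    if os.path.exists(dest):
        return dest  # already cached, nothing to do

    os.makedirs(dest_dir, exist_ok=True)
    # Temporary name in the same directory, so the final rename stays on
    # one filesystem and can be atomic.
    tmp = os.path.join(dest_dir, f".{uuid.uuid4().hex}.tmp")
    try:
        try:
            os.link(src, tmp)  # hardlink: no data is copied
        except OSError:
            shutil.copy2(src, tmp)  # e.g. cross-device: fall back to a copy
        os.replace(tmp, dest)  # atomic rename into the final cache path
    finally:
        if os.path.exists(tmp):
            os.unlink(tmp)  # only reached if something failed before replace
    return dest
```

One trade-off of the hardlink path: the workspace file and the cache entry share the same data, so modifying the workspace file in place would also modify the cached copy.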
@skshetry Do you think we should include this as part of the data epic?
Closed by
and released in https://github.com/iterative/dvc/releases/tag/3.54.0.
[x] Repeated `dvc add` is not skipped. In 1.X, it would have been skipped. DVC also still deletes the file and tries to restore it from the cache, making it slower.
[x] DVC uses move-then-checkout logic. It moves the file from the workspace to the cache and then checks it out again, rather than just using copy.
This is slow and might result in data loss if it happens to fail in between the operations.
[x] DVC deletes the stage file before even adding those files. This means that if the `dvc add` operation fails, the existing pointer file is lost, which is the only way to get access to the data.
[x] DVC resets the stages multiple times (only if multiple targets are provided) and forces the stage recollection, which is slow.
[x] To the same effect, it resets the internal state of the repo after creating each stage, which also happens to reset dulwich's ignore manager, making it horribly slow if using too many targets (or `-R`).
https://github.com/iterative/dvc/blob/4e792ae61c5927ab2e5f6a6914d985d43aa705b4/dvc/repo/add.py#L266
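To make the stage-file item above concrete, here is a minimal sketch of the safer ordering: do the add first, and only then swap in the new pointer file with an atomic rename. It is not how DVC implements this; `replace_pointer` and its `do_add` callback are hypothetical names used only for the example.

```python
import os
import tempfile


def replace_pointer(pointer_path, new_text, do_add):
    """Run do_add() first; swap in the new pointer file only if it succeeds."""
    # If this raises, the existing pointer file has not been touched yet.
    do_add()

    directory = os.path.dirname(os.path.abspath(pointer_path))
    # Write next to the target so os.replace stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fobj:
            fobj.write(new_text)
        os.replace(tmp, pointer_path)  # old pointer is replaced atomically
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # clean up the partial temp file
        raise
```

With this ordering, a failed add leaves the previous pointer file in place, so the data it references stays reachable.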