add: performance and reliability issues

skshetry commented 3 years ago

[x] A repeated `dvc add` on the same, unchanged target is not skipped.
```
$ dvc add data
$ dvc add data
```
In 1.x, the second invocation would have been skipped. Now DVC still deletes the file and tries to restore it from the cache, which makes the repeated run slower.
[x] DVC uses move-then-checkout logic. It moves the file from the workspace to the cache and then checks it out again, rather than simply copying it.

This is slow and can result in data loss if the process fails between the two operations.
[x] DVC deletes the stage file before it has even added the files. If the `dvc add` operation then fails, the existing pointer file, which is the only way to get access to the data, is lost.
[x] DVC resets the stages multiple times (only when multiple targets are provided), which forces stage recollection and is slow.
[x] Similarly, it resets the internal state of the repo after creating each stage. That also resets dulwich's ignore manager, which makes `dvc add` horribly slow with many targets (or with `-R`).
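
The last two items can be illustrated with a small sketch. `FakeRepo`, `create_stage`, and both `add_*` functions are hypothetical stand-ins written for this example, not DVC's actual classes or API; the only point is to show how the number of resets scales with the number of targets.

```python
class FakeRepo:
    """Hypothetical stand-in for a repo object; not DVC's real class."""

    def __init__(self):
        self.reset_calls = 0

    def reset(self):
        # In a real repo this would drop cached stages and ignore rules,
        # which then have to be recollected on next access.
        self.reset_calls += 1

    def create_stage(self, target):
        # Pretend to create a pointer file for the target.
        return f"{target}.dvc"


def add_reset_per_target(repo, targets):
    """Pattern described in the issue: one reset after every stage."""
    stages = []
    for target in targets:
        stages.append(repo.create_stage(target))
        repo.reset()
    return stages


def add_reset_once(repo, targets):
    """Alternative: create all stages, then invalidate state a single time."""
    stages = [repo.create_stage(target) for target in targets]
    if stages:
        repo.reset()
    return stages
```

With N targets the first variant invalidates the repo state N times, the second only once, regardless of N.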

https://github.com/iterative/dvc/blob/4e792ae61c5927ab2e5f6a6914d985d43aa705b4/dvc/repo/add.py#L266
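
For the move-then-checkout and stage-file items in the checklist, a minimal sketch of a safer ordering might look like the following. It is an illustration under assumptions, not DVC's implementation: `safe_add`, the `.pointer` suffix, the flat cache layout, and the use of SHA-256 are all made up for the example. The idea is that the workspace file is never removed, the cache entry appears only once it is complete, and the old pointer file is replaced atomically instead of being deleted up front.

```python
import hashlib
import os
import shutil
import tempfile


def file_checksum(path, chunk_size=1024 * 1024):
    """Hash a file in chunks (SHA-256 chosen only for this sketch)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def atomic_copy(src, dst):
    """Copy src to a temp file next to dst, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(dst))
    os.makedirs(directory, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)
        os.replace(tmp, dst)  # atomic on the same filesystem
    finally:
        if os.path.exists(tmp):  # only true if something failed
            os.unlink(tmp)


def atomic_write_text(path, text):
    """Write text to a temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as fobj:
            fobj.write(text)
        os.replace(tmp, path)
    finally:
        if os.path.exists(tmp):
            os.unlink(tmp)


def safe_add(path, cache_dir):
    """Return True if the pointer was (re)written, False if add was skipped."""
    pointer = path + ".pointer"  # hypothetical pointer-file name
    checksum = file_checksum(path)
    cached = os.path.join(cache_dir, checksum)

    # 1. Populate the cache by copying; the workspace file stays in place.
    if not os.path.exists(cached):
        atomic_copy(path, cached)

    # 2. Skip the rest if the existing pointer already records this checksum.
    if os.path.exists(pointer):
        with open(pointer) as fobj:
            if fobj.read().strip() == checksum:
                return False

    # 3. Replace the pointer atomically; the old one survives any failure.
    atomic_write_text(pointer, checksum + "\n")
    return True
```

If any step raises, the workspace file and the previous pointer file are both still intact. Note that this sketch still rehashes the file on every call; skipping that work as well would need some record of file size and modification time, which is left out here.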

pared commented 3 years ago

> DVC uses move-then-checkout logic. It moves the file from the workspace to the cache and then checks it out again, rather than just using copy.

Wasn't this intended to enforce the cache link type? I guess in the case of copy it would make sense, but what about the other link types?

skshetry commented 3 years ago

For the other link types, what I suggested was to change the copy behaviour into a move + link that works atomically. @efiop also suggested using hardlinks instead.
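
One possible reading of the hardlink suggestion, sketched under assumptions: `link_into_cache` is a hypothetical helper, not the code that was eventually merged, and it assumes the workspace and cache live on the same filesystem and that the caller has already verified that `cached` (if it exists) holds the same content as `path`. At every point the workspace path refers to a complete file.

```python
import os
import uuid


def link_into_cache(path, cached):
    """Make `path` and `cached` two names for the same data via hardlinks."""
    cache_dir = os.path.dirname(os.path.abspath(cached))
    os.makedirs(cache_dir, exist_ok=True)

    if not os.path.exists(cached):
        # Creating a hardlink is atomic: the cache entry either exists with
        # the full data or does not exist at all. Nothing is moved or copied.
        os.link(path, cached)
        return

    if os.path.samefile(path, cached):
        # Already linked; renaming a link over its own target is a no-op
        # on POSIX and would leave the temporary name behind.
        return

    # The cache already has this content under a different inode: build a
    # new link under a temporary name, then swap it in atomically.
    workspace_dir = os.path.dirname(os.path.abspath(path))
    tmp = os.path.join(workspace_dir, f".tmp-{uuid.uuid4().hex}")
    os.link(cached, tmp)
    try:
        os.replace(tmp, path)
    finally:
        if os.path.exists(tmp):  # only true if the replace failed
            os.unlink(tmp)
```

A hardlinked workspace file shares its data with the cache entry, so editing it in place would also change the cached copy; any real implementation has to protect against that, for example by making the file read-only.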

dberenbaum commented 2 years ago

@skshetry Do you think we should include this as part of the data epic?

skshetry commented 2 months ago

Closed by

and, released in https://github.com/iterative/dvc/releases/tag/3.54.0.

iterative/dvc #6227