Closed: nsorros closed this issue 8 months ago
Hi @nsorros, thank you for the detailed report.
A few questions about your points:

- does `dvc status` always take a long time to run or just when the `embeddings` directory has been modified? If it is the latter case, an upcoming optimization (#7390) should speed up status considerably in this case
- could you provide some more information? For example a report with the verbose flag: `dvc repro -v`
- Might be related to 1.
- How many CPU cores do you have? `python -c 'import os; print(os.cpu_count())'`
> I also wonder what the recommended way to work in these situations is. For one, it seems that some or all hooks should be dropped. Then, would it be quicker if the user zips the files so that only the hash of the zip is calculated? Is there another workaround to speed up the hash calculation? The only solution I see at the moment is removing the outs or the stage altogether.
Creating an archive (zip, tar or gzip) and tracking it as an out instead of tracking it as a directory would speed up dvc considerably, since it would not require dealing with 1M+ objects. You could track the archive as an out, and stages that require the directory (now an archive file) as a dep could use it like so:
```python
from zipfile import ZipFile

zip_file = "/path/to/zip"
with ZipFile(zip_file) as archive:
    for file_name in archive.namelist():
        with archive.open(file_name) as fh:
            data = fh.read()
            # do something with data
```
Of course, this approach will not always be possible, depending on how you need to use the directory contents.
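On the producing side, the stage that currently writes millions of individual files would instead write a single archive and declare that archive as its out. A minimal sketch, assuming the embeddings are numpy vectors; the path, chunk count and vector size here are made-up placeholders:

```python
import io
from zipfile import ZipFile

import numpy as np

archive_path = "embeddings.zip"  # hypothetical out of the stage
n_chunks = 1_000                 # placeholder for the real number of chunks

with ZipFile(archive_path, "w") as archive:
    for i in range(n_chunks):
        vector = np.random.rand(768).astype("float32")  # placeholder embedding
        buffer = io.BytesIO()
        np.save(buffer, vector)
        # store each embedding as a .npy member inside the single archive
        archive.writestr(f"embedding_{i}.npy", buffer.getvalue())
```

DVC then only has to hash and transfer one file, which avoids the per-object overhead mentioned above.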
> - does `dvc status` always take a long time to run or just when the `embeddings` directory has been modified? If it is the latter case, an upcoming optimization (status: "recalculating" hashes each call #7390) should speed up status considerably in this case

It always takes a long time since, as I understand it, it recalculates the hashes to check whether something has changed.
> - could you provide some more information? For example a report with the verbose flag: `dvc repro -v`

This might be difficult as the actual process that fails takes hours to complete, but I will try to reproduce the problem in a different script to give you more information.
> - Might be related to 1.

I think so, yes.
> - How many CPU cores do you have? `python -c 'import os; print(os.cpu_count())'`

4 on the AWS instance (it's a GPU instance) and 8 locally (Apple M1).
> Creating an archive (zip, tar or gzip) and tracking it as an out instead of tracking it as a directory would speed up dvc considerably, since it would not require dealing with 1M+ objects. You could track the archive as an out, and stages that require the directory (now an archive file) as a dep could use it like so:
I will try the zip approach to see how much it speeds things up and come back.
> Other than the actual problems
We don't have capacity to work on this in the short to medium term. Also, this item is not very actionable and we have other focused tickets regarding this.
Closing for now.
Description
We are experiencing some issues with DVC in a task that produces 3M files as an output. For context, these are embeddings from chunks of documents. In this situation some commands error while others take a lot of time to complete, which makes working with dvc not an option. To be fair, producing 3M files that need to be hashed every time is understandably above the limits DVC expects.
I have not been able to reproduce all problems below but let me mention them briefly:

1. `dvc status` takes 20+ minutes to calculate hashes
2. `dvc repro` fails to complete. The command finishes fine but some step after creates an invisible error
3. `git commit` with the pre-commit hook takes minutes since it checks the hashes before switching branch
4. `dvc pull` throws `ERROR: failed to transfer 'md5: xxx' - Could not connect to the endpoint URL: xxx` for a lot of files
5. `git push` with the pre-push hook takes minutes, so the connection to GitHub is lost as dvc is pushing files

For 3 I ended up removing the pre-commit hook. For 4 I had to increase the file number limit with `ulimit -n 1024`. For 5 I ran `dvc push` before `git push`. For 2 I am not sure what caused the error; it could be related to the number of open files, but I am still investigating.

To reproduce, I wrote a simple script that produces 1M random numpy vectors and saves them. I am including it below.
I noticed that `dvc repro` takes minutes, sometimes hours to complete even when it does not run the command because the stage is cached. I wonder whether DVC should throw a ⚠️ warning in cases where a user runs a command that pushes it outside some limits, for example 100K files. This warning could be thrown when DVC starts calculating hashes and it could point to a troubleshooting page for working with many files.

I also wonder what the recommended way to work in these situations is. For one, it seems that some or all hooks should be dropped. Then, would it be quicker if the user zips the files so that only the hash of the zip is calculated? Is there another workaround to speed up the hash calculation? The only solution I see at the moment is removing the outs or the stage altogether.
Finally, another suggestion related to 4 is that the problem seems to be about too many open files, but the pointer to the troubleshooting guide only came at the end. The error itself was confusing in that it seemed like the remote was not working properly. If DVC can detect that too many files are open and change the error accordingly, this would be helpful, because if someone stops the operation early (as I was doing at first) they never get to see the recommendation at the end which points to the right solution.

Reproduce
scale.py
dvc.yaml
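The actual contents of scale.py and dvc.yaml were attached to the original issue; as a rough illustration of the kind of reproduction script described above (1M random numpy vectors written to a directory), a minimal sketch is shown below. The output path, vector size, and file count are placeholder assumptions, not the values from the real scale.py:

```python
import os

import numpy as np

# Placeholder values; the real script may differ.
OUT_DIR = "embeddings"
N_FILES = 1_000_000
DIM = 768

os.makedirs(OUT_DIR, exist_ok=True)
for i in range(N_FILES):
    vector = np.random.rand(DIM).astype("float32")
    np.save(os.path.join(OUT_DIR, f"{i}.npy"), vector)
```

A corresponding dvc.yaml stage would run this script and declare the embeddings directory as an out, which is what triggers the directory hashing discussed above.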
Expected
`dvc repro` could throw a warning at the point where it would start calculating hashes. Same for `dvc status`.

WARNING: Calculating 1M hashes is expected to be slow. Here are some tips on how to work with a lot of files: LINK
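Purely to illustrate the kind of check being suggested (this is not DVC's actual code; the threshold and names are invented for the example):

```python
FILE_COUNT_WARNING_THRESHOLD = 100_000  # hypothetical limit


def maybe_warn_before_hashing(file_count: int) -> None:
    """Warn before starting an expensive hashing pass over a very large directory."""
    if file_count > FILE_COUNT_WARNING_THRESHOLD:
        print(
            f"WARNING: Calculating {file_count} hashes is expected to be slow. "
            "Here are some tips on how to work with a lot of files: LINK"
        )
```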
Environment information
Output of `dvc doctor`: