Closed: nsorros closed this issue 8 months ago
Hi @nsorros, thank you for the detailed report.
A few questions about your points:

- does `dvc status` always take a long time to run or just when the `embeddings` directory has been modified? If it is the latter case, an upcoming optimization (#7390) should speed up status considerably in this case
- could you provide some more information? For example a report with the verbose flag: `dvc repro -v`
- Might be related to 1.
- How many CPU cores do you have? `python -c 'import os; print(os.cpu_count())'`
> I also wonder what the recommended way to work in these situations is. For one, it seems that some or all hooks should be dropped. Then, would it be quicker if the user zips the files so that only the hash of the zip is calculated? Is there another workaround to speed up the hash calculation? The only solution I see at the moment is removing the outs or the stage altogether.
Creating an archive (zip, tar or gzip) and tracking it as an out instead of tracking it as a directory would speed up dvc considerably, since it would not require dealing with 1M+ objects. You could track the archive as an out, and stages that require the directory (now an archive file) as a dep could use it like so:
```python
from zipfile import ZipFile

zip_file = "/path/to/zip"
with ZipFile(zip_file) as archive:
    for file_name in archive.namelist():
        with archive.open(file_name) as fh:
            data = fh.read()
            # do something with data
```
Of course, this approach will not always be possible, depending on how you need to use the directory contents.
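On the producing side, the stage that currently writes millions of individual files would instead write a single archive and declare that archive as its out. A minimal sketch, assuming the embeddings are numpy vectors; the path, chunk count and vector size here are made-up placeholders:

```python
import io
from zipfile import ZipFile

import numpy as np

archive_path = "embeddings.zip"  # hypothetical out of the stage
n_chunks = 1_000                 # placeholder for the real number of chunks

with ZipFile(archive_path, "w") as archive:
    for i in range(n_chunks):
        vector = np.random.rand(768).astype("float32")  # placeholder embedding
        buffer = io.BytesIO()
        np.save(buffer, vector)
        # store each embedding as a .npy member inside the single archive
        archive.writestr(f"embedding_{i}.npy", buffer.getvalue())
```

DVC then only has to hash and transfer one file, which avoids the per-object overhead mentioned above.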
> - does `dvc status` always take a long time to run or just when the `embeddings` directory has been modified? If it is the latter case, an upcoming optimization (status: "recalculating" hashes each call #7390) should speed up status considerably in this case

It always takes a long time since, as I understand it, it recalculates the hashes to check whether something has changed.
> - could you provide some more information? For example a report with the verbose flag: `dvc repro -v`

This might be difficult as the actual process that fails takes hours to complete, but I will try to reproduce the problem in a different script to give you more information.
> - Might be related to 1.

I think so, yes.
> - How many CPU cores do you have? `python -c 'import os; print(os.cpu_count())'`

4 on the AWS instance (it's a GPU instance) and 8 locally (Apple M1).
> Creating an archive (zip, tar or gzip) and tracking it as an out instead of tracking it as a directory would speed up dvc considerably, since it would not require dealing with 1M+ objects. You could track the archive as an out, and stages that require the directory (now an archive file) as a dep could use it like so:
I will try the zip approach to see how much it speeds things up and come back.
> Other than the actual problems
We don't have capacity to work on this in the short to medium term. Also, this item is not very actionable and we have other focused tickets regarding this.
Closing for now.
Description
We are experiencing some issues with DVC in a task that produces 3M files as an output. For context, these are embeddings from chunks of documents. In this situation some commands error while others take a lot of time to complete, which makes working with dvc not an option. To be fair, producing 3M files that need to be hashed every time is understandably above the limits DVC expects.
I have not been able to reproduce all problems below but let me mention them briefly:

1. `dvc status` takes 20+ minutes to calculate hashes
2. `dvc repro` fails to complete. The command finishes fine but some step after creates an invisible error
3. `git commit` with the pre-commit hook takes minutes since it checks the hashes before switching branch
4. `dvc pull` throws `ERROR: failed to transfer 'md5: xxx' - Could not connect to the endpoint URL: xxx` for a lot of files
5. `git push` with the pre-push hook takes minutes, so the connection to GitHub is lost as dvc is pushing files

For 3 I ended up removing the pre-commit hook. For 4 I had to increase the file number limit with `ulimit -n 1024`. For 5 I ran `dvc push` before `git push`. For 2 I am not sure what caused the error; it could be related to the number of open files, but I am still investigating.

To reproduce, I wrote a simple script that produces 1M random numpy vectors and saves them. I am including it below.
I noticed that `dvc repro` takes minutes, sometimes hours to complete even when it does not run the command because the stage is cached. I wonder whether DVC should throw a ⚠️ warning in cases where a user runs a command that pushes it outside some limits, for example 100K files. This warning could be thrown when DVC starts calculating hashes and it could point to a troubleshooting page for working with many files.

I also wonder what the recommended way to work in these situations is. For one, it seems that some or all hooks should be dropped. Then, would it be quicker if the user zips the files so that only the hash of the zip is calculated? Is there another workaround to speed up the hash calculation? The only solution I see at the moment is removing the outs or the stage altogether.
Finally, another suggestion related to 4 is that the problem seems to be about too many open files, but the pointer to the troubleshooting guide only came at the end. The error itself was confusing in that it seemed like the remote was not working properly. If DVC can detect that too many files are open and change the error accordingly, this would be helpful, because if someone stops the operation early (as I was doing at first) they never get to see the recommendation at the end which points to the right solution.

Reproduce
scale.py
dvc.yaml
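The actual contents of scale.py and dvc.yaml were attached to the original issue; as a rough illustration of the kind of reproduction script described above (1M random numpy vectors written to a directory), a minimal sketch is shown below. The output path, vector size, and file count are placeholder assumptions, not the values from the real scale.py:

```python
import os

import numpy as np

# Placeholder values; the real script may differ.
OUT_DIR = "embeddings"
N_FILES = 1_000_000
DIM = 768

os.makedirs(OUT_DIR, exist_ok=True)
for i in range(N_FILES):
    vector = np.random.rand(DIM).astype("float32")
    np.save(os.path.join(OUT_DIR, f"{i}.npy"), vector)
```

A corresponding dvc.yaml stage would run this script and declare the embeddings directory as an out, which is what triggers the directory hashing discussed above.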
Expected
`dvc repro` could throw a warning at the point where it would start calculating hashes. Same for `dvc status`.

WARNING: Calculating 1M hashes is expected to be slow. Here are some tips on how to work with a lot of files: LINK
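Purely to illustrate the kind of check being suggested (this is not DVC's actual code; the threshold and names are invented for the example):

```python
FILE_COUNT_WARNING_THRESHOLD = 100_000  # hypothetical limit


def maybe_warn_before_hashing(file_count: int) -> None:
    """Warn before starting an expensive hashing pass over a very large directory."""
    if file_count > FILE_COUNT_WARNING_THRESHOLD:
        print(
            f"WARNING: Calculating {file_count} hashes is expected to be slow. "
            "Here are some tips on how to work with a lot of files: LINK"
        )
```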
Environment information
Output of `dvc doctor`: