Closed Ykid closed 7 months ago
current workaround: use multiple terminals to do dvc add
update: can't do so because there's a repo-level lock.
Hi @Ykid !
current workaround: use multiple terminals to do dvc add
Not sure how that is going to help to parallelize a single dvc add dir. Or maybe you meant dvc add dir1 dir2 ...?
@efiop , it is dvc add dir_1 dir_2 dir_n
(cleared it up in the Discord conversation, forgot to update it here)
@Ykid Also, what was dvc showing when you were looking at CPU utilization? We do run things like checksum computation in parallel: https://github.com/iterative/dvc/blob/master/dvc/remote/base.py#L178 .
@efiop , from the discord convo again, @Ykid mentioned 1 CPU. Not sure in what part of the process, didn't ask :see_no_evil:
@efiop this is what I got
ps aux | grep dvc
> some-user 1980 99.2 3.9 3483664 2634612 pts/14 Rl+ 13:57 56:44 python dvc add dir1
@Ykid Yes, but what was it doing at that time? Meaning, what was it showing to stdout? Was it saying that it was computing checksums, or maybe that it was linking?
should be copying stuff
For the record: decided to try out hardlinks or symlinks as a workaround.
For the record: linking is still single-threaded in dvc, so need to look into parallelizing it as well.
@efiop might be a part of, or a follow-up to, the checkout refactor
I agree with Ykid.
As with dvc fetch and other commands, I hope that dvc add gets a --jobs option that allows it to execute multi-threaded processing.
It took over 15 minutes when I ran dvc add dir1. dir1 has over 100,000 files.
I'm running DVC 2.10.2, and checksum calculation for dvc add, dvc status, and dvc commit takes forever. I'm running this on a directory that has a tree-ish structure with 130 directories and ~150 files totaling 1.1GB. (It's the result of a Keras Tuner Hyperband experiment. Definitely a topical thing for DVC to handle.)
From the above thread I'm under the impression that checksum calculations are done in parallel? I'm monitoring top while the checksums are calculating and I only ever see one instance of dvc. Am I missing something?
Hi @mvonpohle , you can pip install viztracer and run dvc commit --viztracer-depth 8. You should see in the generated profile that multiple threads are being used.
Ah, okay. I was looking for processes instead of threads. 😝 Thanks for the clarification!
Do you know if there's a way to give dvc more resources for these commands? I'd think checksum calculations would be ripe for aggressive parallelization.
@mvonpohle, there is a core.checksum_jobs config option that you can set.
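For example, the option can be set with dvc config (the value 8 here is just an illustrative worker count, not a recommendation):

```shell
# Raise the number of threads used for checksum computation
dvc config core.checksum_jobs 8

# Print the current value to confirm the setting
dvc config core.checksum_jobs
```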
Closing this issue, as add already uses multithreaded checksum calculation, and transfer to cache is also multithreaded. The only part that is not multithreaded is checkout, which will get more of a boost from upcoming diff optimizations than from being multithreaded (multithreaded checkout is still a goal, but it won't happen in the short term).
Hello all, I wonder whether it is still possible to make a multiprocessed version of dvc add, to utilize more than one CPU core in a system, which should further speed up md5 calculations. Multithreading only speeds up md5 calculations within the single process that dvc add currently uses. Multiprocessing would be a great and necessary improvement to dvc add for large datasets.
This for utilizing more than one CPU core in a system, which should further speed up md5 calculations.
As far as I understand, md5 calculation should be happening in native C code and thus is not affected by the global interpreter lock. I would expect more or less the same performance in this case from a multithreaded vs. a multiprocess implementation. @ivartz do you have any specific tests, or maybe a perf profile, or even an htop screencast, that can show that it's not being parallelized at the moment?
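To illustrate the point about the GIL (a minimal standalone sketch, not DVC's actual code): CPython's hashlib.md5 releases the GIL while hashing buffers larger than a couple of kilobytes, so a plain thread pool can keep several cores busy without multiprocessing:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def md5_hex(data: bytes) -> str:
    # hashlib.md5 runs in native C and releases the GIL for large
    # buffers, so several threads can hash truly concurrently.
    return hashlib.md5(data).hexdigest()

# Eight 1 MB buffers standing in for file contents
chunks = [bytes([i]) * 1_000_000 for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(md5_hex, chunks))

# Same result as hashing sequentially, just spread across threads
assert digests == [hashlib.md5(c).hexdigest() for c in chunks]
```

Watching htop while a script like this runs on large files should show more than one core active, which is a quick way to check whether the hashing stage itself is the bottleneck.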
Thanks for the quick reply. The step I'm referring to is "Building data objects". Here is a test scenario: I have a large dataset (~100GB) with many small files (500KB each). I would like to add the dataset like this, using a local remote and no caching:
dvc remote add testlocalremote /home/ivar/Documents/testlocalremote
dvc add --to-remote --remote testlocalremote sourcedata/treatment-deid
Htop when "Building data objects":
If there were no global lock for the steps in dvc add, something like the following bash script, which runs multiple dvc add commands in parallel, should work:
#!/usr/bin/env bash
# Usage: bash dvc-add-dataset-mp.sh <dataset_root> <nprocs>
# Parallelized DVC add on sessions.
# Assumes the dataset root directory contains subjects/sessions in a
# hierarchical structure, e.g. sub-01/ses-01 etc.
# Example:
#   bash dvc-add-dataset-mp.sh sourcedata/treatment-deid 32

readarray -t sess < <(find "$1" -mindepth 2 -maxdepth 2 -type d)
numsess=${#sess[*]}
nprocs=$2
process=0
while [[ $process -lt $numsess ]]
do
    # Launch up to $nprocs dvc add processes in the background
    for ((i=process; i<process+nprocs; ++i))
    do
        if [ "$i" -lt "$numsess" ]
        then
            ses=${sess[$i]}
            dvc add --to-remote --remote testlocalremote "$ses" &
            pids[$i]=$!
        fi
    done
    # Wait for this batch to finish before starting the next
    for pid in ${pids[*]}
    do
        wait "$pid"
    done
    process=$((process + nprocs))
done
However, this script gets multiple locking errors. I'm not able to reproduce these errors right now, but it is clear that this way of using dvc add (dvc version 2.43.1) does not work correctly at this time.
Optimally, what I would like to achieve is to have dvc add utilize as many CPU cores as possible when running dvc add --to-remote --remote testlocalremote sourcedata/treatment-deid (or dvc add sourcedata/treatment-deid), with the objective of speeding up the command. I assume that from a technical side this should be possible, and it would significantly benefit most users with more than one CPU core/thread.
As far as I understand md5 calculation should be happening in the native C code and thus is not affected by the global lock.
md5 calculation is not correctly isolated; it is currently bundled with other internal operations. Multiprocessing could indeed speed up dvc add, but the current logic would need to be refactored to parallelize the parts that can benefit from it.
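As a rough sketch of what the multiprocess variant could look like once hashing is isolated (this is purely illustrative; file_md5 and hash_files are hypothetical helpers, not DVC internals):

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor

def file_md5(path: str) -> str:
    # Hash a file in 1 MiB chunks to bound memory use.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_files(paths, workers=4):
    # Each worker process hashes files independently, using a
    # separate CPU core and no shared interpreter lock.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(file_md5, paths)))

if __name__ == "__main__":
    # Guard required on platforms that spawn worker processes
    import sys
    print(hash_files(sys.argv[1:]))
```

The refactoring cost mentioned above comes from the fact that the hashing step would first have to be separated cleanly from the other internal operations it is currently bundled with.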
thanks @ivartz !
Your intention makes total sense. What I'm trying to understand better is why it doesn't utilize multiple cores now, and whether this ticket should be reopened (even if we don't prioritize it right away).
md5 calculation is not correctly isolated and it is currently bundled with other internal operations. multiprocessing could indeed speed up dvc add but the current logic would need to be refactored to be able to parallelize the parts that can benefit from it
thanks @daavoo ! should we reopen the ticket then? Since, per @skshetry 's comment above, almost everything is parallel and we are waiting for some diff changes to do the last part?
Since per @skshetry 's comment above almost everything is parallel and we are waiting for some diff changes to do the last part?
That comment is outdated; "multithreaded checksum calculation" and "transfer to cache is also multithreaded" are no longer true.
As of today, it happens sequentially in a single thread:
We removed it in https://github.com/iterative/dvc-data/pull/53, because there was high overhead for small files, and running in a single thread was much faster than multithreaded hashing/building. The state is to blame here, of course.
Also since we are using mnist dataset as a benchmark, there's also a question of whether it's a good representative to optimize for.
much faster
@skshetry @efiop any data to support that decision, by chance? (there is nothing in the ticket 🤔 )
Also since we are using mnist dataset as a benchmark, there's also a question of whether it's a good representative to optimize for.
yes, I don't think it's very relevant tbh. But even in that case it's a bit surprising - what would be the underlying reason for the pool to be slower?
@shcheklein Can't find anything recorded in the ticket, but there are today's benchmark runs on the mnist dataset from dvc-bench (see pic below; 2.11 is where the changes rolled out).
The way we were working with state is suboptimal; a persistent index will be replacing it. add will be migrating to the index-based workflow (cloud versioning is already using it), and that logic is already based on async/threadpool: https://github.com/iterative/dvc-objects/blob/main/src/dvc_objects/fs/generic.py, thanks to great work by @pmrowla. With data status already using indexes and dvc diff almost migrated to it (https://github.com/iterative/dvc/pull/8930), dvc add will likely be next.
Thanks for reopening this issue!
The way we were working with state is suboptimal; a persistent index will be replacing it. add will be migrating to the index-based workflow (cloud versioning is already using it), and that logic is already based on async/threadpool: https://github.com/iterative/dvc-objects/blob/main/src/dvc_objects/fs/generic.py, thanks to great work by @pmrowla. With data status already using indexes and dvc diff almost migrated to it (#8930), dvc add will likely be next.
Does the async/threadpool here refer to using multiple threads within a single process, or will additional processes also be utilized by add when migrating to the index-based workflow? For the first option, it will likely not improve the speed of add, according to my previous test case?
yes, I don't think it's very relevant tbh. But even in that case it's a bit surprising - what would be the underlying reason for the pool to be slower?
mnist is not representative of my data case as described above, and I argue that my case would be common for many users of dvc. Perhaps it is possible to add some logic that lets the size or number of files determine whether to use multiprocessing or multithreading?
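One possible shape for such logic (a hypothetical sketch; choose_executor and the 1 MiB cutoff are made-up placeholders, not measured values or DVC code) is to pick the executor based on the average file size, since process startup and IPC overhead dominates when there are many tiny files:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Hypothetical threshold: below this average file size, per-process
# startup and IPC overhead makes processes a net loss (placeholder).
SMALL_FILE_CUTOFF = 1 << 20  # 1 MiB

def choose_executor(num_files: int, total_bytes: int):
    """Return the executor class a simple heuristic picks for hashing."""
    if num_files == 0:
        return ThreadPoolExecutor
    avg = total_bytes / num_files
    # Many small files: threads avoid per-process overhead.
    # Fewer large files: processes spread CPU-bound work over cores.
    return ThreadPoolExecutor if avg < SMALL_FILE_CUTOFF else ProcessPoolExecutor

# 100,000 files averaging 500 KB each -> threads win under this heuristic
print(choose_executor(100_000, 100_000 * 500_000).__name__)  # ThreadPoolExecutor
```

The right threshold would of course have to come from benchmarks like the dvc-bench runs mentioned above, not from a constant picked by hand.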
Thanks for all the replies. Looking forward to seeing whether the index-based workflow can speed up adding many small files (500KB each) contained within a directory (recursively) with a total size of 100GB.
@ivartz Sorry for the confusion. There are a few different ways we compute hashes (e.g. dvc add is not the same as dvc add --to-remote). And there is also transfer involved during dvc add, so I'm mixing things up myself as well.
We are filling up our dvc-bench project with some typical datasets, and "many small files" (like mnist) was (and kinda still is) a big pain point, so we added it and used it to make decisions at the current stage.
Your data seems to be structurally similar though, or am I missing something? 100GB of 500KB files means ~200K small files, which is not the same, but is kinda similar.
We can track this under #7607
One of the projects I did contains directories of more than 1M files. When I do dvc add dir1, it takes quite a while to run, and I saw that dvc only utilizes one CPU to do so. It would be nice if there were a parallelism option that could utilize multiple cores of the machine.