Closed Ykid closed 7 months ago
current workaround: use multiple terminals to do dvc add
update: can't do so because there's a repo-level lock.
Hi @Ykid !
current workaround: use multiple terminals to do dvc add
Not sure how that is going to help to parallelize a single dvc add dir. Or maybe you meant dvc add dir1 dir2 ...?
@efiop , it is dvc add dir_1 dir_2 dir_n
(cleared it up in the Discord conversation, forgot to update it here)
@Ykid Also, what was dvc showing when you were looking at CPU utilization? We do run things like checksum computation in parallel: https://github.com/iterative/dvc/blob/master/dvc/remote/base.py#L178 .
@efiop , from the discord convo again, @Ykid mentioned 1 CPU. Not sure in what part of the process, didn't ask :see_no_evil:
@efiop this is what I got
ps aux | grep dvc
> some-user 1980 99.2 3.9 3483664 2634612 pts/14 Rl+ 13:57 56:44 python dvc add dir1
@Ykid Yes, but what was it doing at that time? Meaning, what was it showing to stdout? Was it saying that it was computing checksums, or maybe that it was linking?
should be copying stuff
For the record: decided to try out hardlinks or symlinks as a workaround.
For the record: linking is still single-threaded in dvc, so need to look into parallelizing it as well.
@efiop might be a part of, or a follow-up to, the checkout refactor
I agree with Ykid.
As with dvc fetch and other commands, I hope that dvc add gets a --jobs option that allows it to execute multi-threaded processing.
It took over 15 minutes when I ran dvc add dir1. dir1 has over 100,000 files.
I'm running DVC 2.10.2, and checksum calculation for dvc add, dvc status, and dvc commit takes forever. I'm running this on a directory that has a tree-ish structure with 130 directories and ~150 files totaling 1.1GB. (It's the result of a Keras Tuner Hyperband experiment. Definitely a topical thing for DVC to handle.)
From the above thread I'm under the impression that checksum calculations are done in parallel? I'm monitoring top while the checksums are calculating and I only ever see one instance of dvc. Am I missing something?
Hi @mvonpohle , you can pip install viztracer and run dvc commit --viztracer-depth 8. You should see in the generated profile that multiple threads are being used.
Ah, okay. I was looking for processes instead of threads. 😝 Thanks for the clarification!
Do you know if there's a way to give dvc more resources for these commands? I'd think checksum calculations would be ripe for aggressive parallelization.
@mvonpohle, there is a core.checksum_jobs config option that you can set.
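For example, the option can be set with dvc config (the value 8 here is just an illustrative worker count, not a recommendation):

```shell
# Raise the number of threads used for checksum computation
dvc config core.checksum_jobs 8

# Print the current value to confirm the setting
dvc config core.checksum_jobs
```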
Closing this issue, as add already uses multithreaded checksum calculation, and transfer to cache is also multithreaded. The only part that is not multithreaded is checkout, which will get more of a boost from upcoming diff optimizations than from being multithreaded (multithreaded checkout is still a goal, but it won't happen in the short term).
Hello all, I wonder whether it is still possible to make a multiprocessed version of dvc add, to utilize more than one CPU core in a system, which should further speed up md5 calculations. Multithreading only speeds up md5 calculations within the single process that dvc add currently uses. Multiprocessing would be a great and necessary improvement to dvc add for large datasets.
This for utilizing more than one CPU core in a system, which should further speed up md5 calculations.
As far as I understand, md5 calculation should be happening in native C code and thus is not affected by the global interpreter lock. I would expect more or less the same performance in this case from a multithreaded vs. a multiprocess implementation. @ivartz do you have any specific tests, or maybe a perf profile, or even an htop screencast, that can show that it's not being parallelized at the moment?
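To illustrate the point about the GIL (a minimal standalone sketch, not DVC's actual code): CPython's hashlib.md5 releases the GIL while hashing buffers larger than a couple of kilobytes, so a plain thread pool can keep several cores busy without multiprocessing:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def md5_hex(data: bytes) -> str:
    # hashlib.md5 runs in native C and releases the GIL for large
    # buffers, so several threads can hash truly concurrently.
    return hashlib.md5(data).hexdigest()

# Eight 1 MB buffers standing in for file contents
chunks = [bytes([i]) * 1_000_000 for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(md5_hex, chunks))

# Same result as hashing sequentially, just spread across threads
assert digests == [hashlib.md5(c).hexdigest() for c in chunks]
```

Watching htop while a script like this runs on large files should show more than one core active, which is a quick way to check whether the hashing stage itself is the bottleneck.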
Thanks for the quick reply. The step I'm referring to is "Building data objects". Here is a test scenario: I have a large dataset (~100GB) with many small files (500KB each). I would like to add the dataset like this, using a local remote and no caching:
dvc remote add testlocalremote /home/ivar/Documents/testlocalremote
dvc add --to-remote --remote testlocalremote sourcedata/treatment-deid
Htop when "Building data objects":
If there were no global lock for the steps in dvc add, something like the following bash script, which runs multiple dvc add commands in parallel, should work:
#!/usr/bin/env bash
# Usage: bash dvc-add-dataset-mp.sh <dataset_root> <nprocs>
# Parallelized DVC add on sessions.
# Assumes the dataset root directory contains subjects/sessions in a
# hierarchical structure, e.g. sub-01/ses-01 etc.
# Example:
#   bash dvc-add-dataset-mp.sh sourcedata/treatment-deid 32

readarray -t sess < <(find "$1" -mindepth 2 -maxdepth 2 -type d)
numsess=${#sess[*]}
nprocs=$2
process=0
while [[ $process -lt $numsess ]]
do
    # Launch up to $nprocs dvc add processes in the background
    for ((i=process; i<process+nprocs; ++i))
    do
        if [ "$i" -lt "$numsess" ]
        then
            ses=${sess[$i]}
            dvc add --to-remote --remote testlocalremote "$ses" &
            pids[$i]=$!
        fi
    done
    # Wait for this batch to finish before starting the next
    for pid in ${pids[*]}
    do
        wait "$pid"
    done
    process=$((process + nprocs))
done
However, this script gets multiple locking errors. I'm not able to reproduce these errors right now, but it is clear that this way of using dvc add (dvc version 2.43.1) does not work correctly at this time.
Optimally, what I would like to achieve is to have dvc add utilize as many CPU cores as possible when running dvc add --to-remote --remote testlocalremote sourcedata/treatment-deid (or dvc add sourcedata/treatment-deid), with the objective of speeding up the command. I assume that from a technical side this should be possible, and it would significantly benefit most users with more than one CPU core/thread.
As far as I understand md5 calculation should be happening in the native C code and thus is not affected by the global lock.
md5 calculation is not correctly isolated; it is currently bundled with other internal operations. Multiprocessing could indeed speed up dvc add, but the current logic would need to be refactored to parallelize the parts that can benefit from it.
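As a rough sketch of what the multiprocess variant could look like once hashing is isolated (this is purely illustrative; file_md5 and hash_files are hypothetical helpers, not DVC internals):

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor

def file_md5(path: str) -> str:
    # Hash a file in 1 MiB chunks to bound memory use.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_files(paths, workers=4):
    # Each worker process hashes files independently, using a
    # separate CPU core and no shared interpreter lock.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(file_md5, paths)))

if __name__ == "__main__":
    # Guard required on platforms that spawn worker processes
    import sys
    print(hash_files(sys.argv[1:]))
```

The refactoring cost mentioned above comes from the fact that the hashing step would first have to be separated cleanly from the other internal operations it is currently bundled with.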
thanks @ivartz !
Your intention makes total sense. What I'm trying to understand better is why it doesn't utilize multiple cores now, and whether this ticket should be reopened (even if we don't prioritize it right away).
md5 calculation is not correctly isolated and it is currently bundled with other internal operations. multiprocessing could indeed speed up dvc add but the current logic would need to be refactored to be able to parallelize the parts that can benefit from it
thanks @daavoo ! should we reopen the ticket then? Since, per @skshetry 's comment above, almost everything is parallel and we are waiting for some diff changes to do the last part?
Since per @skshetry 's comment above almost everything is parallel and we are waiting for some diff changes to do the last part?
That comment is outdated; "multithreaded checksum calculation" and "transfer to cache is also multithreaded" are no longer true.
As of today, it happens sequentially in a single thread:
We removed it in https://github.com/iterative/dvc-data/pull/53, because there was high overhead for small files, and running in a single thread was much faster than multithreaded hashing/building. The state is to blame here, of course.
Also since we are using mnist dataset as a benchmark, there's also a question of whether it's a good representative to optimize for.
much faster
@skshetry @efiop any data to support that decision, by chance? (there is nothing in the ticket 🤔 )
Also since we are using mnist dataset as a benchmark, there's also a question of whether it's a good representative to optimize for.
yes, I don't think it's very relevant tbh. But even in that case it's a bit surprising - what would be the underlying reason for the pool to be slower?
@shcheklein Can't find anything recorded in the ticket, but there are today's benchmark runs on the mnist dataset from dvc-bench (see pic below; 2.11 is where the changes rolled out).
The way we were working with state is suboptimal; a persistent index will be replacing it. add will be migrating to the index-based workflow (cloud versioning is already using it), and that logic is already based on async/threadpool: https://github.com/iterative/dvc-objects/blob/main/src/dvc_objects/fs/generic.py, thanks to great work by @pmrowla. With data status already using indexes and dvc diff almost migrated to it (https://github.com/iterative/dvc/pull/8930), dvc add will likely be next.
Thanks for reopening this issue!
The way we were working with state is suboptimal; a persistent index will be replacing it. add will be migrating to the index-based workflow (cloud versioning is already using it), and that logic is already based on async/threadpool: https://github.com/iterative/dvc-objects/blob/main/src/dvc_objects/fs/generic.py, thanks to great work by @pmrowla. With data status already using indexes and dvc diff almost migrated to it (#8930), dvc add will likely be next.
Does the async/threadpool here refer to using multiple threads within a single process, or will additional processes also be utilized by add when migrating to the index-based workflow? For the first option, it will likely not improve the speed of add, according to my previous test case?
yes, I don't think it's very relevant tbh. But even in that case it's a bit surprising - what would be the underlying reason for the pool to be slower?
mnist is not representative of my data case as described above, and I argue that my case would be common for many users of dvc. Perhaps it is possible to add some logic that lets the size or number of files determine whether to use multiprocessing or multithreading?
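One possible shape for such logic (a hypothetical sketch; choose_executor and the 1 MiB cutoff are made-up placeholders, not measured values or DVC code) is to pick the executor based on the average file size, since process startup and IPC overhead dominates when there are many tiny files:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Hypothetical threshold: below this average file size, per-process
# startup and IPC overhead makes processes a net loss (placeholder).
SMALL_FILE_CUTOFF = 1 << 20  # 1 MiB

def choose_executor(num_files: int, total_bytes: int):
    """Return the executor class a simple heuristic picks for hashing."""
    if num_files == 0:
        return ThreadPoolExecutor
    avg = total_bytes / num_files
    # Many small files: threads avoid per-process overhead.
    # Fewer large files: processes spread CPU-bound work over cores.
    return ThreadPoolExecutor if avg < SMALL_FILE_CUTOFF else ProcessPoolExecutor

# 100,000 files averaging 500 KB each -> threads win under this heuristic
print(choose_executor(100_000, 100_000 * 500_000).__name__)  # ThreadPoolExecutor
```

The right threshold would of course have to come from benchmarks like the dvc-bench runs mentioned above, not from a constant picked by hand.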
Thanks for all the replies. Looking forward to seeing whether the index-based workflow can speed up adding many small files (500KB each) contained within a directory (recursively) with a total size of 100GB.
@ivartz Sorry for the confusion. There are a few different ways we compute hashes (e.g. dvc add is not the same as dvc add --to-remote). And there is also transfer involved during dvc add, so I'm mixing things up myself as well.
We are filling up our dvc-bench project with some typical datasets, and "many small files" (like mnist) was (and kinda still is) a big pain point, so we added it and used it to make decisions at the current stage.
Your data seems to be structurally similar though, or am I missing something? 100GB of 500KB files means ~200K small files, which is not the same, but is kinda similar.
We can track this under #7607
One of the projects I did contains directories of more than 1M files. When I do dvc add dir1, it takes quite a while to run, and I saw that dvc only utilizes one CPU to do so. It would be nice if there were a parallelism option that could utilize multiple cores of the machine.