huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Unexpected performance degradation behavior in minhash deduplication stage 2 #298

Open Maghoumi opened 1 week ago

Maghoumi commented 1 week ago

I've been running some large-scale benchmarking with minhash deduplication on SLURM clusters, loosely following this example
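For context, the snippet below is a minimal sketch of the kind of two-stage setup used, loosely adapted from the linked example; all paths, the partition name, the resource settings and the reader are placeholders rather than my actual configuration:

```python
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupSignature
from datatrove.pipeline.dedup.minhash import MinhashConfig, MinhashDedupBuckets
from datatrove.pipeline.readers import JsonlReader

minhash_config = MinhashConfig()  # default buckets / hashes per bucket

# Stage 1: compute minhash signatures, fully parallel over input shards
stage1 = SlurmPipelineExecutor(
    job_name="mh_stage1",
    pipeline=[
        JsonlReader("s3://my-bucket/input/"),  # placeholder input location
        MinhashDedupSignature(output_folder="s3://my-bucket/signatures/", config=minhash_config),
    ],
    tasks=1200,  # 1200 or 2400 in the benchmarks below
    time="10:00:00",
    partition="cpu",
    logging_dir="logs/stage1",
)

# Stage 2: find duplicate pairs within each bucket from the stage-1 signatures
stage2 = SlurmPipelineExecutor(
    job_name="mh_stage2",
    pipeline=[
        MinhashDedupBuckets(
            input_folder="s3://my-bucket/signatures/",
            output_folder="s3://my-bucket/buckets/",
            config=minhash_config,
        ),
    ],
    tasks=600,  # 300 / 600 / 1200 in the benchmarks below
    time="10:00:00",
    partition="cpu",
    logging_dir="logs/stage2",
    depends=stage1,
)

stage2.run()
```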

The benchmarks consist of running stages 1 and 2 with the following configurations:

What I'm observing is that stage 1 seems to scale fairly linearly between these configs. I have the following timing values in the final stats file (all values are in minutes):

| Stage 1 tasks vs. dataset size | 4TB | 8TB |
| --- | --- | --- |
| 1200 tasks | 71 minutes | 123 minutes |
| 2400 tasks | 36 minutes | 61 minutes |

However, for stage 2, the scaling becomes quite different, especially when running the 8TB configuration:

| Stage 2 tasks vs. dataset size | 4TB (stage 1 done with 1200 tasks) | 8TB (stage 1 done with 2400 tasks) |
| --- | --- | --- |
| 300 tasks | 25 minutes | 38 minutes |
| 600 tasks | 17 minutes | **54 minutes** |
| 1200 tasks | - | **50 minutes** |

As reported above, some of the 8TB configurations for stage 2 (boldfaced) take an unexpectedly long time to run. I repeated these experiments several times, and the results appear consistent.

Is this behavior expected? If so, what could be a possible explanation? Let me know if I can provide further information.

guipenedo commented 1 day ago

Hi, thank you for the benchmarks. Stage 1 is indeed fully parallelizable, hence the linear scaling. For stage 2, performance also depends on the number of files/tasks from stage 1:

For T tasks in stage 1, each processing N/T documents, stage 2 processing is O(N*log(T)). As you parallelize stage 2 into K tasks you should in theory get O(N*log(T)/K), since each task should process approximately the same number of hashes. The main effect of T not captured by this complexity estimate is that as T grows, each stage 2 task also has to go through a lot more files, and we only read a few lines at a time from each one so that memory does not explode.
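As a quick sanity check of that estimate, here is a back-of-envelope with made-up numbers (N is hypothetical, only the ratios matter):

```python
import math

N = 8e9   # hypothetical number of hashes, purely illustrative
T = 2400  # stage 1 tasks, kept constant

# per-task cost under the O(N*log(T)/K) estimate: doubling K should roughly halve it
for K in (300, 600, 1200):
    print(f"K={K:4d} -> relative per-task cost {N * math.log2(T) / K:.3e}")
```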

In your example you kept T constant, so we should expect linear scaling in K. I'm not sure why this was not the case, but one possible reason could be filesystem issues: each task in stage 2 will open T files, and each of the T files in each bucket will be opened by all of the tasks assigned to that bucket (K / number of buckets tasks).
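To put rough numbers on the file-access side (assuming 14 buckets, the default in MinhashConfig; adjust if your config differs):

```python
num_buckets = 14  # assumed default MinhashConfig.num_buckets
T = 2400          # stage 1 tasks -> signature files per bucket

for K in (300, 600, 1200):
    tasks_per_bucket = K // num_buckets
    # every stage-2 task reads all T signature files of its bucket,
    # so each of those files is hit by ~K/num_buckets tasks at the same time
    print(f"K={K:4d}: ~{tasks_per_bucket} tasks per bucket, "
          f"{T} files opened per task, ~{K * T} opens overall")
```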

Are you able to run some sort of filesystem/file access benchmark for the three 8TB configurations? Also, I assume the time in your benchmarks is the max across individual tasks, is that the case? Could you also provide the average per task (or the max, if the values in your table are actually the average)?
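For the filesystem check, even something as crude as the sketch below, run from a compute node in each configuration, would already be informative (the path is a placeholder for wherever your stage 1 signatures live):

```python
import glob
import os
import random
import time

# Crude file-access micro-benchmark: open a sample of stage-1 signature files
# and read a small chunk from each, loosely mimicking stage 2's buffered reads.
files = [p for p in glob.glob("/path/to/signatures/**/*", recursive=True) if os.path.isfile(p)]
sample = random.sample(files, min(500, len(files)))

start = time.perf_counter()
for path in sample:
    with open(path, "rb") as f:
        f.read(64 * 1024)
elapsed = time.perf_counter() - start
print(f"opened+read {len(sample)} files in {elapsed:.2f}s ({1000 * elapsed / len(sample):.1f} ms/file)")
```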