Maghoumi opened this issue 1 week ago
Hi, thank you for the benchmarks. Stage 1 is indeed fully parallelizable, hence the linear scaling. For stage 2, performance also depends on the number of files/tasks from stage 1: for T tasks in stage 1, each processing N/T documents, processing in stage 2 would be `O(N*log(T))`. As you parallelize stage 2 into K tasks you should in theory get `O(N*log(T)/K)`, as each task should process approximately the same number of hashes. The main factor not captured by this complexity with respect to T is that, as T increases, you also have to go through a lot more files, and we actually only read a few lines at a time from each one so that memory does not explode.
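For intuition on where the `log(T)` factor comes from: stage 2 essentially merges the T sorted signature streams produced by stage 1, and a min-heap of size T does that at `O(log T)` per hash. The snippet below is only a toy sketch of that idea, not datatrove's actual implementation (in-memory lists stand in for the on-disk signature files):

```python
import heapq


def merge_stage1_outputs(sorted_hash_streams):
    """T-way merge of T sorted hash streams using a min-heap of size T.

    Every heap push/pop costs O(log T), so a stage-2 task that consumes N
    hashes in total does O(N*log(T)) work; splitting a bucket across more
    tasks shrinks N per task, which is where O(N*log(T)/K) comes from.
    """
    iterators = [iter(stream) for stream in sorted_hash_streams]
    heap = []
    # Seed the heap with the first hash of each stream (one entry per stage-1 file)
    for idx, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, idx))
    while heap:
        value, idx = heapq.heappop(heap)      # O(log T)
        yield value, idx                      # duplicates come out as equal consecutive values
        nxt = next(iterators[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))  # O(log T)


# Toy usage: three "stage-1 outputs"; 4 and 9 show up as back-to-back duplicates
print(list(merge_stage1_outputs([[1, 4, 4, 9], [2, 4, 7], [3, 9, 11]])))
```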
In your example you kept T constant, so we should expect linear scaling in K. I'm not sure why this was not the case, but one possible reason could be filesystem issues: each task in stage 2 will open T files, and each of the T files in each bucket will be opened by every task assigned to that bucket (K / number of buckets tasks).
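To make the file-handle pressure concrete, here is a back-of-the-envelope calculation; the T, K and bucket counts below are made-up placeholders, not your actual configuration:

```python
# Purely illustrative placeholder numbers
T = 2000          # stage-1 tasks, i.e. signature files per bucket
K = 1400          # stage-2 tasks
num_buckets = 14  # minhash buckets

tasks_per_bucket = K // num_buckets  # tasks sharing one bucket's files
opens_per_task = T                   # every stage-2 task opens all T files of its bucket
opens_per_file = tasks_per_bucket    # each stage-1 file is opened by K / num_buckets tasks
total_opens = K * T                  # open() calls hitting the filesystem overall

print(f"each stage-2 task opens {opens_per_task} files")
print(f"each stage-1 file is opened by {opens_per_file} tasks")
print(f"total opens across stage 2: {total_opens:,}")
```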
Are you able to run some sort of filesystem/file-access benchmark for the 3 8TB configurations? Also, I assume the time in your benchmarks is the max out of each individual task; is this the case? Could you also provide the average per task (or the max, if the values in your table are actually the average)?
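If it helps, even a simple probe along the lines below (run on each filesystem, pointed at the stage-1 signature folder; the path is a placeholder) would show whether open-plus-small-read latency differs between the 8TB configurations:

```python
import glob
import os
import random
import time

# Placeholder path: point this at the stage-1 signature output on the filesystem under test
files = [p for p in glob.glob("/path/to/minhash/signatures/**/*", recursive=True) if os.path.isfile(p)]
sample = random.sample(files, min(500, len(files)))

start = time.perf_counter()
for path in sample:
    with open(path, "rb") as f:
        f.read(4096)  # mimic stage 2's pattern of reading only a small chunk per file
elapsed = time.perf_counter() - start
print(f"opened+read {len(sample)} files in {elapsed:.2f}s "
      f"({elapsed * 1000 / max(len(sample), 1):.1f} ms per file)")
```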
I've been running some large-scale benchmarking with minhash deduplication on SLURM clusters, loosely following this example.
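For context, my setup is roughly along the lines of the sketch below, condensed from the datatrove minhash example; paths, partition names, walltimes, and task counts are placeholders rather than my real values, and exact parameter names may differ slightly between datatrove versions:

```python
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupBuckets, MinhashDedupSignature
from datatrove.pipeline.dedup.minhash import MinhashConfig
from datatrove.pipeline.readers import JsonlReader

minhash_config = MinhashConfig()  # default number of buckets / hashes per bucket

DATA_PATH = "/path/to/input"       # placeholder
MINHASH_BASE = "/path/to/minhash"  # placeholder
STAGE1_TASKS = 1000                # "T": number of stage-1 tasks (placeholder)

# Stage 1: compute minhash signatures; fully parallel over input shards
stage1 = SlurmPipelineExecutor(
    job_name="mh_stage1",
    pipeline=[
        JsonlReader(DATA_PATH),
        MinhashDedupSignature(output_folder=f"{MINHASH_BASE}/signatures", config=minhash_config),
    ],
    tasks=STAGE1_TASKS,
    time="10:00:00",
    partition="cpu",  # placeholder partition
    logging_dir=f"{MINHASH_BASE}/logs/signatures",
)

# Stage 2: find signature matches inside each bucket; "K" tasks split across the buckets
stage2 = SlurmPipelineExecutor(
    job_name="mh_stage2",
    pipeline=[
        MinhashDedupBuckets(
            input_folder=f"{MINHASH_BASE}/signatures",
            output_folder=f"{MINHASH_BASE}/buckets",
            config=minhash_config,
        ),
    ],
    tasks=minhash_config.num_buckets * 100,  # "K" (placeholder multiplier)
    time="10:00:00",
    partition="cpu",
    logging_dir=f"{MINHASH_BASE}/logs/buckets",
    depends=stage1,
)

stage2.run()  # the dependency makes stage 1 run first
```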
The benchmarks consist of running stages 1 and 2 with the following configurations:
What I'm observing is that stage 1 seems to scale fairly linearly between these configs. I have the following timing values in the final stats file (all values are in minutes):

However, for stage 2, the scaling becomes quite different, especially when running the 8TB configuration:

As reported above, some of the 8TB configs for stage 2 (boldfaced) are taking an unexpectedly long time to run. I repeated these experiments several times, and the results appear consistent.

I was wondering if this behavior is expected? If so, what could be a possible explanation? Let me know if I can provide further information.