Wrong stats in multi-node local executor

huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Apache License 2.0

2.03k stars 144 forks

Hi, this is resolved for the slurm executor by running a stats merger after all subtasks have finished. I don't think there is a way to get the same behavior here, because in a multi-node local executor setup the global orchestration is not done by datatrove. The responsibility for launching the merge script therefore can't be handled by datatrove.

If you log all stats into one folder, you can use this script: https://github.com/huggingface/datatrove/blob/main/src/datatrove/tools/merge_stats.py, which is exactly the script that the slurm executor runs after all tasks have finished.
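
To illustrate what such a merge step does conceptually, here is a minimal sketch. It is not datatrove's merge_stats.py and does not use datatrove's stats format: it assumes, purely for illustration, that each task wrote a flat JSON file of counter names to numbers into one shared folder. The function name `merge_counter_files` and the file layout are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def merge_counter_files(stats_dir: str) -> dict:
    """Sum per-task counter files found directly in stats_dir.

    Assumption (hypothetical layout, not datatrove's real format):
    every *.json file is a flat mapping of counter name -> number.
    """
    totals = Counter()
    for path in sorted(Path(stats_dir).glob("*.json")):
        with path.open() as f:
            # Counter.update adds values for matching keys instead of replacing them.
            totals.update(json.load(f))
    return dict(totals)
```

Since nothing orchestrates the nodes globally, a step like this has to be started by hand (or by your own launcher) once every node has finished writing its stats.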

Wrong stats in multi-node local executor #297