huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0
2.03k stars 144 forks

Wrong stats in multi-node local executor #297

Open jordane95 opened 4 weeks ago

jordane95 commented 4 weeks ago

Because there is no node-level communication, the stats reported at the end of each pipeline step only aggregate results from the current node, and what gets written to disk is the status of the last finished worker rather than the global stats.
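A toy model of the symptom, for illustration only. This is not datatrove code: the counter name `documents`, the two nodes, and the helper `aggregate` are all hypothetical, and the sketch assumes each worker's stats can be reduced to a flat dict of counters.

```python
from collections import Counter


def aggregate(worker_counts):
    """Sum the counters of a group of workers (hypothetical flat schema)."""
    total = Counter()
    for counts in worker_counts:
        total.update(counts)  # Counter.update adds values key by key
    return dict(total)


# Two nodes, each only aware of its own workers.
node_a = [{"documents": 100}, {"documents": 120}]
node_b = [{"documents": 90}]

# What each node can report at the end of a step: its own workers only.
per_node = [aggregate(node_a), aggregate(node_b)]

# What ends up on disk in this toy model: the last finished worker's stats.
on_disk = node_b[-1]

# What a global view would report.
global_stats = aggregate(node_a + node_b)

print(per_node)      # [{'documents': 220}, {'documents': 90}]
print(on_disk)       # {'documents': 90}
print(global_stats)  # {'documents': 310}
```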

hynky1999 commented 3 weeks ago

Hi, this is resolved for the slurm executor by running a stats merger after all subtasks are finished. I don't think there is a way to get the same behavior in the local executor, because in multi-node mode the global orchestration is not done by datatrove. The responsibility for launching the merge script therefore can't be handled by datatrove.

If you log all stats into one folder, you can use this script: https://github.com/huggingface/datatrove/blob/main/src/datatrove/tools/merge_stats.py. It is exactly the script that the slurm executor runs after all tasks have finished.
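To make the idea concrete, here is a minimal sketch of what "merge everything in one folder" means. It is not the linked script and does not use datatrove's API. It assumes a hypothetical layout of one JSON file per task, each holding a flat dict of numeric counters. Real datatrove stats files are not assumed to follow this schema, so use `merge_stats.py` for those. The names `merge_stats_folder`, `documents` and `dropped` are made up for the example.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def merge_stats_folder(folder: Path) -> dict:
    """Sum the counters of every *.json file in `folder` (hypothetical schema)."""
    merged = Counter()
    for path in sorted(folder.glob("*.json")):
        with path.open() as f:
            merged.update(json.load(f))  # adds values key by key
    return dict(merged)


# Simulate two tasks, possibly run on different nodes, that wrote their
# stats into the same shared folder.
with tempfile.TemporaryDirectory() as tmp:
    stats_dir = Path(tmp)
    per_task = [
        {"documents": 100, "dropped": 3},
        {"documents": 90, "dropped": 1},
    ]
    for rank, counts in enumerate(per_task):
        (stats_dir / f"{rank:05d}.json").write_text(json.dumps(counts))

    merged = merge_stats_folder(stats_dir)

print(merged)  # {'documents': 190, 'dropped': 4}
```

The merge has to run once, after every node has finished writing, which is why it has to be launched by whatever orchestrates the nodes.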