Open jordane95 opened 4 weeks ago
Hi, this is resolved for the Slurm executor by running a stats merger after all subtasks have finished. I don't think there is a way to get the same behavior with the local executor, because in a multi-node local run the global orchestration is not done by datatrove. The responsibility for launching the merge script therefore can't be handled by datatrove.
If you log all stats into one folder, you can use this script: https://github.com/huggingface/datatrove/blob/main/src/datatrove/tools/merge_stats.py, which is exactly the script that the Slurm executor runs after all tasks have finished.
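To illustrate the idea of merging per-task stats from a shared folder, here is a minimal sketch. It is not the linked `merge_stats.py` and does not follow datatrove's real stats format: it assumes each task wrote one JSON file made of nested objects with numeric counters, and the names `merge_counts` and `merge_stats_folder` are made up for this example. For real runs the linked script is the one to use.

```python
import json
from pathlib import Path


def merge_counts(a, b):
    """Recursively sum the numeric leaves of two nested dicts (hypothetical helper)."""
    merged = dict(a)
    for key, value in b.items():
        if key not in merged:
            merged[key] = value
        elif isinstance(value, dict) and isinstance(merged[key], dict):
            merged[key] = merge_counts(merged[key], value)
        elif isinstance(value, (int, float)) and isinstance(merged[key], (int, float)):
            merged[key] = merged[key] + value
        # Any other type (e.g. a step name) keeps the first value seen.
    return merged


def merge_stats_folder(folder):
    """Merge every *.json file in `folder` into a single dict of totals."""
    total = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open() as f:
            total = merge_counts(total, json.load(f))
    return total
```

Because every node only needs to write its files into the same folder, the merge can be run once by hand (or by whatever launches the nodes) after the last node has finished.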
Due to the lack of node-level communication, the stats at the end of each pipeline step can only aggregate results from the current node, and what gets written to disk is the stats of the last worker to finish, rather than the global totals.