huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

LocalPipelineExecutor does not use cpu cores #240

Open elifssamplespace opened 1 month ago

elifssamplespace commented 1 month ago

I am trying to process a CC dump using the LocalPipelineExecutor. My setup includes 6 files in the dump and a VM with 48 CPU cores. I run the code with 6 tasks and 48 workers. I expect all 48 cores to be utilized efficiently, but only 6 cores actively process the tasks.

Code:

    executor = LocalPipelineExecutor(
        pipeline=[
            JsonlReader(data_folder=f"{cc_path}/{dump}", text_key="raw_content"),
            URLFilter(),
            GopherRepetitionFilter(language="tr"),
            GopherQualityFilter(language="tr"),
            C4QualityFilter(filter_no_terminal_punct=False, language="tr"),
            C4BadWordsFilter(default_language="tr"),
            PIIFormatter(),
            JsonlWriter(output_folder=f"{out_path}/out-text-process-4/{dump}"),
        ],
        logging_dir="logs",
        workers=48,
        tasks=6,
    )
    executor.run()

How can I use all cores to process data?

justHungryMan commented 1 month ago

Since the maximum number of tasks is 6, even if you use 48 workers, only 6 workers will get a task and run.
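In other words, the effective parallelism is bounded by whichever of the two settings is smaller. A minimal illustration in plain Python (not datatrove code, just the arithmetic behind the observation):

```python
def effective_parallelism(tasks: int, workers: int) -> int:
    """Workers beyond the number of tasks have nothing to pick up,
    so the number of busy workers is capped by the task count."""
    return min(tasks, workers)

print(effective_parallelism(6, 48))   # 6 — only 6 of the 48 workers run
print(effective_parallelism(96, 48))  # 48 — now every worker stays busy
```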

guipenedo commented 1 month ago

Hi, we only multiprocess at the individual file level. So if you have 1 task processing 1 file, giving it more CPUs will not speed up the processing. The way to go faster is to have more (smaller) input files, so that you can have more tasks in total.
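One way to get more tasks is to split each large input file into smaller shards before running the pipeline, so the executor can assign one task per shard. A standard-library sketch (the function name and round-robin scheme are illustrative, not a datatrove API):

```python
import os

def split_jsonl(path: str, out_dir: str, num_shards: int) -> list:
    """Split one JSONL file into num_shards smaller files,
    distributing lines round-robin across the shards."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(path))[0]
    shard_paths = [
        os.path.join(out_dir, f"{base}_{i:03d}.jsonl") for i in range(num_shards)
    ]
    shards = [open(p, "w", encoding="utf-8") for p in shard_paths]
    try:
        with open(path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                # Each JSONL line is an independent document, so any
                # assignment of lines to shards preserves the data.
                shards[i % num_shards].write(line)
    finally:
        for s in shards:
            s.close()
    return shard_paths
```

With 48 shards per dump file, you could then raise `tasks` to match the total shard count and keep all 48 workers busy.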