huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

LocalPipelineExecutor does not use cpu cores #240

Open elifssamplespace opened 1 month ago

elifssamplespace commented 1 month ago

I am trying to process a CC dump using the LocalPipelineExecutor. My setup includes 6 files in the dump and a VM with 48 CPU cores. I run the code with 6 tasks and 48 workers. I expect all 48 cores to be utilized efficiently, but only 6 cores actively process the tasks.

Code:

    executor = LocalPipelineExecutor(
        pipeline=[
            JsonlReader(data_folder=f"{cc_path}/{dump}", text_key="raw_content"),
            URLFilter(),
            GopherRepetitionFilter(language="tr"),
            GopherQualityFilter(language="tr"),
            C4QualityFilter(filter_no_terminal_punct=False, language="tr"),
            C4BadWordsFilter(default_language="tr"),
            PIIFormatter(),
            JsonlWriter(output_folder=f"{out_path}/out-text-process-4/{dump}"),
        ],
        logging_dir="logs",
        workers=48,
        tasks=6,
    )
    executor.run()

How can I use all cores to process data?

justHungryMan commented 1 month ago

Since the maximum number of tasks is 6, even if you use 48 workers, only 6 workers will get a task and run.
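In other words, the effective parallelism is bounded by whichever of the two settings is smaller. A minimal illustration in plain Python (not datatrove code, just the arithmetic behind the observation):

```python
def effective_parallelism(tasks: int, workers: int) -> int:
    """Workers beyond the number of tasks have nothing to pick up,
    so the number of busy workers is capped by the task count."""
    return min(tasks, workers)

print(effective_parallelism(6, 48))   # 6 — only 6 of the 48 workers run
print(effective_parallelism(96, 48))  # 48 — now every worker stays busy
```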

guipenedo commented 1 month ago

Hi, we only multiprocess at the individual file level. So if you have 1 task processing 1 file, giving it more CPUs will not speed up the processing. The way to go faster is to have more (smaller) input files, so that you can have more tasks in total.
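One way to get more tasks is to split each large input file into smaller shards before running the pipeline, so the executor can assign one task per shard. A standard-library sketch (the function name and round-robin scheme are illustrative, not a datatrove API):

```python
import os

def split_jsonl(path: str, out_dir: str, num_shards: int) -> list:
    """Split one JSONL file into num_shards smaller files,
    distributing lines round-robin across the shards."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(path))[0]
    shard_paths = [
        os.path.join(out_dir, f"{base}_{i:03d}.jsonl") for i in range(num_shards)
    ]
    shards = [open(p, "w", encoding="utf-8") for p in shard_paths]
    try:
        with open(path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                # Each JSONL line is an independent document, so any
                # assignment of lines to shards preserves the data.
                shards[i % num_shards].write(line)
    finally:
        for s in shards:
            s.close()
    return shard_paths
```

With 48 shards per dump file, you could then raise `tasks` to match the total shard count and keep all 48 workers busy.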