huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0
2.02k stars 144 forks

Is this method implemented only with data parallelism? Is there any pipeline parallelism, as with model parallelism? #208

Closed WenhaoZhang-Git closed 5 months ago

WenhaoZhang-Git commented 5 months ago

Does this method implement data parallelism for both single-node and multi-node setups?

hynky1999 commented 5 months ago

Hey, I am having a bit of a hard time understanding the issue. Could you elaborate? What method do you have in mind?

WenhaoZhang-Git commented 5 months ago

Thank you for answering. I mean the execute method of the PipelineExecutor class.

guipenedo commented 5 months ago

Hi, to clarify, datatrove is a data processing library, not a distributed training framework. If you want a distributed training framework, I recommend you look into nanotron.

WenhaoZhang-Git commented 5 months ago

Thank you for your reply.

WenhaoZhang-Git commented 5 months ago

Thank you again for your reply. Is there any data processing library or method for processing structured data, such as the tabular data in the book?

guipenedo commented 5 months ago

What book are you referring to?

WenhaoZhang-Git commented 5 months ago

[Attached images: table_in_book, textified_table]

I have some books in PDF format that contain tables. When I convert book.pdf to book.txt, I get the 'textified_table' output derived from 'table_in_book'. I only want the plain text of the book; the unstructured 'textified_table' is a kind of noisy data. Maybe I should just convert the PDF to Markdown instead. Do you know how this kind of problem is handled in current public book datasets for LLM pretraining? Thank you very much, I really appreciate your reply.

guipenedo commented 5 months ago

We don't currently have any fix for this; our text extraction is intended for HTML documents.
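
As a rough workaround outside datatrove, textified table rows can sometimes be filtered out of the converted text with a simple heuristic. The sketch below is only an illustration, not a datatrove feature: the function names `looks_like_table_row` and `drop_table_blocks` are hypothetical, and it assumes the PDF converter separates table columns with tabs or runs of two or more spaces, which will not hold for every tool.

```python
import re

# Assumption: table columns in the converted text are separated by a tab
# or by at least two consecutive spaces. Not part of datatrove.
_COLUMN_GAP = re.compile(r"\t| {2,}")


def looks_like_table_row(line: str, min_columns: int = 3) -> bool:
    """Return True if the line splits into at least `min_columns` cells."""
    cells = [c for c in _COLUMN_GAP.split(line.strip()) if c]
    return len(cells) >= min_columns


def drop_table_blocks(text: str, min_rows: int = 2) -> str:
    """Remove runs of at least `min_rows` consecutive table-like lines."""
    kept, run = [], []
    for line in text.splitlines():
        if looks_like_table_row(line):
            run.append(line)
            continue
        if len(run) < min_rows:
            kept.extend(run)  # too short to be a table, so keep it
        run = []
        kept.append(line)
    if len(run) < min_rows:
        kept.extend(run)  # handle a short run at the end of the text
    return "\n".join(kept)


sample = (
    "Chapter 1\n"
    "Year  Revenue  Profit\n"
    "2019  1.2  0.3\n"
    "2020  1.5  0.4\n"
    "The text continues."
)
cleaned = drop_table_blocks(sample)
```

Requiring at least two consecutive table-like lines keeps isolated lines that merely contain wide spacing; both thresholds are guesses that would need tuning on real book text.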

WenhaoZhang-Git commented 5 months ago

I appreciate your candor.