Closed: WenhaoZhang-Git closed this issue 5 months ago
Hey, I am having a bit of a hard time understanding the issue. Could you elaborate? What method do you have in mind?
Thank you for the answer. I mean the execute method of the PipelineExecutor class.
Hi, to clarify, datatrove is a data processing library, and not a distributed training framework. If you want a distributed training framework I recommend you look into nanotron
Thank you for the reply.
Thank you again for the reply. Is there any data processing library or method for processing structured data, such as the tabular data in a book?
What book are you referring to?
I have some books in PDF format that contain tables. When I convert book.pdf to book.txt, I get a 'textified_table' derived from each 'table_in_book'. I just want the pure text of the book; the unstructured 'textified_table' is noisy data. Maybe I should just convert the PDF to Markdown. Do you know how this kind of problem is handled in current public book datasets for LLM pretraining? Thanks a lot, I really appreciate your reply.
We don't currently have any fix for this; our text extraction is intended for HTML documents.
I appreciate your candor.
Does this method implement data parallelism for single-node and multi-node setups?