Closed PhilipMay closed 3 years ago
Well - I think this is the solution: https://huggingface.co/docs/datasets/loading_datasets.html#text-files
Hi PhilipMay
This is a repo for my personal research, and there is no plan to add a feature for training or fine-tuning on text files alone. And yes, the link you pasted is currently the only solution.
"there is no plan to add a feature for training or fine-tuning on text files alone."
I would like to understand exactly what you mean by this. As far as I understand, I could use the solution above to load a text file as data and continue pretraining from a stored checkpoint. Is that right?
Yes, I just mean I won't add a feature to train directly on text files without huggingface/datasets.
Hey @richarddwang, would it be possible to provide a description of how to use "just text files" for pretraining? I have a large sentence-split file with a blank line between documents and would like to domain-adapt my ELECTRA model to my domain-specific corpus.
Your examples use these hugdatafast Arrow datasets. How do I inject my own texts?
Many thanks, Philip
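For a file laid out as described (one sentence per line, a blank line between documents), a minimal sketch of one possible way to get it into huggingface/datasets is shown here. It is not the repo author's method. The helper names `iter_documents` and `load_corpus` are hypothetical, and the column name `"text"` is an assumption about what the downstream pretraining code expects. Note that `load_dataset("text", data_files=...)` from the link above yields one row per line by default, so grouping lines into documents first may be useful when document boundaries matter.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one string per document; blank lines separate documents."""
    sentences: List[str] = []
    for line in lines:
        line = line.strip()
        if line:
            # Non-blank line: one sentence of the current document.
            sentences.append(line)
        elif sentences:
            # Blank line: the current document is complete.
            yield " ".join(sentences)
            sentences = []
    if sentences:
        # Last document when the file does not end with a blank line.
        yield " ".join(sentences)


def load_corpus(path: str):
    """Hypothetical helper: build a Dataset with one "text" row per document."""
    # Imported lazily so iter_documents works without the library installed.
    from datasets import Dataset  # requires `pip install datasets`

    with open(path, encoding="utf-8") as f:
        # Assumption: the pretraining pipeline reads a column named "text".
        return Dataset.from_dict({"text": list(iter_documents(f))})
```

Calling `load_corpus("corpus.txt")` (the path is a placeholder) would return an Arrow-backed dataset that can then be tokenized and processed like the datasets used in the examples. This reads the whole corpus into memory, which may not suit a very large file.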