richarddwang / electra_pytorch

Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated!)

Description of how to use "just text files" #14

Closed PhilipMay closed 3 years ago

PhilipMay commented 3 years ago

Hey @richarddwang, would it be possible to provide a description of how to use "just text files" for pretraining? I have a large sentence-split file with blank lines between documents, and I would like to domain-adapt my ELECTRA model to my domain-specific corpus.

Your examples use these hugdatafast Arrow datasets. How do I inject my own texts?

Many thanks, Philip

PhilipMay commented 3 years ago

Well - I think this is the solution: https://huggingface.co/docs/datasets/loading_datasets.html#text-files
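
For reference, I think something like this sketch is what's needed (`corpus.txt` is just a placeholder for my own file, one sentence per line with blank lines between documents):

```python
# Minimal sketch of loading a plain text file via huggingface/datasets,
# following the linked docs. Each line of the file becomes one example.
from datasets import load_dataset

dataset = load_dataset("text", data_files={"train": "corpus.txt"})
print(dataset["train"][0])  # e.g. {'text': '<first line of the file>'}
```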

richarddwang commented 3 years ago

Hi @PhilipMay,

This is a repo for my personal research, and there is no plan to add a feature to train or finetune on only text files. And yes, the link you pasted is currently the only solution.

PhilipMay commented 3 years ago

there is no plan to add a feature to train or finetune on only text files.

I would like to understand what exactly you mean by this. As far as I understand, I could use the solution above to load a text file as a dataset and then continue pretraining from a stored checkpoint. Is that right?

richarddwang commented 3 years ago

Yes, I just mean that I won't add a feature to train directly on text files without going through huggingface/datasets.
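
For anyone landing here later, a rough sketch of the intended path: load the text file via huggingface/datasets and tokenize it before plugging it into the pretraining pipeline. The file name, checkpoint name, and `max_length` below are placeholder assumptions, not values this repo prescribes:

```python
# Hedged sketch: turn a line-per-sentence text file into a tokenized
# dataset that can stand in for the hugdatafast Arrow datasets.
from datasets import load_dataset
from transformers import ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-generator")
raw = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(batch):
    # Truncation length is an assumption; pick what matches your config.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
```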