How to load the cached data from ELECTRAProcessor?

richarddwang / electra_pytorch

Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated !)


How to load the cached data from ELECTRAProcessor? #29

Closed: cooelf closed this issue 3 years ago

cooelf commented 3 years ago

Hi, thanks for the awesome repo!

In https://github.com/richarddwang/electra_pytorch/blob/master/pretrain.py#L142, ELECTRAProcessor generates an f"electraowt{c.max_length}.arrow" file. Is it possible to load this cached Arrow data directly?

It would be helpful if the data did not have to be processed from scratch on every run. Could you give me some hints?

Thanks!

richarddwang commented 3 years ago

Hi @cooelf, thanks for the comment.

ELECTRAProcessor internally uses datasets.Dataset.map, which reuses the cache if the cache file exists. The cache file is named by cache_file_name and, by default, is placed under the cache directory where the raw dataset is stored. In short, later runs reuse the processed dataset that was created the first time it was called.

Please tag me if you still have questions.