cooelf closed this issue 3 years ago
Hi @cooelf, thanks for the comment.
ELECTRAProcessor internally uses datasets.Dataset.map, which reuses the cache if the cache file exists. The cache file is named by cache_file_name and, by default, is placed under the cache directory where the raw dataset is stored. In short, it will reuse the processed dataset that was created the first time it was called.
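To make the reuse-if-exists behaviour concrete, here is a minimal stdlib-only sketch of the pattern. It is not the code of datasets or ELECTRAProcessor: the names map_with_cache, tokenize and processed.json are hypothetical, and JSON stands in for the Arrow format that the real library writes.

```python
import json
import os
import tempfile


def map_with_cache(records, fn, cache_file_name):
    """Apply fn to every record, reusing cache_file_name if it already exists.

    Returns (processed_records, cache_hit). Hypothetical helper that only
    models the caching idea; it is not the datasets implementation.
    """
    if os.path.exists(cache_file_name):
        # Cache hit: skip processing and load the stored result.
        with open(cache_file_name) as f:
            return json.load(f), True
    # Cache miss: process everything, then persist the result for later runs.
    processed = [fn(r) for r in records]
    with open(cache_file_name, "w") as f:
        json.dump(processed, f)
    return processed, False


calls = []  # records every time real processing happens


def tokenize(text):
    calls.append(text)
    return text.split()


cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "processed.json")

# The first call processes the data and writes the cache file.
first, hit_first = map_with_cache(["a b", "c d e"], tokenize, cache_path)
# The second call finds the file and does not call tokenize again.
second, hit_second = map_with_cache(["a b", "c d e"], tokenize, cache_path)
```

As long as the cache file stays at the same path, later runs skip the processing step. With the real library, an Arrow file written by map can usually also be opened directly with datasets.Dataset.from_file, though whether that fits this repo's pipeline is an assumption.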
Please tag me if you still have questions.
Hi, thanks for the awesome repo!
In https://github.com/richarddwang/electra_pytorch/blob/master/pretrain.py#L142, ELECTRAProcessor generates an f"electraowt{c.max_length}.arrow" file. Is it possible to load this cached Arrow data directly?
It would be helpful if the data did not have to be processed from scratch on each run. Could you give some hints?
Thanks!