Dataset streaming has not been tested on any of the examples, so I'm not sure it works, especially for distributed training on TPUs.
I have been working on this feature for several days. In particular, I am trying to implement an IterableDataset that reads preprocessed data from cloud storage. Do you think the problem is with streaming or with the IterableDataset? Using a PyTorch IterableDataset in distributed training can be tricky, as can be seen from this issue.
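To make the question concrete, this is roughly the sharding pattern I have in mind (a minimal sketch, not my actual implementation; the class name and the cloud-reading callable are placeholders): each distributed process and each DataLoader worker skips the samples that belong to other shards, otherwise every core would train on the full stream.

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info


class ShardedStreamingDataset(IterableDataset):
    """Hypothetical streaming dataset that shards samples across
    distributed processes (TPU cores) and DataLoader workers."""

    def __init__(self, sample_iterable_factory):
        # sample_iterable_factory: a callable returning an iterator over
        # preprocessed samples (e.g. read from cloud storage) -- placeholder.
        self.sample_iterable_factory = sample_iterable_factory

    def __iter__(self):
        # Global shard: one slice per process. On TPU, torch_xla exposes the
        # ordinal/world size; otherwise fall back to torch.distributed.
        try:
            import torch_xla.core.xla_model as xm
            rank, world_size = xm.get_ordinal(), xm.xrt_world_size()
        except ImportError:
            rank = dist.get_rank() if dist.is_initialized() else 0
            world_size = dist.get_world_size() if dist.is_initialized() else 1

        # Local shard: one slice per DataLoader worker inside this process.
        worker = get_worker_info()
        num_workers = worker.num_workers if worker is not None else 1
        worker_id = worker.id if worker is not None else 0

        shard_id = rank * num_workers + worker_id
        num_shards = world_size * num_workers

        # Keep only every num_shards-th sample, offset by this shard's id.
        for i, sample in enumerate(self.sample_iterable_factory()):
            if i % num_shards == shard_id:
                yield sample
```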
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: 4.10.0 (currently master)
Script: examples/pytorch/language-modeling/run_mlm_no_trainer.py, which uses Accelerate

Who can help
@sgugger @patil-suraj
Information
Model I am using (Bert, XLNet ...):
The problem arises when using:
I have made small modifications to examples/pytorch/language-modeling/run_mlm_no_trainer.py; the changes are as follows (my fork can be reached at https://github.com/akalieren/transformers-master):

- Added streaming_data=True to the Dataset class (see the sketch below).
- Removed the tpu_num_cores argument passed by xla_spawn.py via sys.argv, since it threw an error.

The tasks I am working on is:
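For reference, a minimal sketch of what the streaming change above roughly corresponds to in stock datasets (streaming_data=True is a name from my fork, not an existing transformers/datasets argument; the dataset and tokenizer below are illustrative, not necessarily the ones I use):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# streaming=True returns an iterable dataset that yields examples on the fly
# instead of downloading and caching the whole corpus first.
raw_dataset = load_dataset(
    "oscar", "unshuffled_deduplicated_en", split="train", streaming=True
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# map() on a streaming dataset is applied lazily, as examples are consumed.
tokenized_stream = raw_dataset.map(tokenize, batched=True)

# Peek at one example to check the pipeline works before wiring it into training.
print(next(iter(tokenized_stream)).keys())
```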
To reproduce
Steps to reproduce the behavior:
git clone https://github.com/akalieren/transformers-master
export XRT_TPU_CONFIG="localservice;0;localhost:51011"
Note: Without xla_spawn.py, Accelerate uses only one TPU core. That is why I made this change; with 1 core the script runs, but slowly.
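For completeness, the launch step (not listed above) presumably looks something like the following; the model and dataset arguments are illustrative only, not the exact ones I used:

python examples/pytorch/xla_spawn.py --num_cores 8 examples/pytorch/language-modeling/run_mlm_no_trainer.py --model_name_or_path bert-base-uncased --dataset_name wikitext --dataset_config_name wikitext-103-raw-v1 --per_device_train_batch_size 8 --output_dir /tmp/test-mlm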
Expected behavior
I expected the training script to run on 8 cores at normal speed. Instead, it stops at this point and does not continue, even without my small changes.