Btw I had to pin the datasets package to <2.10.0 because I was running into this issue: https://github.com/apache/arrow/issues/34455. With that pin it works.
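For reference, the pin is just a version constraint like this (the exact place it lives, requirements file vs. install command, may differ in this repo):

```
pip install "datasets<2.10.0"
```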
I am running the trainings for 1 epoch to check that everything works now.
Btw I found out that the HF dataset in Arrow format is 40x bigger with the images included, so I think the chunking is really unnecessary for the RoBERTa synthetic dataset. I therefore just increased the chunk_size there from 10000 to 100000, effectively turning chunking off for synthetic pretraining.
The added logic is at the top of `prepare_hf_dataset`, where the dataset is preprocessed in chunks of 10000 documents. Storing to Arrow format must be turned on so that the stored datasets can then be concatenated. Although it was failing only for LayoutLMv3, not for RoBERTa, I made the changes there as well to keep the files as in sync as possible. As a side effect, this change should also significantly decrease RAM usage (at the cost of increased disk usage, but the preprocessed Arrow dataset can be deleted after training).

I also created a function `prepare_hf_dataset` to remove some duplicated code (loading of the `val` and training datasets) and simplified the script arguments for loading/storing the preprocessed/Arrow datasets.
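For clarity, here's a minimal sketch of what the chunked-preprocessing logic looks like with the `datasets` API. This is not the actual code from the repo: the function name, `preprocess_fn`, and the output paths are placeholders standing in for whatever `prepare_hf_dataset` really uses.

```python
from pathlib import Path

from datasets import Dataset, concatenate_datasets, load_from_disk


def preprocess_in_chunks(
    raw_dataset: Dataset,
    preprocess_fn,
    out_dir: str,
    chunk_size: int = 10_000,
) -> Dataset:
    """Preprocess the dataset in chunks, store each chunk to Arrow on disk,
    then concatenate the stored chunks. Only one chunk is held in RAM at a time."""
    out_dir = Path(out_dir)
    chunk_dirs = []
    for start in range(0, len(raw_dataset), chunk_size):
        end = min(start + chunk_size, len(raw_dataset))
        chunk = raw_dataset.select(range(start, end))
        processed = chunk.map(preprocess_fn, batched=True)
        chunk_dir = out_dir / f"chunk_{start:08d}"
        processed.save_to_disk(str(chunk_dir))  # storing to Arrow must be turned on
        chunk_dirs.append(chunk_dir)
        del chunk, processed  # keep RAM bounded to one chunk before moving on

    # Reload the memory-mapped Arrow chunks and concatenate them into one dataset.
    return concatenate_datasets([load_from_disk(str(d)) for d in chunk_dirs])
```

With `chunk_size` bumped to 100000, the loop runs over the whole synthetic text dataset in one pass, which is the "effectively turning chunking off" change mentioned above; the stored Arrow directories can be deleted after training.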