rossumai / docile

DocILE: Document Information Localization and Extraction Benchmark
https://docile.rossum.ai
MIT License

Baselines, fix LayoutLMv3 synthetic pretraining, process dataset chunks #60

Closed: simsa-st closed this pull request 1 year ago

simsa-st commented 1 year ago

The added logic is at the top of prepare_hf_dataset, where the dataset is preprocessed in chunks of 10000 documents. Storing to arrow format must be turned on so that the stored chunk datasets can then be concatenated. Although it was only failing for LayoutLMv3, not for RoBERTa, I made the changes there as well to keep the two files as in sync as possible. As a side effect, this change should significantly decrease RAM usage (at the cost of increased disk usage, but the preprocessed arrow dataset can be deleted after training).
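Not the exact code from the PR, just a minimal sketch of the chunked-preprocessing idea with HuggingFace datasets; preprocess_in_chunks, preprocess_fn and arrow_dir are hypothetical names:

```python
from pathlib import Path

import datasets  # HuggingFace datasets


def preprocess_in_chunks(documents, preprocess_fn, arrow_dir: Path, chunk_size: int = 10000):
    """Preprocess documents chunk by chunk, store each chunk in arrow format,
    then concatenate the stored chunks into one dataset."""
    chunk_paths = []
    for start in range(0, len(documents), chunk_size):
        chunk_docs = documents[start : start + chunk_size]
        # Only this chunk is materialized in memory, so peak RAM scales with chunk_size.
        chunk = datasets.Dataset.from_list([preprocess_fn(doc) for doc in chunk_docs])
        chunk_path = arrow_dir / f"chunk_{start // chunk_size}"
        chunk.save_to_disk(str(chunk_path))  # stored in arrow format on disk
        chunk_paths.append(chunk_path)
        del chunk  # release RAM before processing the next chunk

    # Reload the (memory-mapped) arrow chunks and concatenate them.
    return datasets.concatenate_datasets(
        [datasets.Dataset.load_from_disk(str(p)) for p in chunk_paths]
    )
```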

I also created a function prepare_hf_dataset to remove some duplicated code (loading of the validation and training datasets) and simplified the script arguments for loading/storing the preprocessed arrow datasets.
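For illustration only (the actual flag names in the baseline scripts may differ), the simplified interface could reduce to a single directory argument plus a switch:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
# One directory is used both for storing and for later loading the arrow dataset.
parser.add_argument(
    "--arrow-dataset-dir",
    type=Path,
    default=None,
    help="Directory with the preprocessed dataset in arrow format.",
)
parser.add_argument(
    "--store-arrow-dataset",
    action="store_true",
    help="Preprocess the raw dataset and store it to --arrow-dataset-dir before training.",
)
args = parser.parse_args()
```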

simsa-st commented 1 year ago

Btw, I had to pin the datasets package to <2.10.0 because I was hitting this issue: https://github.com/apache/arrow/issues/34455. With that pin it works.
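For reference, the pin is a single version constraint in the dependency specification (the exact file and syntax used in the repo may differ), e.g. in pip/requirements style:

```
# work around https://github.com/apache/arrow/issues/34455
datasets<2.10.0
```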

simsa-st commented 1 year ago

I am running the trainings for 1 epoch to check that everything works now.

simsa-st commented 1 year ago

Btw, I found out that the HF dataset in arrow format is about 40x bigger when the images are included, so I think the chunking for the RoBERTa synthetic dataset is really unnecessary. I therefore just increased chunk_size there from 10000 to 100000, effectively turning chunking off for synthetic pretraining.
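To illustrate the effect (reusing the hypothetical preprocess_in_chunks sketch above; synthetic_documents and preprocess_example are placeholder names as well): with chunk_size larger than the number of synthetic documents, the loop runs only once, so the whole dataset is preprocessed in a single pass.

```python
from pathlib import Path

# chunk_size >= len(synthetic_documents) produces a single chunk,
# i.e. chunking is effectively disabled for synthetic pretraining.
dataset = preprocess_in_chunks(
    documents=synthetic_documents,     # placeholder: RoBERTa synthetic documents
    preprocess_fn=preprocess_example,   # placeholder preprocessing function
    arrow_dir=Path("preprocessed/roberta_synthetic"),
    chunk_size=100_000,                 # raised from 10_000
)
```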