tberg12 / ocular

Ocular is a state-of-the-art historical OCR system.
GNU General Public License v3.0

Language model training time #20

Closed DominikAgejev closed 5 months ago

DominikAgejev commented 5 months ago

Hi,

So, I tried training a language model on a dataset of about 765 book-length files (around 400 MB in total). For context, I'm a CS student, but new to training LMs.

I thought it wouldn't be a big deal, because when I ran the training on each file individually it finished in under 2 hours total. However, more than 12 hours in (on an 8-core 8th-gen i5 with 8 GB RAM), it still hasn't finished.

I have a couple of questions.

Is it possible to view the progress of the training in some way?

How long should I expect training to take with the dataset I have? What about with a third of it (see the subsetting sketch after this comment)? Does training time grow more than linearly with dataset size?

My GPU wasn't being utilized, and I only saw an option for hardware acceleration for font training, not for LM training. Is there such an option somewhere? What's my best bet for speeding up training?

Thanks in advance, Dominik
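
For the "third of the dataset" question above, here is a minimal sketch of how one might carve out a random subset of the corpus to train on. The directory names are hypothetical placeholders, not anything Ocular requires; point the training run at the subset directory afterwards.

```python
import random
import shutil
from pathlib import Path

SRC = Path("corpus")        # hypothetical: directory holding the 765 .txt files
DST = Path("corpus_third")  # hypothetical: destination for the subset
DST.mkdir(exist_ok=True)

files = sorted(SRC.glob("*.txt"))
random.seed(0)                                  # fixed seed so the subset is reproducible
subset = random.sample(files, len(files) // 3)  # a random third of the corpus

for f in subset:
    shutil.copy(f, DST / f.name)

print(f"Copied {len(subset)} of {len(files)} files to {DST}")
```

A random sample is preferable to just taking the first third, since book-length files can vary a lot in size and genre.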

DominikAgejev commented 5 months ago

Well, it sure looks like something abnormal was going on: with 76 files, the training took only 7 minutes.
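
The numbers in this thread are worth a sanity check: 765 files is roughly 10x the 76-file run, but 12+ hours versus 7 minutes is a more than 100x increase in time, which is closer to quadratic than linear growth (or a sign of memory pressure on an 8 GB machine). A rough way to measure the scaling empirically is sketched below; the `TRAIN_CMD` string is a hypothetical placeholder and must be replaced with the actual Ocular TrainLanguageModel invocation from the project's README.

```python
import random
import shutil
import subprocess
import tempfile
import time
from pathlib import Path

SRC = Path("corpus")  # hypothetical: directory holding the full corpus

# Hypothetical placeholder: substitute your real Ocular command line,
# leaving {indir} where the input-text directory goes.
TRAIN_CMD = "java -jar ocular.jar <TrainLanguageModel args> {indir}"

files = sorted(SRC.glob("*.txt"))
random.seed(0)

for n in (25, 50, 100, 200):  # increasing subset sizes
    with tempfile.TemporaryDirectory() as tmp:
        # Copy a random n-file subset into a scratch directory.
        for f in random.sample(files, n):
            shutil.copy(f, Path(tmp) / f.name)
        start = time.time()
        subprocess.run(TRAIN_CMD.format(indir=tmp), shell=True, check=True)
        print(f"{n} files: {time.time() - start:.0f} s")
```

If the time roughly quadruples each time the file count doubles, growth is quadratic, and training on a subset (or a machine with more RAM) is the practical workaround.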