tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Cannot allocate memory when "retraining from scratch" #148

Closed: mittpy closed this issue 4 years ago

mittpy commented 4 years ago

Hi, folks,

I would highly appreciate your help. I am trying to train Tesseract 4 from scratch on 400,000 single-line images containing characters from the MNIST dataset of handwritten digits.
The training does not finish (with either 400,000 or 350,000 single-line images) and fails with "cannot allocate memory". I have 16 GB RAM and I am using Windows Subsystem for Linux.
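For reference, such a from-scratch run with tesstrain is typically started like this (a minimal sketch; mnistlines is a placeholder model name, and variable names may differ between tesstrain versions):

```sh
# Sketch of a from-scratch run. Ground truth is expected in
# data/mnistlines-ground-truth/ as line images (*.png or *.tif)
# paired with transcriptions (*.gt.txt).
make training MODEL_NAME=mnistlines MAX_ITERATIONS=100000
```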

Your help would be invaluable. Thanks in advance!

wrznr commented 4 years ago

@mittpy I am not sure whether your approach can actually work. tesstrain is meant for training on lines of text, while the MNIST data set contains, as you know, images of single characters. I am not even sure whether the tesseract model training facilities support this particular use case. To find out whether your idea might work, please try with a small sample of the whole data set; this may also help to isolate the error.

mittpy commented 4 years ago

Hi, @wrznr ,

Thanks for your answer. I am using 400,000 single-line images; each line (image) consists of n randomly selected MNIST digits. I am not using single-character images.

It works with a smaller number of images; that has already been tested.
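For illustration, line images of this kind can be assembled from single MNIST digit images with ImageMagick, roughly as follows (a sketch only, not the script actually used here; it assumes each digit image file name starts with its label, e.g. 3_00042.png):

```sh
# Sketch: join 8 randomly chosen digit images into one line image and
# derive the matching transcription from the file names (hypothetical layout).
digits=$(ls mnist_png/*.png | shuf -n 8)
convert $digits +append line_00001.png
for f in $digits; do basename "$f" | cut -d_ -f1; done | tr -d '\n' > line_00001.gt.txt
```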

wrznr commented 4 years ago

Please post an example, and please try training with 1,000 lines first. This could turn out to be an interesting experiment.

wrznr commented 4 years ago

@mittpy Great! What is the largest number of images that still works? 10,000? 100,000?

mittpy commented 4 years ago

Thanks, @wrznr ,

I have tried "fine tuning" with up to 100,000 single-line images and "retraining from scratch" with 1,000 images. "Retraining from scratch" should be done with a larger dataset ... That is why I went for 400,000 images.
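For comparison, in tesstrain the two modes differ mainly in whether a start model is supplied (a rough sketch; the model names and tessdata path are placeholders, and variable names may differ between tesstrain versions):

```sh
# Fine tuning: continue from an existing model named by START_MODEL.
make training MODEL_NAME=mnistlines START_MODEL=eng TESSDATA=/path/to/tessdata

# Retraining from scratch: no START_MODEL, so a new network is trained.
make training MODEL_NAME=mnistlines
```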

wrznr commented 4 years ago

It would be great if you could run training from scratch with 100,000 images. Maybe there is some kind of (OS-dependent) hard upper boundary.

@stweil Do you have experience with such huge data sets?

stweil commented 4 years ago

@mittpy, there is no upper limit on the number of images supported for training, but the default setting caches all line images in RAM. So on computers with up to 8 GiB RAM you might run out of memory when the image cache fills all available RAM. See issue #109 for details.
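For what it's worth, lstmtraining appears to expose a --max_image_MB parameter that caps this image cache; a minimal sketch under that assumption, with placeholder paths following the tesstrain layout and the remaining training options omitted:

```sh
# Sketch: limit the image cache to roughly 4 GB (value in MB).
# Verify the parameter name and default with `lstmtraining --help` on your build.
lstmtraining \
  --max_image_MB 4000 \
  --traineddata data/mnistlines/mnistlines.traineddata \
  --model_output data/mnistlines/checkpoints/mnistlines \
  --train_listfile data/mnistlines/list.train
```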

stweil commented 4 years ago

> I have 16 GB RAM and I am using Windows Subsystem for Linux.

As Windows also requires some memory, this might still not be enough for caching all line images in the Linux subsystem. Try the --sequential_training option as explained in the documentation.
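For reference, the flag goes directly on the lstmtraining command line, roughly like this (a sketch; paths, the model name, and the network spec are placeholders, and the final O1c value must match the size of your unicharset):

```sh
# Sketch: from-scratch training with sequential (low-memory) data reading.
lstmtraining \
  --sequential_training \
  --traineddata data/mnistlines/mnistlines.traineddata \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c12]" \
  --model_output data/mnistlines/checkpoints/mnistlines \
  --train_listfile data/mnistlines/list.train \
  --eval_listfile data/mnistlines/list.eval \
  --max_iterations 100000
```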

mittpy commented 4 years ago

@stweil , @wrznr ,

You've been really helpful, guys :) I will give the --sequential_training option a try after the current attempt to "retrain from scratch" with 300,000 images finishes. I will keep you informed. Thanks a lot :)

mittpy commented 4 years ago

@stweil ,

Is the --sequential_training option appropriate when one is using 250 styles of handwritten digits, let's say 250 fonts? The documentation on sequential training says: "If ... data ..., ... is all from the SAME STYLE (a handwritten manuscript book for instance) then you can use the --sequential_training flag for lstmtraining."

stweil commented 4 years ago

I suggest simply trying it, as I don't have personal experience with it.