Closed: mittpy closed this issue 4 years ago
@mittpy I am not sure whether your approach can actually work. tesstrain is designed to train on lines of text, while the MNIST data set, as you know, contains images of single characters. I am not even sure whether Tesseract's model-training facilities support this particular use case. To find out whether your idea might work, please try with a small sample of the whole data set first. This may also help to isolate the error.
Hi, @wrznr ,
Thanks for your answer. I am using 400,000 single-line images; each line image consists of a number of randomly selected digits from MNIST. I am not using single-character images.
It works fine with a smaller number of images; I have already tested that.
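For reference, here is a minimal sketch of how such single-line images could be generated. This is an illustrative assumption, not the poster's actual pipeline: it presumes the MNIST digits are available as 28x28 grayscale NumPy arrays, and it follows tesstrain's convention of pairing each line image with a `.gt.txt` ground-truth file (saving to PNG, e.g. with Pillow, is left as a comment):

```python
import numpy as np

def make_line_image(digit_imgs, labels, pad=4):
    """Concatenate 28x28 MNIST digit arrays horizontally into one line image.

    Returns the line image (uint8 array) and its ground-truth string.
    """
    h = 28
    gap = np.full((h, pad), 255, dtype=np.uint8)   # white spacing column
    canvas = gap
    for img in digit_imgs:
        # MNIST stores white-on-black; invert to black-on-white for Tesseract
        canvas = np.hstack([canvas, 255 - img, gap])
    return canvas, "".join(str(label) for label in labels)

# Hypothetical usage: real digits would come from the MNIST data set.
digits = [np.random.randint(0, 256, (28, 28), dtype=np.uint8) for _ in range(5)]
line, gt = make_line_image(digits, [3, 1, 4, 1, 5])

# The image would then be saved as e.g. line_000000.png (with Pillow),
# and the ground truth written next to it as line_000000.gt.txt:
with open("line_000000.gt.txt", "w") as f:
    f.write(gt + "\n")
```

Each line image plus its `.gt.txt` file is one training sample; repeating this 400,000 times reproduces the data-set scale discussed in this thread.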
Please post an example, and please try training with 1,000 lines as a first step. This could turn out to be an interesting experiment.
@mittpy Great! What is the number of images which still works? 10,000? 100,000?
Thanks, @wrznr ,
I have tried fine-tuning with up to 100,000 single-line images and retraining from scratch with 1,000 images. Retraining from scratch should be done with a larger dataset; that is why I went for 400,000 images.
It would be great if you could run training from scratch with 100,000 images. Maybe there is some kind of (OS-dependent) hard upper boundary.
@stweil Do you have experience with such huge data sets?
@mittpy, there is no upper limit on the number of images supported for training, but the default setting caches all line images in RAM. So on computers with up to 8 GiB RAM you might run out of memory when the image cache fills all available RAM. See issue #109 for details.
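The RAM-cache explanation above can be made concrete with a rough back-of-envelope calculation. All figures here are assumptions for illustration (line dimensions and bytes per pixel are guesses, not measured values from this thread):

```python
# Rough estimate of the in-RAM line-image cache size.
# Assumptions (hypothetical): 400,000 line images, each ~28 px high and
# ~420 px wide (about 15 MNIST digits side by side), cached uncompressed
# at 4 bytes per pixel.
n_images = 400_000
height, width = 28, 420
bytes_per_pixel = 4

cache_bytes = n_images * height * width * bytes_per_pixel
print(f"{cache_bytes / 2**30:.1f} GiB")  # prints "17.5 GiB"
```

Under these assumptions the cache alone would exceed the 16 GB of RAM reported above, which is consistent with the "cannot allocate memory" failure.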
I have 16 GB RAM and I am using Windows Subsystem for Linux.
As Windows also requires some memory, this might still not be enough to cache all line images in the Linux subsystem. Try the --sequential_training option as explained in the documentation.
@stweil , @wrznr ,
You've been really helpful, guys :) I will give the --sequential_training option a try after the current attempt to retrain from scratch with 300,000 images finishes. I will keep you informed. Thanks a lot :)
@stweil ,
Is the --sequential_training option appropriate when one is using 250 styles of handwritten digits, i.e. 250 different fonts? The documentation on sequential training says: "If ... data ..., ... is all from the SAME STYLE (a handwritten manuscript book for instance) then you can use the --sequential_training flag for lstmtraining."
I suggest simply trying it, as I don't have personal experience with it.
Hi, folks,
I would highly appreciate your help. I am trying to train Tesseract 4 from scratch on 400,000 single-line images containing characters from the MNIST dataset of handwritten digits.
The training does not finish (with either 400,000 or 350,000 single-line images) and fails with "cannot allocate memory". I have 16 GB RAM and I am using Windows Subsystem for Linux.
Your help would be invaluable. Thanks in advance!