tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

The box file is overwritten in training process #336

Open vishakraj25 opened 1 year ago

vishakraj25 commented 1 year ago

Hello,

I followed the training procedure and used Tesseract to generate the .gt.txt and .box files for the line images.

Then I corrected/annotated the .gt.txt and .box files, added them to the data directory, and started the training.

During training, all the .box files are overwritten. Why is this happening?

For example, take this image:

[Image: MT_Bank_1_22]

The corresponding box file, as overwritten during training, is:

[Image: Screenshot from 2023-02-24 11-57-51]

In the .box file, I did not annotate the 7th line, which contains \t. All the .box files contain a \t at the end.

Suppose the \t at the end is there because the image has extra space at the end of the line.

But the image also has space at the start of the line, and the .box file has no spaces or tabs at the start.

What is the intended workflow for the .box annotation files?

Is it possible to stop the .box annotations from being overwritten during training?

Thanks

vishakraj25 commented 1 year ago

Also, the coordinates are the same for all characters. Shouldn't the box file have separate coordinates for each character?
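To check these two observations on a given file rather than by eye, a minimal Python sketch follows. It assumes the usual Tesseract box line layout, `<symbol> <left> <bottom> <right> <top> <page>`, and reports whether all glyph lines share one box and whether the file ends with a tab entry. The names `Box`, `parse_box_lines`, and `summarize` are hypothetical helpers, and `SAMPLE` is made-up data, not the content of the screenshot above.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(text: str) -> List[Box]:
    """Parse box-file text into Box records, one per non-empty line."""
    boxes = []
    for raw in text.split("\n"):
        if not raw:
            continue
        # Split from the right so a symbol that is itself a space or tab survives.
        symbol, *numbers = raw.rsplit(" ", 5)
        left, bottom, right, top, page = (int(n) for n in numbers)
        boxes.append(Box(symbol, left, bottom, right, top, page))
    return boxes


def summarize(boxes: List[Box]) -> dict:
    """Report the two properties discussed in the thread."""
    glyphs = [b for b in boxes if b.symbol != "\t"]
    # Compare only the four coordinates (left, bottom, right, top).
    shared = len({b[1:5] for b in glyphs}) <= 1
    return {
        "glyph_count": len(glyphs),
        "glyphs_share_one_box": shared,
        "ends_with_tab": bool(boxes) and boxes[-1].symbol == "\t",
    }


# Hypothetical sample (assumption): three glyphs with one shared box, then a tab entry.
SAMPLE = "M 0 0 200 40 0\nT 0 0 200 40 0\n  0 0 200 40 0\n\t 200 40 201 41 0\n"
summary = summarize(parse_box_lines(SAMPLE))
print(summary)
```

To run it on a real file, read the text with `open(path, encoding="utf-8").read()` and pass it to `parse_box_lines`. Splitting from the right is a deliberate choice: stripping or splitting on whitespace from the left would lose space and tab symbols.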

zdenop commented 1 year ago

Please provide an example case for replicating the problem. Also, which training procedure did you follow? Please provide a link.

vishakraj25 commented 1 year ago

Hi, I followed the training procedure described in the README of this repo, with the help of this tutorial, which has good content and makes the training steps easy to understand: https://www.youtube.com/watch?v=KE4xEzFGSU8

I also tried on a new system today, and the same issue happened again.

khashashin commented 1 year ago

@vishakraj25 In my case, it turned out that I didn't even have to create any box files myself https://github.com/tesseract-ocr/tesstrain/issues/338#issuecomment-1487982907
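If, as this comment suggests, only line images and their `.gt.txt` transcriptions need to be supplied, it may help to confirm that every image has a matching transcription before training. The sketch below is one way to do that in Python. The function name `find_unpaired` is hypothetical, and the `.png`/`.tif` extension list is an assumption. Double extensions such as `.bin.png` are not handled.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Assumption: line images use one of these extensions.
IMAGE_SUFFIXES = (".png", ".tif")
GT_SUFFIX = ".gt.txt"


def find_unpaired(ground_truth_dir: Path) -> Dict[str, List[str]]:
    """List base names that have an image but no .gt.txt, and vice versa."""
    images = set()
    transcripts = set()
    for path in ground_truth_dir.iterdir():
        if path.name.endswith(GT_SUFFIX):
            transcripts.add(path.name[: -len(GT_SUFFIX)])
        elif path.suffix.lower() in IMAGE_SUFFIXES:
            images.add(path.stem)
    return {
        "images_without_gt": sorted(images - transcripts),
        "gt_without_image": sorted(transcripts - images),
    }


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "line_1.png").touch()
    (root / "line_1.gt.txt").write_text("MT Bank\n", encoding="utf-8")
    (root / "line_2.png").touch()  # no transcription for this one
    report = find_unpaired(root)

print(report)
```

In practice, `find_unpaired` would be pointed at the ground-truth directory used for training. The temporary directory here only keeps the example self-contained.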
