tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

After running the make training command, only the all-boxes file is created. #275

Closed ozlem-atiz closed 2 years ago

ozlem-atiz commented 3 years ago

Hello, please help me. I created my dataset. When I ran the "make training" command, the all-boxes file and a .box file for each .tif file were created, but the traineddata file and the other files that accompany it were not. I get the following error after the make training command:

unicharset_extractor: symbol lookup error: unicharset_extractor: undefined symbol: _ZN10UNICHARSET11null_scriptE
Makefile:103: recipe for target 'data/unicharset' failed

@kba @wrznr @nebiyebln

wrznr commented 3 years ago

Okay, you are mixing two problems here. tesstrain is meant to train models for Tesseract's text recognition part, which since version 4 operates on the line level. It is not made to train end-to-end OCR workflows. The data you are providing as input to the training are real-world images which contain, e.g., multiline plates, skewed lines, plate surroundings, etc. Tesseract has text localization mechanisms which find text in such images and then apply text recognition to the areas that contain text. However, these mechanisms are not yet trainable, and in particular not with tesstrain. Sorry!
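To check whether a dataset matches the line-level input described above, a small script can help. This is a minimal sketch, assuming the ground-truth layout from the tesstrain README: each line image `NAME.tif` sits next to a single-line transcription `NAME.gt.txt`. The function name `ground_truth_problems` and the directory passed to it are hypothetical; the check only covers `.tif` images.

```python
from pathlib import Path


def ground_truth_problems(ground_truth_dir):
    """List (image name, reason) pairs for .tif line images whose
    transcription is missing or is not exactly one line of text."""
    problems = []
    for image in sorted(Path(ground_truth_dir).glob("*.tif")):
        # Assumption: the transcription is named like the image,
        # with ".tif" replaced by ".gt.txt".
        transcript = image.with_name(image.stem + ".gt.txt")
        if not transcript.is_file():
            problems.append((image.name, "missing .gt.txt"))
            continue
        lines = transcript.read_text(encoding="utf-8").strip().splitlines()
        if len(lines) != 1:
            problems.append((image.name, "transcription is not a single line"))
    return problems
```

Calling `ground_truth_problems("data/MODEL_NAME-ground-truth")` returns an empty list when every `.tif` has a one-line transcription. A multi-line transcription is a hint that the image holds more than one text line and would have to be split before training.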

ozlem-atiz commented 3 years ago

Can we get around this problem if we remove square plates as well as multi-line plates?

ozlem-atiz commented 3 years ago

Our project is a license plate recognition system. We tried to train Tesseract because we could not get very accurate results when reading the plates. Is there any other way? We want to get close to 100% accuracy. Please help me, @wrznr.

ozlem-atiz commented 3 years ago

https://www.kaggle.com/tustunkok/synthetic-turkish-license-plates do you think reading efficiency for real license plates will increase if we use them?

wrznr commented 3 years ago

Can we get around this problem if we remove square plates as well as multi-line plates?

What you will need is a tool for the “layout” analysis (i.e. line detection) of the plate images. You can test whether Tesseract does a good job here if you feed an image on the command line and inspect the ALTO output which shows the coordinates of the detected text regions.

do you think reading efficiency for real license plates will increase if we use them?

Yes, this could be a way. However, the images provided there are very clean. Most likely real-world number sequences will look different.

ozlem-atiz commented 3 years ago

I have not heard of this method before. Could you please point me to a source? I could not follow how it works.

wrznr commented 3 years ago

This is the usual OCR workflow: localizing the text (i.e. layout and line detection) takes place before recognizing the text. You may want to have a look at introductory volumes like https://www.worldscientific.com/worldscibooks/10.1142/2757 or https://www.springer.com/gp/book/9780792384922

wrznr commented 3 years ago

In addition, https://towardsdatascience.com/a-gentle-introduction-to-ocr-ee1469a201aa looks promising for your task.

wrznr commented 3 years ago

But first up, test Tesseract (I used https://de.wiktionary.org/wiki/number_plate#/media/Datei:KFZmod.png):

$ tesseract ~/Downloads/KFZmod.png - -l eng alto

returns

<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
    <Description>
        <MeasurementUnit>pixel</MeasurementUnit>
        <sourceImageInformation>
            <fileName></fileName>
        </sourceImageInformation>
        <OCRProcessing ID="OCR_0">
            <ocrProcessingStep>
                <processingSoftware>
                    <softwareName>tesseract 5.0.0-alpha-20210401</softwareName>
                </processingSoftware>
            </ocrProcessingStep>
        </OCRProcessing>
    </Description>
    <Layout>
        <Page WIDTH="3073" HEIGHT="632" PHYSICAL_IMG_NR="0" ID="page_0">
            <PrintSpace HPOS="0" VPOS="0" WIDTH="3073" HEIGHT="632">
                <ComposedBlock ID="cblock_0" HPOS="609" VPOS="71" WIDTH="2399" HEIGHT="463">
                    <TextBlock ID="block_0" HPOS="609" VPOS="71" WIDTH="2399" HEIGHT="463">
                        <TextLine ID="line_0" HPOS="609" VPOS="71" WIDTH="2399" HEIGHT="463">
                            <String ID="string_0" HPOS="609" VPOS="71" WIDTH="1165" HEIGHT="460" WC="0.0" CONTENT="AKL"/><SP WIDTH="124" VPOS="71" HPOS="1774"/>
                            <String ID="string_1" HPOS="1898" VPOS="85" WIDTH="1110" HEIGHT="449" WC="0.95" CONTENT="8136"/>
                        </TextLine>
                    </TextBlock>
                </ComposedBlock>
                <ComposedBlock ID="cblock_1" HPOS="2" VPOS="2" WIDTH="3070" HEIGHT="629">
                    <TextBlock ID="block_1" HPOS="2" VPOS="2" WIDTH="3070" HEIGHT="629">
                        <TextLine ID="line_1" HPOS="2" VPOS="2" WIDTH="3070" HEIGHT="629">
                            <String ID="string_2" HPOS="2" VPOS="2" WIDTH="3070" HEIGHT="629" WC="0.95" CONTENT=" "/>
                        </TextLine>
                    </TextBlock>
                </ComposedBlock>
            </PrintSpace>
        </Page>
    </Layout>
</alto>

You can use Aletheia or PageViewer to visualize the XML output. Most of your accuracy problems will boil down to suboptimal localization of the text and not to its erroneous recognition.
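If a viewer is not at hand, the detected line regions can also be read straight from the ALTO output. The following is a minimal sketch using only Python's standard library; the function name `text_lines` is hypothetical, the namespace is the ALTO v3 one from the output above, and `SAMPLE` is a shortened copy of that output.

```python
import xml.etree.ElementTree as ET

# Namespace as it appears in the Tesseract output above (ALTO v3).
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

# Shortened copy of the ALTO output shown above.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page WIDTH="3073" HEIGHT="632" PHYSICAL_IMG_NR="0" ID="page_0">
      <PrintSpace HPOS="0" VPOS="0" WIDTH="3073" HEIGHT="632">
        <TextLine ID="line_0" HPOS="609" VPOS="71" WIDTH="2399" HEIGHT="463">
          <String ID="string_0" HPOS="609" VPOS="71" WIDTH="1165" HEIGHT="460" WC="0.0" CONTENT="AKL"/>
          <String ID="string_1" HPOS="1898" VPOS="85" WIDTH="1110" HEIGHT="449" WC="0.95" CONTENT="8136"/>
        </TextLine>
        <TextLine ID="line_1" HPOS="2" VPOS="2" WIDTH="3070" HEIGHT="629">
          <String ID="string_2" HPOS="2" VPOS="2" WIDTH="3070" HEIGHT="629" WC="0.95" CONTENT=" "/>
        </TextLine>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def text_lines(alto_xml):
    """Return one dict per TextLine: its ID, its box as
    (HPOS, VPOS, WIDTH, HEIGHT) in pixels, and its joined text."""
    root = ET.fromstring(alto_xml)
    lines = []
    for line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "") for s in line.iterfind("alto:String", ALTO_NS)]
        # Skip strings that contain only whitespace.
        text = " ".join(w for w in words if w.strip())
        box = tuple(int(line.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        lines.append({"id": line.get("ID"), "box": box, "text": text})
    return lines


for entry in text_lines(SAMPLE):
    print(entry["id"], entry["box"], repr(entry["text"]))
```

For the sample, `line_0` carries the plate text, while `line_1` covers almost the whole image and has no text. Regions like the second one are the kind of localization result worth inspecting before blaming the recognition step.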

ozlem-atiz commented 3 years ago

I am using Qt-box-editor. With this app I was able to correct the wrongly recognized characters. However, it creates one .traineddata file for each image. If I have hundreds of .traineddata files, I probably won't be able to combine them all; at least I don't know how to do it. I looked at your sources. The PageViewer tool is suitable for my project, but the application does not open.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.