tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Q&A: Technical drawings OCR #3807

Closed ViterAlex closed 2 years ago

ViterAlex commented 2 years ago

Environment

Current Behavior:

Doesn't recognize technical drawings

Expected Behavior:

Recognize dimensions text and symbols on drawings

Suggested Fix:

It's not actually an issue but a question. I'm a junior .NET developer. I'm trying to OCR technical (mechanical) drawings and am looking for a recognition tool. Is it possible to train the Tesseract engine on such specific symbols?

Sorry, but it's really hard to find any information about this

wollmers commented 2 years ago

Current Behavior:

Doesn't recognize technical drawings

Tesseract mainly supports OCR of text documents.

It's not actually an issue but a question. I'm a junior .NET developer. I'm trying to OCR technical (mechanical) drawings and am looking for a recognition tool. Is it possible to train the Tesseract engine on such specific symbols?

You can train Tesseract on any symbols. The ones in the sample have Unicode code points:

$ uni print U+2316 U+2300 U+23CA U+27C2
     cpoint  dec    utf-8       html       name
'⌀'  U+2300  8960   e2 8c 80    ⌀   DIAMETER SIGN (Other_Symbol)
'⌖'  U+2316  8982   e2 8c 96    ⌖   POSITION INDICATOR (Other_Symbol)
'⏊'  U+23CA  9162   e2 8f 8a    ⏊   DENTISTRY SYMBOL LIGHT UP AND HORIZONTAL (Other_Symbol)
'⟂'  U+27C2  10178  e2 9f 82    ⟂   PERPENDICULAR (Math_Symbol)

I guess there are many more symbols in use, and some of them are not defined in Unicode. For those, you can assign your own code points in the PUA (Private Use Area).
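As a rough illustration of the PUA idea, here is a minimal Python sketch. It looks up the four symbols from the sample with the standard `unicodedata` module and hands out code points from the Basic Multilingual Plane PUA (U+E000 to U+F8FF) to symbols that have none. The names in `CUSTOM_SYMBOLS` and the helper functions are hypothetical placeholders, not part of Tesseract or its training tools.

```python
import unicodedata

# Symbols from the sample that already have Unicode code points.
STANDARD_SYMBOLS = [0x2300, 0x2316, 0x23CA, 0x27C2]

# Hypothetical placeholder names for drawing symbols without a code point.
CUSTOM_SYMBOLS = ["custom_symbol_a", "custom_symbol_b", "custom_symbol_c"]

# Private Use Area of the Basic Multilingual Plane.
PUA_START, PUA_END = 0xE000, 0xF8FF


def assign_pua(names, start=PUA_START):
    """Map each custom symbol name to consecutive PUA characters."""
    if start < PUA_START or start + len(names) - 1 > PUA_END:
        raise ValueError("not enough room in the PUA for these symbols")
    return {name: chr(start + i) for i, name in enumerate(names)}


def describe(ch):
    """Return (code point label, Unicode name or a private-use note)."""
    label = f"U+{ord(ch):04X}"
    if unicodedata.category(ch) == "Co":  # "Co" is the private-use category
        return label, "<private use>"
    return label, unicodedata.name(ch, "<unnamed>")


pua_map = assign_pua(CUSTOM_SYMBOLS)

for cp in STANDARD_SYMBOLS:
    print(*describe(chr(cp)))
for name, ch in pua_map.items():
    print(*describe(ch), name)
```

Whatever mapping you choose has to be used consistently in the training text and when interpreting the OCR output, because PUA code points carry no meaning outside your own project.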

You are on your own when it comes to training a model for technical drawings. And for which kind? CAD/CAM for machinery, buildings, electrical, plumbing, printed circuits, cartography, meteorology, military, gardening?

The standard models support languages and writing systems. That is where most Tesseract developers and contributors have their expertise.

Tesseract has problems removing the frames of table-like objects. You must remove them yourself during preprocessing.
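One possible way to do that preprocessing, sketched in plain Python: treat the binarized page as a grid of 0/1 pixels and blank out every horizontal or vertical run of ink that is longer than any text stroke. The function name `remove_long_runs` and the `min_len` threshold are assumptions made for this example. A real pipeline would more likely use an image library, and long glyph strokes or dimension lines would be removed as well, so the threshold needs tuning per drawing.

```python
def remove_long_runs(image, min_len):
    """Blank out horizontal and vertical ink runs of at least min_len pixels.

    image is a list of rows; 1 means ink, 0 means background.
    Returns a new image and leaves the input untouched.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    to_clear = set()

    def scan(length, pixel_at, coord_at):
        # Walk one row or column and collect ink runs that are long enough.
        run = []
        for k in range(length + 1):
            if k < length and pixel_at(k):
                run.append(coord_at(k))
            else:
                if len(run) >= min_len:
                    to_clear.update(run)
                run = []

    for y in range(height):
        scan(width, lambda x, y=y: image[y][x], lambda x, y=y: (y, x))
    for x in range(width):
        scan(height, lambda y, x=x: image[y][x], lambda y, x=x: (y, x))

    return [
        [0 if (y, x) in to_clear else image[y][x] for x in range(width)]
        for y in range(height)
    ]


# A tiny framed cell containing two small glyph-like shapes.
framed = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
]

cleaned = remove_long_runs(framed, min_len=5)
for row in cleaned:
    print("".join("#" if p else "." for p in row))
```

The frame is longer than `min_len` in both directions and gets cleared, while the short strokes inside the cell survive.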

Sorry, but it's really hard to find any information about this

It's an FAQ. You should ask such questions in the forum.

medhanshrath-t commented 1 year ago

Environment

  • 4.1.1:
  • Commit Number:
  • Windows:

Current Behavior:

Doesn't recognize technical drawings

Expected Behavior:

Recognize dimensions text and symbols on drawings

Suggested Fix:

It's not actually an issue but a question. I'm a junior .NET developer. I'm trying to OCR technical (mechanical) drawings and am looking for a recognition tool. Is it possible to train the Tesseract engine on such specific symbols?

Sorry, but it's really hard to find any information about this

Hey, did you have any luck figuring this out?

amitdo commented 1 year ago

Please use our forum for questions.