Open GoogleCodeExporter opened 8 years ago
Dear curtisjohnston,
Sorry, I cannot post all of my modifications, but I hope the information below
helps.
The AnalyseLayout function only collects the bounding rectangles of the
recognized layout items.
If you really want the confidence of the recognized text, you can add and modify
a few lines of code as follows:
- 1. After the Recognize() step, get and store the result iterator pointer, as
in the Recognize(...) method.
- 2. Add a method CollectRecongnizedResults(ResultIteratorBase
resultIterator) to RecognitionItem.
- 3. Use the tesseract API directly.
* In your case ResultIteratorBase is ResultIterator (see baseapi.h for the
GetIterator() method).
* Code (1) and (2) comes from a wrapper I wrote for .NET, so you will have to
port it back to C++; I think that is no problem for you.
(1)
public virtual String Recognize(Image image, ref DocumentLayout doc)
{
    _collectResultDetails = true;
    try
    {
        // clear the document, or create it if needed
        if (doc == null)
            doc = new DocumentLayout();
        else
            doc.Blocks.Clear();
        // run OCR; the native side stores the result iterator (see (3))
        String txt = Recognize(image);
        // collect details here
        if (_collectResultDetails && _resultIterator != null)
            doc.CollectRecongnizedResults(_resultIterator);
        return txt;
    }
    catch
    {
        throw;
    }
    finally
    {
        _collectResultDetails = false;
        DisposeResultDetailCollector();
    }
}
(2)
public virtual void CollectRecongnizedResults(ResultIteratorBase resultIterator)
{
    ePageIteratorLevel curLevel = this.GetPageIteratorLevel();
    // recognition confidence, scaled by 0.01
    this.Confidence = 0.01 * resultIterator.GetConfidence(curLevel);
    // get level-specific features
    switch (_pageLevel)
    {
        case ePageIteratorLevel.RIL_SYMBOL:
            String txt = resultIterator.GetUTF8Text(curLevel);
            (this as Character).Value = (txt != null && txt.Length > 0 ? txt[0] : '$');
            (this as Character).IsSuperscript = resultIterator.SymbolIsSuperscript();
            (this as Character).IsSubscript = resultIterator.SymbolIsSubscript();
            (this as Character).IsDropcap = resultIterator.SymbolIsDropcap();
            break;
        case ePageIteratorLevel.RIL_WORD:
            (this as Word).Text = resultIterator.GetUTF8Text(curLevel);
            RecognitionFont recognizedFont = resultIterator.GetWordFontAttributes();
            (this as Word).RecognizedFont = recognizedFont;
            (this as Word).IsNumeric = resultIterator.WordIsNumeric();
            (this as Word).Direction = resultIterator.WordDirection();
            break;
        default:
            break;
    }
    RecognitionItem child = this.CreateChild();
    if (child == null) // this is the lowest level
    {
        resultIterator.GetBoundingBox(
            this.GetPageIteratorLevel(),
            ref Left, ref Top, ref Right, ref Bottom);
        return;
    }
    ePageIteratorLevel nextLevel = this.GetNextPageIteratorLevel();
    resultIterator.GetBoundingBox(
        curLevel, ref Left, ref Top, ref Right, ref Bottom);
    if (resultIterator.IsAtBeginningOf(nextLevel))
    {
        // get the first item
        child.CollectRecongnizedResults(resultIterator);
        this.AddItem(child);
        if (resultIterator.IsAtFinalElement(curLevel, nextLevel))
            return;
        // get the remaining items
        while (resultIterator.Next(nextLevel))
        {
            child = this.CreateChild();
            child.CollectRecongnizedResults(resultIterator);
            this.AddItem(child);
            if (resultIterator.IsAtFinalElement(curLevel, nextLevel))
                break;
        }
    }
}
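For anyone porting (2), here is a minimal, self-contained C++ sketch of the same traversal idea: walk a flat iterator and rebuild the line / word / symbol tree from the level boundaries. It does not use tesseract at all. MockIterator, Glyph, Node, Collect, CollectPage and DemoGlyphs are hypothetical names made up for illustration; the method names only mirror the iterator calls used in (2) and are reimplemented here over a plain array, so treat it as an assumption-laden sketch rather than the wrapper's real code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Everything here is a hypothetical stand-in; nothing is taken from tesseract.
enum Level { LINE = 0, WORD = 1, SYMBOL = 2 };

struct Glyph { char ch; int word; int line; float conf; };  // conf in 0..100

class MockIterator {
public:
    explicit MockIterator(std::vector<Glyph> g) : glyphs_(std::move(g)) {
        assert(!glyphs_.empty());
    }
    // True when the current glyph starts a new element at `level`.
    bool IsAtBeginningOf(Level level) const {
        return pos_ == 0 || Key(pos_, level) != Key(pos_ - 1, level);
    }
    // True when the current `element` is the last one inside its `level`.
    bool IsAtFinalElement(Level level, Level element) const {
        std::size_t j = NextStart(element);
        return j == glyphs_.size() || Key(j, level) != Key(pos_, level);
    }
    // Moves to the start of the next element at `level`; false at the end.
    bool Next(Level level) {
        std::size_t j = NextStart(level);
        if (j == glyphs_.size()) return false;
        pos_ = j;
        return true;
    }
    // Text of the element at `level` that starts at the current glyph.
    std::string Text(Level level) const {
        std::string s;
        std::size_t end = NextStart(level);
        for (std::size_t i = pos_; i < end; ++i) s += glyphs_[i].ch;
        return s;
    }
    // Mean glyph confidence (0..100) of the element that starts here.
    float Confidence(Level level) const {
        float sum = 0;
        std::size_t end = NextStart(level);
        for (std::size_t i = pos_; i < end; ++i) sum += glyphs_[i].conf;
        return sum / static_cast<float>(end - pos_);
    }
private:
    int Key(std::size_t i, Level level) const {
        if (level == LINE) return glyphs_[i].line;
        if (level == WORD) return glyphs_[i].word;  // word ids are unique page-wide
        return static_cast<int>(i);                 // every glyph is its own symbol
    }
    // Index of the first glyph after the current element at `level`.
    std::size_t NextStart(Level level) const {
        std::size_t j = pos_ + 1;
        while (j < glyphs_.size() && Key(j, level) == Key(pos_, level)) ++j;
        return j;
    }
    std::vector<Glyph> glyphs_;
    std::size_t pos_ = 0;
};

struct Node {
    Level level;
    std::string text;
    double confidence;           // scaled by 0.01, as in (2)
    std::vector<Node> children;
};

// Recursive collection: first child, then Next() until the final element.
Node Collect(MockIterator& it, Level cur) {
    Node node{cur, it.Text(cur), 0.01 * it.Confidence(cur), {}};
    if (cur == SYMBOL) return node;                 // lowest level, no children
    Level next = static_cast<Level>(cur + 1);
    if (!it.IsAtBeginningOf(next)) return node;
    node.children.push_back(Collect(it, next));
    while (!it.IsAtFinalElement(cur, next) && it.Next(next))
        node.children.push_back(Collect(it, next));
    return node;
}

std::vector<Node> CollectPage(MockIterator& it) {
    std::vector<Node> lines;
    do { lines.push_back(Collect(it, LINE)); } while (it.Next(LINE));
    return lines;
}

// Two lines: "Hi yo" (two words) and "ok" (one word).
std::vector<Glyph> DemoGlyphs() {
    return { {'H', 0, 0, 90}, {'i', 0, 0, 80}, {'y', 1, 0, 70},
             {'o', 1, 0, 70}, {'o', 2, 1, 60}, {'k', 2, 1, 60} };
}
```

To try it, build a MockIterator from DemoGlyphs() and pass it to CollectPage(); each returned Node is a line whose children are words, and each word's children are symbols. The point of the design is that the iterator is never rewound: each recursive call leaves it inside the last child it visited, and the parent only has to ask whether that child was the final one before calling Next().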
(3)
String* OCRProcessor::Recognize(TessBaseAPI* api, Pix* pix)
{
    if (api == null || pix == null)
        return null;
    // dispose the previous result collector, if any
    this->DisposeResultDetailCollector();
    api->SetImage(pix);
    bool succeed = api->Recognize(null) >= 0;
    // on success, keep the result iterator if details were requested
    if (succeed && _collectResultDetails)
    {
        _resultIterator =
            new ResultIteratorWrapper(api->GetIterator());
    }
    char* text = null;
    String* result = null;
    try
    {
        text = api->GetUTF8Text();
        result = Helper::ToUTF8String(text);
    }
    catch (System::Exception* exp)
    {
        throw exp;
    }
    __finally
    {
        // GetUTF8Text() returns a buffer the caller must release with delete[]
        if (text != null)
        {
            delete[] text;
            text = null;
        }
    }
    return result;
}
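If you port (3) to plain C++, the try / __finally cleanup of the text buffer can be expressed with RAII instead. The sketch below is self-contained and does not call tesseract: FakeGetUTF8Text and TakeText are hypothetical names, and FakeGetUTF8Text only imitates the ownership contract used in (3), namely that the caller receives a new[]-allocated, NUL-terminated buffer and must release it with delete[].

```cpp
#include <cstring>
#include <memory>
#include <string>

// Hypothetical stand-in for api->GetUTF8Text(): hands the caller a
// new[]-allocated, NUL-terminated buffer that the caller owns.
char* FakeGetUTF8Text() {
    const char* src = "recognized text";
    char* buf = new char[std::strlen(src) + 1];
    std::strcpy(buf, src);
    return buf;
}

// Copies the text into a std::string. The unique_ptr<char[]> releases the
// buffer with delete[] on every exit path, including when an exception is
// thrown while the std::string is being built.
std::string TakeText(char* (*producer)()) {
    std::unique_ptr<char[]> text(producer());
    return text ? std::string(text.get()) : std::string();
}
```

Calling TakeText(FakeGetUTF8Text) yields the copied string; with the real API you would wrap the pointer returned by GetUTF8Text() in the same kind of unique_ptr<char[]>.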
Original comment by congnguy...@gmail.com
on 2 Apr 2012 at 4:34
I have a multipage TIFF file and I want every page to be OCRed by
tesseract.dll. I am using tesseract.dll 3.01 with C#/.NET. I have also set the
following:
_ocrProcessor.SetVariable("tessedit_page_number", "-1");
but the DLL always OCRs only the first page. Please help.
Original comment by subhajit...@gmail.com
on 5 Oct 2012 at 12:56
Original issue reported on code.google.com by
curtisjo...@gmail.com
on 15 Mar 2012 at 2:19