tvn-cosine / tesseract.net

a .net wrapper for Tesseract
GNU General Public License v3.0

PDF render #33

Open levpius opened 5 years ago

levpius commented 5 years ago

Generating a searchable PDF takes approximately (<) 3 times longer than with an implementation based on Tesseract 3. We are using code similar to the example in the wiki:

            PageSegmentationMode psm = PageSegmentationMode.AUTO_OSD;
            TessBaseAPI.SetPageSegMode(psm);
            using (var pix = TessBaseAPI.SetImage(imageFilePath))
            {
                pix.pixDeskew(0);
                TessBaseAPI.Recognize();
                //ensure input name is set
                TessBaseAPI.SetInputName(imageFilePath);
                string tessDataPath = TessBaseAPI.GetDatapath();
                using (var pdfRenderer =
                    new PdfRenderer(destinationPdfFilePathWithoutExt, tessDataPath,
                        false))
                {
                    pdfRenderer.BeginDocument(destinationPdfFileNameWithoutExt);
                    pdfRenderer.AddImage(TessBaseAPI);
                    pdfRenderer.EndDocument();
                }
            }

Since this slowdown is very apparent to the user, please let us know if we are doing anything wrong.