Windows - point not allowed in filename

toninlg commented 10 years ago

Hi,

It seems that dots are not allowed in file names: the name is truncated at the first dot during processing. With 0.7.6 and a file named cut.1.pdf, the error is "The system cannot find the file specified: 'text_cut_ocr.pdf'". The files that get created are cut.1_1.png, cut.1_1.html and cut_ocr.pdf. There is OCR text in the output, but it seems to be the same for all pages.

Thank you.

rgbyrnes commented 10 years ago

This isn't specific to Windows. Here are some fixes for the bugs that we found:

diff --git a/pypdfocr/pypdfocr_filer.py b/pypdfocr/pypdfocr_filer.py
index 6217f90f..705e48fd 100644
--- a/pypdfocr/pypdfocr_filer.py
+++ b/pypdfocr/pypdfocr_filer.py
@@ -66,10 +66,8 @@ class PyFiler(object):

     def _split_filename_dir_filename_ext(self, filename):
         dr, fn = os.path.split(filename) # Get directory and filename
-        fn_no_ext = fn.split('.')[0:-1] # Get the filename without ending extension
-        fn_no_ext = ''.join(fn_no_ext)
-        ext = fn.split('.')[-1]
-        return dr, fn_no_ext, ext
+        fn_no_ext, ext = os.path.splitext(fn) # Get root and extension
+        return dr, fn_no_ext, ext[1:] # Remove leading '.' from extension

     def get_target_folder(self):
         return self._target_folder

diff --git a/pypdfocr/pypdfocr_pdf.py b/pypdfocr/pypdfocr_pdf.py
index 30f88d2f..4d03f061 100644
--- a/pypdfocr/pypdfocr_pdf.py
+++ b/pypdfocr/pypdfocr_pdf.py
@@ -112,7 +112,7 @@ class PyPdf(object):
             writer.addPage(orig_pg)

         pdf_dir, pdf_basename = os.path.split(orig_pdf_filename)
-        basename = pdf_basename.split('.')[0]
+        basename = os.path.splitext(pdf_basename)[0]
         pdf_filename = os.path.join(pdf_dir, "%s_ocr.pdf" % (basename))
         with open(pdf_filename, 'wb') as f:
             writer.write(f)
@@ -144,7 +144,7 @@ class PyPdf(object):
         logging.debug("hocr_filename:%s, hocr_dir:%s, hocr_basename:%s" % (hocr_filename, hocr_dir, hocr_basename))
         assert(img_dir == hocr_dir)

-        basename = hocr_basename.split('.')[0]
+        basename = os.path.splitext(hocr_basename)[0]
         pdf_filename = os.path.join("text_%s_ocr.pdf" % (basename))

         # Switch to the hocr directory to make this easier
@@ -177,7 +177,7 @@ class PyPdf(object):
     def overlay_hocr_old(self, dpi, hocr_filename):
         hocr_dir, hocr_basename = os.path.split(hocr_filename)
         logging.debug("hocr_filename:%s, hocr_dir:%s, hocr_basename:%s" % (hocr_filename, hocr_dir, hocr_basename))
-        basename = hocr_basename.split('.')[0]
+        basename = os.path.splitext(hocr_basename)[0]
         pdf_filename = os.path.join("%s_ocr.pdf" % (basename))
         text_pdf_filename = pdf_filename + ".tmp"

@@ -227,7 +227,7 @@ class PyPdf(object):
                 pg_num = i+1
                 # Do a quick assert to make sure our sorted page number matches
                 # what's embedded in the filename
-                file_pg_num = int(img_file.split(basename+"_")[1].split('.')[0])
+                file_pg_num = int(os.path.splitext(img_file.split(basename+"_")[1])[0])
                 if file_pg_num != pg_num:
                     logging.warn("Page number from file (%d) does not match iteration (%d)... continuing anyway" % (file_pg_num, pg_num))
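
For anyone skimming the patch, here is a minimal sketch of the difference it relies on. The helper names (naive_basename, splitext_basename, split_filename) are made up for illustration and are not part of pypdfocr; only os.path.split and os.path.splitext are real library calls.

```python
import os.path


def naive_basename(filename):
    # Pre-patch approach in pypdfocr_pdf.py: keep only the text before the FIRST dot.
    return filename.split('.')[0]


def splitext_basename(filename):
    # Patched approach: strip only the final extension.
    return os.path.splitext(filename)[0]


def split_filename(path):
    # Hypothetical stand-in for the patched _split_filename_dir_filename_ext:
    # returns (directory, name without extension, extension without the dot).
    dr, fn = os.path.split(path)
    root, ext = os.path.splitext(fn)
    return dr, root, ext[1:]


if __name__ == "__main__":
    print(naive_basename("cut.1.pdf"))     # cut
    print(splitext_basename("cut.1.pdf"))  # cut.1
    print(split_filename(os.path.join("scans", "cut.1.pdf")))
```

With the naive version, cut.1.pdf collapses to cut, which matches the cut_ocr.pdf and text_cut_ocr.pdf names in the original report. Note that os.path.splitext returns an empty extension for a name with no dot, so ext[1:] is simply an empty string in that case.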

virantha commented 10 years ago

Thanks! Let me try to roll these patches into the next release.

virantha commented 10 years ago

Fixed in 0.8.0

toninlg commented 10 years ago

Thanks to both of you, it's working with 0.8.0. I just get a bunch of warnings, but the file output is OK.

WARNING: Could not run command convert 'file.name_7.jpg' -respect-parenthesis \( -clone 0 -colorspace gray -negate -lat 15x15+5\% -contrast-stretch 0 \) -compose copy_opacity -composite -opaque none +matte -modulate 100,100 -blur 1x1 -adaptive-sharpen 0x2 -negate -define morphology:compose=darken -morphology Thinning Rectangle:1x30+0+0 -negate  'file.name_7_preprocess.jpg'
Invalid Parameter - -respect-parenthesis

virantha commented 10 years ago

Hmm, I'm guessing your ImageMagick version is old, or it's not on your PATH (the convert utility comes from ImageMagick and is used to clean up the image a bit before OCR). I'll add a version check in the next release to clean up that warning.
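
In the meantime, a rough sketch of how such a check could look. This is not what pypdfocr does; the function names are hypothetical, and it assumes that ImageMagick's convert prints a banner containing the word "ImageMagick" when run with -version. One plausible cause on Windows is that a different program named convert.exe (the built-in disk conversion tool) is found on the PATH first, which would explain the "Invalid Parameter" message.

```python
import shutil
import subprocess


def is_imagemagick_banner(version_output):
    # Assumption: ImageMagick's "convert -version" output mentions "ImageMagick".
    return "ImageMagick" in version_output


def find_imagemagick_convert(executable="convert"):
    """Return the path of `executable` if it looks like ImageMagick, else None."""
    path = shutil.which(executable)  # first match on PATH, or None
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    output = result.stdout.decode("utf-8", errors="replace")
    return path if is_imagemagick_banner(output) else None


if __name__ == "__main__":
    found = find_imagemagick_convert()
    if found is None:
        print("No ImageMagick 'convert' found on PATH; preprocessing would be skipped")
    else:
        print("Using ImageMagick convert at %s" % found)
```

If this returns None even though ImageMagick is installed, checking the order of the PATH entries is a reasonable next step.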

virantha / pypdfocr

Windows - point not allowed in filename #19