tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

Spaces in input pdf filename not handled correctly #102

Open Shreeshrii opened 6 years ago

Shreeshrii commented 6 years ago

While testing do_ocr_jpg.py v2 I came across a problem related to spaces in the original file name.

I made the following changes to the copy statement, wrapping the target name in single quotes:

command = "cp *.txt 'text-for-"+ original_filename + "'"
logger.info("Making a copy of all text files to 'text-for-"+ original_filename + "'")
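Hand-written single quotes still break if the file name itself contains a single quote. A minimal sketch of a more general approach, assuming the script builds a shell command string as above, is to let Python's `shlex.quote` do the quoting. The helper name `build_copy_command` is hypothetical and not part of the script.

```python
import shlex


def build_copy_command(original_filename):
    """Build the cp command with the target name safely quoted for the shell.

    Hypothetical helper for illustration; the real script may be structured
    differently.
    """
    target = "text-for-" + original_filename
    # shlex.quote wraps the name in single quotes when it contains spaces,
    # parentheses, or other characters the shell would interpret.
    return "cp *.txt " + shlex.quote(target)


if __name__ == "__main__":
    print(build_copy_command("Hanuman Chalisa.pdf"))
    print(build_copy_command("Mudgala Purana (Pothi or Oblong).pdf"))
```

The glob `*.txt` is left unquoted on purpose so that the shell still expands it; only the target name is quoted.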

The file I tested with:

https://ia800107.us.archive.org/3/items/Hanuman_Chalisa/Hanuman%20Chalisa.pdf

It is a 2 page pdf in Devanagari script.

Shreeshrii commented 6 years ago

With version 3 of the script:

Moving all temp files to OCR-Hanuman Chalisa.pdf-temp-2018-05-21-08-05-14

INFO:__main__:Running mv folder*.log currentfile.pdf  doc_data.txt pg*.pdf page* txt* *.jpg  "OCR-Hanuman Chalisa.pdf-temp-2018-05-21-08-05-14"
mv: cannot stat ‘page_00001.jpg’: No such file or directory
mv: cannot stat ‘page_00002.jpg’: No such file or directory
INFO:__main__:Merged all OCRed files to  all_text_for_Hanuman Chalisa.pdf.txt
INFO:__main__:Making a copy of all text files to text-for-Hanuman Chalisa.pdf
INFO:__main__:Running cp *.txt text-for-Hanuman Chalisa.pdf
cp: target ‘Chalisa.pdf’ is not a directory

The output folders are not created. All files stay in the main directory.
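In the log above, `cp` receives `text-for-Hanuman` and `Chalisa.pdf` as two separate arguments because the shell splits the unquoted name at the space. One way to sidestep shell word splitting entirely, sketched here under the assumption that the target is meant to be a folder named `text-for-<original file name>`, is to copy the files with the standard library instead of `cp`. The function name `copy_text_files` and the `source_dir` parameter are hypothetical.

```python
import glob
import os
import shutil


def copy_text_files(original_filename, source_dir="."):
    """Copy every .txt file in source_dir into 'text-for-<original_filename>'.

    Hypothetical helper; it assumes the target is a directory that should be
    created when missing.
    """
    target_dir = os.path.join(source_dir, "text-for-" + original_filename)
    # No shell is involved, so spaces and parentheses in the name are harmless.
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for path in sorted(glob.glob(os.path.join(source_dir, "*.txt"))):
        copied.append(shutil.copy(path, target_dir))
    return copied
```

Calling `copy_text_files("Hanuman Chalisa.pdf")` would then create the folder `text-for-Hanuman Chalisa.pdf` in the current directory and copy the text files into it.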

tshrinivasan commented 6 years ago

Thanks

Will do a few more tests on this and add it to the code.

Shreeshrii commented 6 years ago

Errors from another file, with v2 of the script:

mv: cannot stat 'page_01087.jpg': No such file or directory
mv: cannot stat 'page_01088.jpg': No such file or directory
mv: cannot stat 'page_01089.jpg': No such file or directory
INFO:__main__:Merged all OCRed files to  all_text_for_Mudgala Purana (Pothi or Oblong).pdf.txt
INFO:__main__:Making a copy of all text files to text-for-Mudgala Purana (Pothi or Oblong).pdf
INFO:__main__:Running cp *.txt text-for-Mudgala Purana (Pothi or Oblong).pdf
sh: 1: Syntax error: "(" unexpected
INFO:__main__:

Done. Check the text files start with text_for_page_

Edit: It looks like the error sh: 1: Syntax error: "(" unexpected has also been reported previously.
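The `sh:` prefix suggests the command string is handed to a shell, which then tries to interpret the parentheses in the file name. A possible alternative, shown only as a sketch since the script's actual command runner is not visible in this thread, is to pass the arguments as a list to `subprocess.run` so that no shell parses them. The helper name `run_without_shell` is hypothetical, and the demo command simply echoes its argument back through the Python interpreter.

```python
import subprocess
import sys


def run_without_shell(argv):
    """Run a command given as a list of arguments, bypassing the shell.

    Hypothetical helper: each list element reaches the program unchanged,
    so spaces and parentheses need no quoting.
    """
    return subprocess.run(argv, check=True, capture_output=True, text=True)


if __name__ == "__main__":
    name = "text-for-Mudgala Purana (Pothi or Oblong).pdf"
    # Demo: a child Python process prints the argument it received.
    result = run_without_shell(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", name]
    )
    print(result.stdout.strip())
```

Note that without a shell, wildcards such as `*.txt` are not expanded; they would need to be expanded in Python first, for example with `glob.glob`.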