tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

issue - Filename too long #71

Open shabab opened 8 years ago

shabab commented 8 years ago

Issue

Most modern filesystems, including ext3 and ext4, limit a file or folder name to 255 bytes, or we can say 255 single-byte (ANSI) characters. If anyone uses a layered encryption filesystem (mostly eCryptfs, the Ubuntu default), this limit comes down.

Moreover, for Indic languages we use Unicode instead of ANSI / ASCII. Each character is encoded as several bytes (three in UTF-8 for Bengali script), so in some cases we can use only 80-85 Unicode characters in practice.

Some of the books in Wikisource have long names, e.g. 'বঙ্গের_জাতীয়ইতিহাস(কায়স্থ_কাণ্ড,_ষষ্ঠাংশ,_দক্ষিণরাঢ়ীয়_কায়স্থ_কাণ্ড,_প্রথম_খণ্ড).djvu'. So the temp folder name becomes very long with its prefix 'OCR' and timestamp suffix. When mkdir tries to make the directory, it throws an error, 'filename too long'.
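The limit is counted in bytes, not characters, which a short check makes concrete. This is a minimal sketch, assuming UTF-8 encoding and the 255-byte limit mentioned above. It is written for Python 3, and the helper names `name_bytes` and `fits_name_limit` are hypothetical, not taken from the script:

```python
NAME_LIMIT_BYTES = 255  # per-name limit on ext3/ext4; lower under eCryptfs


def name_bytes(name):
    """Return the size of a file or folder name once encoded as UTF-8."""
    return len(name.encode("utf-8"))


def fits_name_limit(name, limit=NAME_LIMIT_BYTES):
    """Return True if the encoded name fits within the given byte limit."""
    return name_bytes(name) <= limit


# A Bengali letter such as 'ক' takes 3 bytes in UTF-8, so 85 of them fill
# the whole 255-byte budget and any prefix pushes the name over the limit.
bengali_name = "ক" * 85
print(name_bytes(bengali_name))
print(fits_name_limit(bengali_name))
print(fits_name_limit("OCR" + bengali_name))
```

Checking the full temp folder name (prefix, book name and timestamp together) this way before calling mkdir would show in advance whether a book is affected.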

Possible solution:

I was fiddling around with the script and came up with the idea of separating the basename and the filename. My proposed solution is as follows.

do_ocr.py:109

    basename = os.path.basename(original_url)
    filename = basename[:80]  # limit the filename if longer than 80 chars

mediawiki_uploader.py:212

    pagename = basename.encode('utf-8') + "/" + indic_page_number
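One thing to watch with a plain slice is that it counts characters rather than bytes, and on a long name it also cuts off the extension. A byte-aware variant could look like the sketch below. It is written for Python 3, and the function names and the 200-byte default are assumptions chosen for illustration (leaving some room for the 'OCR' prefix and timestamp), not values from the script:

```python
import os


def truncate_to_bytes(name, max_bytes):
    """Cut name so that its UTF-8 form is at most max_bytes long."""
    encoded = name.encode("utf-8")[:max_bytes]
    # The byte slice may end inside a multi-byte character;
    # errors="ignore" drops that incomplete trailing character.
    return encoded.decode("utf-8", errors="ignore")


def short_filename(original_url, max_bytes=200):
    """Shorten the name part of the URL's basename and keep its extension."""
    basename = os.path.basename(original_url)
    stem, ext = os.path.splitext(basename)
    budget = max_bytes - len(ext.encode("utf-8"))
    return truncate_to_bytes(stem, budget) + ext


# Hypothetical URL used only to exercise the helper.
example = short_filename("https://example.org/" + "ক" * 100 + ".djvu")
print(len(example.encode("utf-8")))
```

The untruncated basename stays available for building the page name, as in the proposal above.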

This is a very rough idea, but I think you get my point.

Thanks.

tshrinivasan commented 8 years ago

I am also thinking the same: limit the name to 80 characters. But I am still thinking about how to proceed when a few books share a lengthy common name and differ only in a suffix such as "part1, part2, part3, etc", since the truncated names would then be identical.

Share your thoughts.

shabab commented 8 years ago

Yes, the common names are an issue. So I came up with another idea: a config variable for alternative filenames. I added a variable named 'filename_alt' to config.ini and put an alternative name for the file there, without any extension.

Then I added a condition in the script to check the filename length. If it exceeds 80, the script takes 'filename_alt' + filetype as the file name; otherwise it works as usual. So the filetype and pagename still come from the URL, and only the name string comes from the 'filename_alt' variable in config.ini.
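Roughly, the condition works like the sketch below. This is a simplified illustration for Python 3 and not the exact change: the section name `settings`, the function name `choose_filename`, and falling back to the original name when 'filename_alt' is empty are all assumptions made here.

```python
import configparser
import os

MAX_NAME_CHARS = 80  # threshold discussed in this thread


def choose_filename(original_url, config, section="settings"):
    """Pick the local file name: the URL's basename, or filename_alt if too long."""
    basename = os.path.basename(original_url)
    _, filetype = os.path.splitext(basename)  # the extension still comes from the URL
    if len(basename) > MAX_NAME_CHARS:
        alt = config.get(section, "filename_alt", fallback="").strip()
        if alt:
            return alt + filetype
    return basename  # short names, or no alternative configured


# Hypothetical config contents and URL, used only to exercise the function.
config = configparser.ConfigParser()
config.read_string("[settings]\nfilename_alt = long_book_part1\n")
print(choose_filename("https://example.org/" + "ক" * 90 + ".djvu", config))
print(choose_filename("https://example.org/short.djvu", config))
```

Because the alternative name is set per book in config.ini, books that share a long common name can each get a distinct short name.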

I have tried it with 3 books so far and all worked as expected. I can send you a pull request if you like.

tshrinivasan commented 8 years ago

Great.

Share the example book URLs and do a pull request.

Thanks.