tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

The role of underscore in the PDF file name #35

Closed tha-uzhavan closed 8 years ago

tha-uzhavan commented 8 years ago

The first step of 'do_ocr.py' is downloading the PDF. When I put underscores in place of the spaces in the PDF file name, it skips the download. For example:
Humorous Essays (starts download), Humorous_Essays (skips download). So the underscores need to be placed in the PDF file name before the PDF is split. http://shahriar.svbtle.com/underscores-in-python https://justalittlebrain.wordpress.com/2008/10/19/replace-space-with-undescore-in-filename/ I hope you will add this feature.

tshrinivasan commented 8 years ago

Are you adding the _ in file names manually?

It should download whatever the original file url is.

Explain more on this.

What is the requirement?

tha-uzhavan commented 8 years ago

Usually we never write underscores in place of the spaces in a file name, but I am manually writing underscores in the spaces to avoid the download by the do_ocr.py script. That is why I am asking: before do_ocr.py runs, a script should write the underscores into the names of the PDFs available in the OCR4wikisource folder.
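The request above could be sketched roughly as follows. This is a minimal illustration only, not part of OCR4wikisource: the helper names `underscore_name` and `rename_pdfs` are hypothetical, and it assumes the PDFs sit directly inside one folder.

```python
from pathlib import Path


def underscore_name(filename):
    """Replace every space in a filename with an underscore."""
    return filename.replace(" ", "_")


def rename_pdfs(folder):
    """Rename the PDFs in `folder` so that their names contain no spaces.

    Hypothetical helper: returns the new names of the files it renamed.
    A file is left alone if its name has no spaces or if the target
    name is already taken.
    """
    renamed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        target = pdf.with_name(underscore_name(pdf.name))
        if target != pdf and not target.exists():
            pdf.rename(target)
            renamed.append(target.name)
    return renamed
```

Calling `rename_pdfs(".")` from the working folder would turn "Humorous Essays.pdf" into "Humorous_Essays.pdf" and leave already underscored names untouched.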

tshrinivasan commented 8 years ago

I want to test this.

Can you give the URL of a book which has a space in its name?

tha-uzhavan commented 8 years ago

ஆரியர்க்கு முற்பட்ட தமிழ்ப்பண்பாடு (Tamil culture predating the Aryans) http://tamilvu.org/library/nationalized/pdf/17-kagovindan/aariyurkumurpattatamilpanpadu.pdf

The file is about 18 megabytes. Download it from the Tamil Virtual Academy (த.இ.க.க.) site and first give it the name above. Then test it. Eagerly awaiting..

tshrinivasan commented 8 years ago

The given PDF has no space or _ in its name.

Give the URL of a book that is already uploaded to Commons, so that I can set it as file_url in config.ini and test.

jayantanth commented 8 years ago

We have already used many files with spaces in the file name, and there was no issue. I have not checked with _ in the name.

tha-uzhavan commented 8 years ago

https://commons.wikimedia.org/wiki/Category:The_PDF_files_in_Tamil_without_OCR_conversion

This category holds the files that have been uploaded to Commons so far but have not yet been OCRed.

I added underscores to the names of many files before uploading them. So please download one, remove the underscores from its name, and test. It is worth noting that, whether or not we add underscores before uploading, the Commons software adds them automatically.

Google OCR should work even when the file we have locally has no underscores in its name. Ravi also thinks that names without underscores are easier to read.

tshrinivasan commented 8 years ago

Upload a file with just spaces in its name (without _) and give its URL.

ravidreams commented 8 years ago

Let me explain what happened.

Usually, we have spaces between words in filenames. If we upload a file as is, Commons automatically adds _ between words in the uploaded file's URL. If you download the uploaded file, you will get the filename with _ between words.

Info-farmer found that if we ourselves name the file with _ between words before uploading and keep that file in the local folder, the file download step is skipped when do_OCR is run. Since he felt this makes the process faster, he recommended it.
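The skip described here can be pictured with a small Python sketch. It is not the actual do_ocr.py code, which is not shown in this thread; the function names and the example.org URL are hypothetical. It assumes the tool derives the local filename from the last path segment of the file URL and skips the download when a file with that exact name is already present.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit


def local_name_from_url(file_url):
    """Return the filename taken from the last path segment of a URL.

    Percent-escapes are decoded; underscores in the URL are kept as is.
    """
    return unquote(posixpath.basename(urlsplit(file_url).path))


def needs_download(file_url, folder="."):
    """Return True when `folder` has no file named like the URL's filename."""
    local_path = os.path.join(folder, local_name_from_url(file_url))
    return not os.path.exists(local_path)


# Hypothetical URL in the underscore style that Commons produces.
url = "https://example.org/files/Humorous_Essays.pdf"
print(local_name_from_url(url))
```

Under these assumptions, a local copy named "Humorous Essays.pdf" does not match the name in the URL, so the download runs, while a copy named "Humorous_Essays.pdf" matches and the download is skipped.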

But this is a needless intervention in the regular naming convention. It also had unintended side effects, such as the File title coming out with underscores, which we then had to fix.

https://github.com/tshrinivasan/tools-for-wiki/issues/8

I have requested Info-farmer to just upload files with spaces between words and then let the tool download them when do_OCR is run. Too much customization may not be good.

tshrinivasan commented 8 years ago

Thanks.

Closing this for now.