tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

Consider using jpg files directly from archive.org #103

Closed by Shreeshrii 6 years ago

Shreeshrii commented 6 years ago

Right now, the split pages of a PDF are copied to a new PDF name and then converted to JPG. I think the intermediate copy step can be eliminated, which would speed up the process for large files.

Also, many files on archive.org already have a JPG version available, e.g. see https://archive.org/download/MudgalaPuranaPothiOrOblong

https://ia802801.us.archive.org/zipview.php?zip=/23/items/MudgalaPuranaPothiOrOblong/Mudgala%20Purana%20(Pothi%20or%20Oblong)_jp2.zip

https://ia802801.us.archive.org/zipview.php?zip=/23/items/MudgalaPuranaPothiOrOblong/Mudgala%20Purana%20%28Pothi%20or%20Oblong%29_jp2.zip&file=Mudgala%20Purana%20%28Pothi%20or%20Oblong%29_jp2%2FMudgala%20Purana%20%28Pothi%20or%20Oblong%29_0000.jp2&ext=jpg

So it might be possible to download the JPG files directly and skip the PDF-to-JPG conversion.
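As a rough illustration of the idea, here is a minimal Python sketch that builds a per-page JPG URL following the pattern of the second zipview link above. It is not code from OCR4wikisource. The function name `build_page_jpg_url` and its parameters are hypothetical, the host and the `/23/items/...` directory are specific to this one item, and the four-digit zero-padded page index is only inferred from the `_0000` in the example. The sketch builds the URL and does not download anything.

```python
from urllib.parse import quote


def build_page_jpg_url(host, zip_dir, stem, page):
    """Build a zipview.php URL asking for one page of a _jp2.zip as JPG.

    All parameters are assumptions taken from the single example URL in
    this issue:
      host    -- e.g. "ia802801.us.archive.org" (item specific)
      zip_dir -- e.g. "/23/items/MudgalaPuranaPothiOrOblong" (item specific)
      stem    -- file name stem, e.g. "Mudgala Purana (Pothi or Oblong)"
      page    -- zero-based page index, assumed to be zero-padded to 4 digits
    """
    # Path of the zip archive: keep "/" as is, percent-encode everything else.
    zip_path = quote(f"{zip_dir}/{stem}_jp2.zip", safe="/")
    # Path of the page inside the zip: here "/" is encoded too (%2F).
    file_path = quote(f"{stem}_jp2/{stem}_{page:04d}.jp2", safe="")
    return f"https://{host}/zipview.php?zip={zip_path}&file={file_path}&ext=jpg"


if __name__ == "__main__":
    print(build_page_jpg_url(
        "ia802801.us.archive.org",
        "/23/items/MudgalaPuranaPothiOrOblong",
        "Mudgala Purana (Pothi or Oblong)",
        0,
    ))
```

With the values shown, the function reproduces the example URL for page 0. Whether the same pattern holds for other items or pages is untested here, so treat it as a starting point only.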

tshrinivasan commented 6 years ago

Most of the PDF files come from commons.wikimedia.org, where they are available only as PDF.

I don't get the use case for getting files from archive.org.

Are you running OCR only on files from archive.org, and not on files from Commons?

Please share more details.

Shreeshrii commented 6 years ago

Today I was just trying to OCR a file from archive.org (without uploading it to Wikipedia).

I just saw https://github.com/tshrinivasan/google-ocr-python That might be a good candidate to try too.

Shreeshrii commented 6 years ago

> I don't get the use case for getting files from archive.org.

In that case, this is low priority. Closing the issue.