Closed Shreeshrii closed 6 years ago
Most of the PDF files come from commons.wikimedia.org, where they are available only as PDF.
I don't get the use case of getting files from archive.org.
Are you running OCR only on files from archive.org, and not on files from Commons?
Share more inputs.
Today I was just trying to OCR a file from archive.org (without uploading it to Wikipedia).
I just saw https://github.com/tshrinivasan/google-ocr-python and that might be a good candidate to try too.
> I don't get the use case of getting files from archive.org
In that case, this is low priority. Closing issue.
Right now the split pages of the PDF are copied to a new PDF file name, and only then converted to JPG. I think the intermediate copy step can be eliminated, which would speed up processing of large files.
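For illustration only, here is a rough sketch of converting a page range straight to JPG with no intermediate PDF copy. It assumes poppler's `pdftoppm` is installed, which may not be what the project uses today; the function names, file names, and the 300 dpi default are hypothetical.

```python
import subprocess


def pdftoppm_command(pdf_path, out_prefix, first_page=None, last_page=None, dpi=300):
    """Build a pdftoppm command that renders pages of pdf_path directly to JPG.

    Assumption: poppler's pdftoppm is on PATH. It reads the original PDF,
    so no per-page PDF copy has to be written first.
    """
    cmd = ["pdftoppm", "-jpeg", "-r", str(dpi)]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to render
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to render
    cmd += [pdf_path, out_prefix]       # output files are named from out_prefix
    return cmd


def convert_pages(pdf_path, out_prefix, first_page=None, last_page=None, dpi=300):
    """Run the conversion; raises CalledProcessError if pdftoppm fails."""
    subprocess.run(
        pdftoppm_command(pdf_path, out_prefix, first_page, last_page, dpi),
        check=True,
    )
```

Something like `convert_pages("book.pdf", "page", 1, 50)` would then write one JPG per page next to the PDF, and those could be fed to OCR as before.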
Also, many files on archive.org already have a JPG version of the file, e.g. see https://archive.org/download/MudgalaPuranaPothiOrOblong
https://ia802801.us.archive.org/zipview.php?zip=/23/items/MudgalaPuranaPothiOrOblong/Mudgala%20Purana%20(Pothi%20or%20Oblong)_jp2.zip
https://ia802801.us.archive.org/zipview.php?zip=/23/items/MudgalaPuranaPothiOrOblong/Mudgala%20Purana%20%28Pothi%20or%20Oblong%29_jp2.zip&file=Mudgala%20Purana%20%28Pothi%20or%20Oblong%29_jp2%2FMudgala%20Purana%20%28Pothi%20or%20Oblong%29_0000.jp2&ext=jpg
So, it might be possible to download the jpg files directly.
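As a rough sketch, the per-page JPG URL above could be built like this. The host, zip path, and the 4-digit page suffix are taken from this one example only; other items may sit on a different server or use different naming, so treat those as assumptions. The function names are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen


def zipview_jpg_url(host, zip_path, stem, page):
    """Build a zipview.php URL that serves one JP2 page of an item as JPG.

    Assumptions (from the single example URL in this thread): the zip holds
    a folder "<stem>_jp2" with pages named "<stem>_NNNN.jp2".
    """
    member = f"{stem}_jp2/{stem}_{page:04d}.jp2"
    return (
        f"https://{host}/zipview.php"
        f"?zip={quote(zip_path, safe='/')}"  # keep slashes in the zip path
        f"&file={quote(member, safe='')}"    # encode the slash inside the zip
        "&ext=jpg"
    )


def download_page(url, dest):
    """Save the JPG behind url to dest (needs network access)."""
    with urlopen(url) as resp, open(dest, "wb") as out:
        out.write(resp.read())


url = zipview_jpg_url(
    "ia802801.us.archive.org",
    "/23/items/MudgalaPuranaPothiOrOblong/Mudgala Purana (Pothi or Oblong)_jp2.zip",
    "Mudgala Purana (Pothi or Oblong)",
    0,
)
```

Looping `page` over the page count and calling `download_page(url, ...)` for each would fetch the JPGs without touching the PDF at all.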