pip install multilingual-pdf2text
The library uses Tesseract, which can be installed by following its installation instructions:
from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging

logging.basicConfig(level=logging.INFO)


def main():
    # Describe the PDF to process and the Tesseract language code of its text
    pdf_document = Document(
        document_path='/Users/shahrukh/Desktop/multilingual-pdf2text/example/example.pdf',
        language='spa'
    )
    pdf2text = PDF2Text(document=pdf_document)
    # Extract the text content of the document
    content = pdf2text.extract()
    print(content)


if __name__ == "__main__":
    main()
Tesseract supports the following languages: <br/>
Code Language
* afr Afrikaans
* amh Amharic
* ara Arabic
* asm Assamese
* aze Azerbaijani
* aze_cyrl Azerbaijani - Cyrillic
* bel Belarusian
* ben Bengali
* bod Tibetan
* bos Bosnian
* bul Bulgarian
* cat Catalan; Valencian
* ceb Cebuano
* ces Czech
* chi_sim Chinese - Simplified
* chi_tra Chinese - Traditional
* chr Cherokee
* cym Welsh
* dan Danish
* deu German
* dzo Dzongkha
* ell Greek, Modern (1453-)
* eng English
* enm English, Middle (1100-1500)
* epo Esperanto
* est Estonian
* eus Basque
* fas Persian
* fin Finnish
* fra French
* frk German Fraktur
* frm French, Middle (ca. 1400-1600)
* gle Irish
* glg Galician
* grc Greek, Ancient (-1453)
* guj Gujarati
* hat Haitian; Haitian Creole
* heb Hebrew
* hin Hindi
* hrv Croatian
* hun Hungarian
* iku Inuktitut
* ind Indonesian
* isl Icelandic
* ita Italian
* ita_old Italian - Old
* jav Javanese
* jpn Japanese
* kan Kannada
* kat Georgian
* kat_old Georgian - Old
* kaz Kazakh
* khm Central Khmer
* kir Kirghiz; Kyrgyz
* kor Korean
* kur Kurdish
* lao Lao
* lat Latin
* lav Latvian
* lit Lithuanian
* mal Malayalam
* mar Marathi
* mkd Macedonian
* mlt Maltese
* msa Malay
* mya Burmese
* nep Nepali
* nld Dutch; Flemish
* nor Norwegian
* ori Oriya
* pan Panjabi; Punjabi
* pol Polish
* por Portuguese
* pus Pushto; Pashto
* ron Romanian; Moldavian; Moldovan
* rus Russian
* san Sanskrit
* sin Sinhala; Sinhalese
* slk Slovak
* slv Slovenian
* spa Spanish; Castilian
* spa_old Spanish; Castilian - Old
* sqi Albanian
* srp Serbian
* srp_latn Serbian - Latin
* swa Swahili
* swe Swedish
* syr Syriac
* tam Tamil
* tel Telugu
* tgk Tajik
* tgl Tagalog
* tha Thai
* tir Tigrinya
* tur Turkish
* uig Uighur; Uyghur
* ukr Ukrainian
* urd Urdu
* uzb Uzbek
* uzb_cyrl Uzbek - Cyrillic
* vie Vietnamese
* yid Yiddish
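
A code from the list above only works if the matching language data is installed for your local Tesseract. The sketch below is one way to check this before running an extraction. It is not part of multilingual-pdf2text: the helper names `parse_langs` and `installed_langs` are hypothetical, and it assumes the `tesseract` executable is on your `PATH` and supports the `--list-langs` option.

```python
import shutil
import subprocess


def parse_langs(listing):
    """Turn the text printed by `tesseract --list-langs` into a set of codes."""
    codes = set()
    for line in listing.splitlines():
        line = line.strip()
        # Language codes contain no whitespace; the header line
        # ("List of available languages ...") and any warnings do.
        if not line or len(line.split()) != 1:
            continue
        codes.add(line)
    return codes


def installed_langs():
    """Return the language codes available to the local Tesseract install."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Read both streams, since the listing may not always arrive on stdout.
    return parse_langs(result.stdout + "\n" + result.stderr)
```

For example, `"spa" in installed_langs()` tells you whether the Spanish example above can run on your machine. If the code is missing, install the corresponding Tesseract language data first.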