nainiayoub / pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents
https://share.streamlit.io/nainiayoub/pdf-text-data-extractor/main/app.py
80 stars 48 forks

how to add new lang + how do you put it on web without html,css,js ? #6

Open Artinnavidgoli opened 1 year ago

NL-TCH commented 4 days ago

OK, hear me out:

  1. Add the language with its correct Tesseract abbreviation in app.py. For example, for Dutch I added 'Dutch': 'nld' on line 28. You can find the correct codes at https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
  2. Then download the data file for the language (Dutch in my case) from the following GitHub page: https://github.com/tesseract-ocr/tessdata/blob/4.1.0/nld.traineddata and put it in /usr/share/tesseract/tessdata:
ls /usr/share/tesseract/tessdata/
configs  eng.traineddata  nld.traineddata  tessconfigs
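
If you want to double-check both steps before restarting the app, here is a minimal sketch. It assumes the languages are kept in a plain dict of display name to Tesseract code, like the 'Dutch': 'nld' entry above. The names LANGUAGES, TESSDATA_DIR and missing_traineddata are made up for this example and are not taken from app.py, and the tessdata path may be different on your system.

```python
from pathlib import Path

# Hypothetical mapping in the style of the 'Dutch': 'nld' entry:
# display name -> Tesseract language code.
LANGUAGES = {"English": "eng", "Dutch": "nld"}

# Directory used in the comment above; adjust it for your own install.
TESSDATA_DIR = Path("/usr/share/tesseract/tessdata")


def missing_traineddata(languages, tessdata_dir):
    """Return the display names whose <code>.traineddata file is absent."""
    tessdata_dir = Path(tessdata_dir)
    return [
        name
        for name, code in languages.items()
        if not (tessdata_dir / f"{code}.traineddata").is_file()
    ]


if __name__ == "__main__":
    missing = missing_traineddata(LANGUAGES, TESSDATA_DIR)
    if missing:
        print("Missing traineddata for:", ", ".join(missing))
    else:
        print("All language data files found.")
```

Any language it reports as missing still needs its .traineddata file copied into the tessdata directory.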

Done :)