nextcloud / fulltextsearch

🔍 Core of the full-text search framework for Nextcloud
GNU Affero General Public License v3.0

Tesseract language selection #131

Closed peix2 closed 6 years ago

peix2 commented 7 years ago

Every time I see tesseract running, it is invoked with -l eng only. Is there an easy way to change that, or to use all available languages? A setting for this would be good.

Sanookmakmak commented 7 years ago

Just install the language pack you need (on Debian/Ubuntu the packages are named tesseract-ocr-<lang>):

apt-cache search tesseract
apt-get install tesseract-ocr-deu

Then you can select the language before OCRing the file
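
To confirm that a language pack was actually picked up after installing it, the output of `tesseract --list-langs` can be checked. The following is only a small sketch: `has_language` is a hypothetical helper name, and it assumes the usual output format of one language code per line (older tesseract versions print the list to stderr, hence the `2>&1` in the usage line).

```shell
# Hypothetical helper: succeeds if the language code given as $1 appears
# as a whole line in `tesseract --list-langs` output read from stdin.
has_language() {
    grep -qxF "$1"
}

# Usage (requires tesseract to be installed):
#   tesseract --list-langs 2>&1 | has_language deu && echo "deu is available"
```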

[Screenshot: language selection menu] https://cloud.githubusercontent.com/assets/24833757/22236179/45dc1040-e204-11e6-8f5a-7756e9536c76.png

peix2 commented 7 years ago

Hi Sanookmakmak,

I have no such menu. My guess is that it comes with the OCR app for Nextcloud, which I haven't installed. I'm using only the nextant app, with indexing of images enabled. The nextant admin settings have no language option, and in the server's process list I see tesseract running with "-l eng" while indexing. How can I change this parameter if there is no setting for it? Do you know where in the code (if that's possible) I can find and change it?

Finally, could you check and confirm which app the menu you showed comes with? Does it also affect the way tesseract is run for nextant?

Cheers

Px2

Sanookmakmak commented 7 years ago

Of course you are right, this menu belongs to the OCR app ;-)

I did a

grep -Ri tesseract /opt/solr

and it found

tika-parsers-1.13.jar

Inside the jar file is the file TesseractOCRConfig.properties with the content

tesseractPath=
language=eng
pageSegMode=1
maxFileSizeToOcr=2147483647
minFileSizeToOcr=0
timeout=120

I reckon this is what you are looking for.

https://wiki.apache.org/tika/TikaOCR
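
Changing the default could look roughly like the sketch below. `set_ocr_language` is a hypothetical helper that only rewrites the `language=` line of an already extracted properties file. The extract and repack steps are left as comments because the jar location under /opt/solr and the file's path inside the jar (shown here as org/apache/tika/parser/ocr/) are assumptions that should be checked against your own installation. A combined value such as `eng+deu` follows tesseract's `-l` syntax; whether your Tika version accepts it is also an assumption.

```shell
# Hypothetical helper: set the "language" key in an extracted
# TesseractOCRConfig.properties file.
#   $1 = path to the properties file
#   $2 = language value, e.g. "deu" or "eng+deu"
set_ocr_language() {
    props_file="$1"
    languages="$2"
    tmp_file="$(mktemp)" &&
    # Replace only the line that starts with "language="; all other keys stay untouched.
    sed "s/^language=.*/language=${languages}/" "$props_file" > "$tmp_file" &&
    mv "$tmp_file" "$props_file"
}

# Possible workflow (paths are assumptions, back up the jar first):
#   cp tika-parsers-1.13.jar tika-parsers-1.13.jar.bak
#   unzip tika-parsers-1.13.jar org/apache/tika/parser/ocr/TesseractOCRConfig.properties
#   set_ocr_language org/apache/tika/parser/ocr/TesseractOCRConfig.properties deu
#   zip -u tika-parsers-1.13.jar org/apache/tika/parser/ocr/TesseractOCRConfig.properties
#   ...then restart Solr so the modified jar is loaded.
```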

peix2 commented 7 years ago

You're the man.

Thanks!

ArtificialOwl commented 7 years ago

Add this to the wiki, or find a way to integrate the language selection into Nextant

Sanookmakmak commented 7 years ago

Add this to the wiki

https://github.com/nextcloud/nextant/wiki/And-some-more-...#change-ocr-language

Ark74 commented 7 years ago

Hi! In my file tesseractPath= is empty. Should I set it to tesseractPath=/usr/bin/tesseract?

The TikaOCR wiki page doesn't say much about it.

ArtificialOwl commented 7 years ago

From what I can see, Tika will find tesseract by itself if it is installed in its usual place.
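
A quick way to see whether tesseract is reachable without an explicit path is to look it up on the PATH. This sketch assumes that an empty tesseractPath makes Tika rely on the PATH of the user running Solr, which this thread does not confirm. `tool_on_path` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the command named in $1 can be found on the PATH.
tool_on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Usage: run this as the user that runs Solr.
#   if tool_on_path tesseract; then
#       echo "tesseract found at: $(command -v tesseract)"
#   else
#       echo "tesseract not on PATH, consider setting tesseractPath explicitly"
#   fi
```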

ArtificialOwl commented 6 years ago

Please use Full text search instead of Nextant.