tleyden / open-ocr

Run your own OCR-as-a-Service using Tesseract and Docker
Apache License 2.0

Error when trying to create TSV file #103

Open Gurey opened 6 years ago

Gurey commented 6 years ago

Trying to create a TSV gives me an error.

endpoint: localhost:9292/ocr

with payload:

{
  "img_url": "http://bit.ly/ocrimage",
  "engine": "tesseract",
  "engine_args": {
    "config_vars": {
        "tessedit_create_tsv": "1",
        "tessedit_pageseg_mode": "1"
    }
  }
}

Gives me: Error processing image url: . Error: Could not find outfile. Basename: /tmp/785ad93e-2721-4fd1-7892-b78ffc442ae0 Extensions: [txt hocr]

tesseract official TSV config: https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/configs/tsv
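
For anyone reproducing this from Go, the payload above could be built roughly as follows. This is a minimal sketch: the OcrRequest and EngineArgs types and the buildTSVRequest helper are hypothetical names chosen to mirror the JSON fields in this issue, not types taken from the OpenOCR source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EngineArgs and OcrRequest are hypothetical types that mirror the JSON
// payload shown in this issue; they are not copied from OpenOCR itself.
type EngineArgs struct {
	ConfigVars map[string]string `json:"config_vars"`
}

type OcrRequest struct {
	ImgURL     string     `json:"img_url"`
	Engine     string     `json:"engine"`
	EngineArgs EngineArgs `json:"engine_args"`
}

// buildTSVRequest returns the JSON body that asks tesseract for TSV output
// on the given image URL, using the same config_vars as the report above.
func buildTSVRequest(imgURL string) ([]byte, error) {
	req := OcrRequest{
		ImgURL: imgURL,
		Engine: "tesseract",
		EngineArgs: EngineArgs{
			ConfigVars: map[string]string{
				"tessedit_create_tsv":   "1",
				"tessedit_pageseg_mode": "1",
			},
		},
	}
	return json.Marshal(req)
}

func main() {
	body, err := buildTSVRequest("http://bit.ly/ocrimage")
	if err != nil {
		panic(err)
	}
	// The printed body can be POSTed to the /ocr endpoint.
	fmt.Println(string(body))
}
```

The resulting body can be sent to localhost:9292/ocr with any HTTP client; the sketch stops at building the JSON so that it runs without a server.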

tleyden commented 6 years ago

Does it work if you remove the "config_vars"?

The relevant code is here:

https://github.com/tleyden/open-ocr/blob/eb6a93145c41e0af1d96417eb7373fb183c61cc8/tesseract_engine.go#L233-L248

I wonder if tesseract is writing to a file extension that OpenOCR isn't expecting. It looks like OpenOCR only looks for .txt or .hocr.

Gurey commented 6 years ago

Hi! Thank you for getting back to me.

These two requests work fine:

{
  "img_url": "http://bit.ly/ocrimage",
  "engine": "tesseract",
  "engine_args": {
    "config_vars": {
        "tessedit_create_hocr": "1",
        "tessedit_pageseg_mode": "1"
    }
  },
  "psm": 3 
}

{
  "img_url": "http://bit.ly/ocrimage",
  "engine": "tesseract",
  "engine_args": {
    "config_vars": {}
  },
  "psm": 3 
}

I just ran tesseract in my terminal with the command tesseract pic.jpg out tsv, and that gives me a file named out.tsv.

tleyden commented 6 years ago

OK, I think the fix is pretty easy: it just needs to look for files with a .tsv extension as well. Are you able to submit a PR?

luzanikita commented 4 years ago

https://github.com/tleyden/open-ocr/pull/126