otiai10 / gosseract

Go package for OCR (Optical Character Recognition), by using Tesseract C++ library
https://pkg.go.dev/github.com/otiai10/gosseract
MIT License
2.65k stars 286 forks

Heroku tesseract build pack support? #123

Open ansonl opened 6 years ago

ansonl commented 6 years ago

Is there a known configuration for using gosseract with one of the Heroku Tesseract buildpacks, such as https://github.com/Dkevs/heroku-buildpack-tesseract?

When compiling the Go app, Heroku gives this error:

tessbridge.cpp:5:31: fatal error: tesseract/baseapi.h: No such file or directory
remote: compilation terminated.

I've tried setting CGO_CFLAGS on Heroku with heroku config:set CGO_CFLAGS='-I ${build_dir}/tesseract/../', to no avail.

I see the example Heroku project uses Docker and installs libtesseract-dev. Is gosseract only tested with Docker, and can you recommend a buildpack that provides libtesseract-dev?

otiai10 commented 6 years ago

Hi, @ansonl

First, have you tried LD_LIBRARY_PATH?

I personally recommend using Docker for your Heroku application because it's more flexible and easier to handle.

I'll try the buildpack when I have time.

ansonl commented 6 years ago

Unfortunately I wasn't able to get the libtesseract-dev buildpack working. I ended up just calling the tesseract command through os/exec for a project that runs Tesseract a couple hundred times.
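For anyone who wants to try the same workaround, here is a minimal sketch of calling the tesseract binary as a child process. It is not the code from the project mentioned above: the helper names tesseractArgs and ocrViaCLI are made up for illustration, and it assumes a tesseract binary on PATH with the "eng" language data installed. It relies on the CLI convention that passing stdout as the output base prints the recognized text instead of writing a file.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// tesseractArgs (hypothetical helper) builds the CLI arguments for one OCR run.
// "stdout" as the output base makes tesseract print the text instead of
// writing a file; "-l" selects the language.
func tesseractArgs(imagePath, lang string) []string {
	return []string{imagePath, "stdout", "-l", lang}
}

// ocrViaCLI (hypothetical helper) runs the tesseract binary as a child
// process and returns the recognized text with surrounding whitespace trimmed.
func ocrViaCLI(imagePath, lang string) (string, error) {
	out, err := exec.Command("tesseract", tesseractArgs(imagePath, lang)...).Output()
	if err != nil {
		return "", fmt.Errorf("running tesseract on %s: %w", imagePath, err)
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	// Skip when the binary is missing so the sketch still runs anywhere.
	if _, err := exec.LookPath("tesseract"); err != nil {
		fmt.Println("tesseract not found on PATH; skipping")
		return
	}
	text, err := ocrViaCLI("002-confusing.png", "eng")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(text)
}
```

Because every call runs in its own short-lived process, whatever memory Tesseract allocates is returned to the operating system when that process exits, at the cost of process startup time on each call.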

That work also confirmed what looks like a memory leak around the Tesseract BaseAPI End() function. After calling End() and letting the client go out of scope, memory usage decreases slightly, but each client struct created still leaves a couple of megabytes allocated.

This can be seen by running the test program below:

package main

import (
    "fmt"
    "time"

    "github.com/otiai10/gosseract"
)

func main() {
    var count int
    var clients []*gosseract.Client

    // Create one client every 100 ms and run OCR with it, keeping all 20 open.
    for range time.Tick(100 * time.Millisecond) {
        client := gosseract.NewClient()
        client.SetImage("002-confusing.png")
        text, _ := client.Text()
        _, _ = client.HOCRText()
        fmt.Println(text)
        // Hello, World!
        count++

        clients = append(clients, client)

        if count == 20 {
            break
        }
    }

    // Wait briefly (about 20 ticks of 10 ms) before closing the clients.
    for range time.Tick(10 * time.Millisecond) {
        count--
        if count == 0 {
            break
        }
    }

    // Close every client, which should end each Tesseract BaseAPI instance.
    for _, c := range clients {
        c.Close()
    }

    // Keep the process alive so its memory usage can be observed.
    for range time.Tick(time.Second) {
    }
}

For some reason, after the clients are closed and the Tesseract BaseAPI End() method should have been called, memory usage remains elevated. I have tried calling the Go garbage collector functions and it seems to make no difference. The only way I have found to release the memory is to exit the program.

I looked through the .cpp file and did not see any bugs, so this may be a Tesseract library issue.
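One thing that may explain why the garbage collector makes no difference: memory allocated by the C++ library through cgo lives outside the Go heap, so the Go runtime neither tracks nor frees it. A rough way to see the difference is to print the Go heap figure next to the resident set size of the whole process. The sketch below is only an illustration, not part of gosseract; parseVmRSS is a made-up helper, and the RSS part assumes Linux, where /proc/self/status has a VmRSS line.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseVmRSS (hypothetical helper) extracts the resident set size in kB
// from the text of /proc/<pid>/status. ok is false when no usable VmRSS
// line is found.
func parseVmRSS(status string) (kb int64, ok bool) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "VmRSS:") {
			continue
		}
		fields := strings.Fields(line) // e.g. ["VmRSS:", "12345", "kB"]
		if len(fields) < 2 {
			return 0, false
		}
		n, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			return 0, false
		}
		return n, true
	}
	return 0, false
}

func main() {
	// Memory the Go runtime knows about; C/C++ allocations are not included.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("Go heap in use: %d kB\n", m.HeapInuse/1024)

	// Memory of the whole process, including allocations made by C/C++ code.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("no /proc/self/status (not Linux?); skipping RSS")
		return
	}
	if kb, ok := parseVmRSS(string(data)); ok {
		fmt.Printf("process RSS: %d kB\n", kb)
	}
}
```

Printing both numbers before creating the clients, after Close(), and after runtime.GC() would show whether the retained memory sits in the Go heap or on the native side. Note that a high RSS after End() does not by itself prove a leak, since the C allocator may keep freed memory for reuse instead of returning it to the operating system.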

otiai10 commented 6 years ago

@ansonl Thank you. Would you mind opening a separate issue for that, please?