ruediger / VobSub2SRT

Converts VobSub subtitles (.idx/.sub format) into .srt subtitles.
GNU General Public License v3.0
293 stars, 65 forks

doesn't work with Tesseract 4 ? #67

Open Seegras opened 6 years ago

Seegras commented 6 years ago

It looks like it can't cope with tesseract 4's language data files:

open("/usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=4113088, ...}) = 0
read(3, "\30\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377"..., 4096) = 4096
write(2, "Failed loading language 'eng'\n", 30Failed loading language 'eng'
) = 30
write(2, "Tesseract couldn't load any lang"..., 39Tesseract couldn't load any languages!

Seegras commented 6 years ago

I'm pretty sure now this is the case. This is what happens when trying to compile it.

-- Performing Test TESSERACT_NAMESPACE - Failed
CMake Warning at CMakeModules/FindTesseract.cmake:56 (message):
  You are using an old Tesseract version. Support for Tesseract 2 is deprecated and will be removed in the future!
Call Stack (most recent call first):
  CMakeLists.txt:66 (find_package)

Yes, that's Tesseract 4 getting misidentified as Tesseract 2.

GustavoLafava commented 6 years ago

Simply add -std=gnu++11 to CMAKE_CXX_FLAGS in CMakeLists.txt.
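For readers unsure what that edit looks like, here is a minimal sketch. It assumes the line is placed before the `find_package` call for Tesseract, so that the flag is already set when the Tesseract detection checks are compiled. The placement and the surrounding contents of the project's CMakeLists.txt are assumptions, not something shown in this thread.

```cmake
# Sketch only: append the GNU C++11 dialect flag to the existing C++ flags.
# Assumed placement: before the find_package call for Tesseract in CMakeLists.txt.
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=gnu++11")
```

If an earlier configure run left cached results behind, cleaning the build tree before re-running CMake is probably needed for the change to take effect.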

Seegras commented 6 years ago

-- Build type: Debug
CMake Warning at CMakeModules/FindTesseract.cmake:56 (message):
  You are using an old Tesseract version. Support for Tesseract 2 is deprecated and will be removed in the future!

It still misidentifies Tesseract 4 as Tesseract 2, and compilation still fails, but for some other (or maybe exactly the same) reason:

[ 60%] Building CXX object src/CMakeFiles/vobsub2srt.dir/vobsub2srt.c++.o
/home/user/git/VobSub2SRT/src/vobsub2srt.c++: In function ‘int main(int, char**)’:
/home/user/git/VobSub2SRT/src/vobsub2srt.c++:218:3: error: ‘TessBaseAPI’ has not been declared
   TessBaseAPI::SimpleInit(tess_path, tess_lang, false); // TODO params
   ^~~
/home/user/git/VobSub2SRT/src/vobsub2srt.c++:220:5: error: ‘TessBaseAPI’ has not been declared
     TessBaseAPI::SetVariable("tessedit_char_blacklist", blacklist.c_str());
     ^~~
/home/user/git/VobSub2SRT/src/vobsub2srt.c++:275:20: error: ‘TessBaseAPI’ has not been declared
     char *text = TessBaseAPI::TesseractRect(image, 1, stride, 0, 0, width, height);
                  ^~~
/home/user/git/VobSub2SRT/src/vobsub2srt.c++:314:3: error: ‘TessBaseAPI’ has not been declared
   TessBaseAPI::End();
   ^~~
make[2]: *** [src/CMakeFiles/vobsub2srt.dir/build.make:63: src/CMakeFiles/vobsub2srt.dir/vobsub2srt.c++.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:173: src/CMakeFiles/vobsub2srt.dir/all] Error 2

bubonic commented 5 years ago

> Simply add -std=gnu++11 to CMAKE_CXX_FLAGS in CMakeLists.txt.

Had same issue as OP and this worked for me after adding it and doing a make distclean

olaquetal commented 5 years ago

> > Simply add -std=gnu++11 to CMAKE_CXX_FLAGS in CMakeLists.txt.
>
> Had same issue as OP and this worked for me after adding it and doing a make distclean

Works great! Thanks a lot.

marshalleq commented 5 years ago

Can this not be added to the CMakeLists.txt file here, so we don't have to add this manually?

bubonic commented 5 years ago

> Can this not be added to the CMakeLists.txt file here, so we don't have to add this manually?

There hasn't been an update on this git in several years. I cloned the repository and even added a few changes to the code at my own git site. I noticed with tesseract 4, different OEM engines produced significantly different results on subtitle files. I experimented with this and added a --tesseract-oem option to vobsub2srt in my git repository. I updated the README with what expectations you should have with various oem options. Play around with it a bit and test for yourself. Anyway, you can find the new VobSub2SRT git repository here:

https://github.com/bubonic/VobSub2SRT

I am in no way currently maintaining this project; the option was just an add-on I made for myself. I'll update my repository as I see fit and necessary.

jefro108 commented 3 years ago

> There hasn't been an update on this git in several years. I cloned the repository and even added a few changes to the code at my own git site. I noticed with tesseract 4, different OEM engines produced significantly different results on subtitle files. I experimented with this and added a --tesseract-oem option to vobsub2srt in my git repository. I updated the README with what expectations you should have with various oem options. Play around with it a bit and test for yourself. Anyway, you can find the new VobSub2SRT git repository here:
>
> https://github.com/bubonic/VobSub2SRT

@bubonic I added a fork of your repo to homebrew:

brew tap sammys/VobSub2SRT https://github.com/sammys/VobSub2SRT

and installed it by:

wget https://github.com/sammys/VobSub2SRT/raw/master/packaging/vobsub2srt.rb

brew install --HEAD vobsub2srt.rb

Not sure I needed the brew tap though

bubonic commented 3 years ago

@jefro108 I'm unsure how brew works, as I've never owned an Apple. I just cloned my git repo and compiled from source and everything was working as expected. Glad to hear you found a fork that works. Hope it benefits you! Best.

trufanov-nok commented 2 years ago

@bubonic I'm trying to use your fork but getting

$ vobsub2srt --tesseract-oem 0 video
Error: Tesseract (legacy) engine requested, but components are not present in /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata!!
Failed loading language 'eng'
Tesseract couldn't load any languages!
Failed to initialize tesseract (OCR).

Btw, could you open the Issues page for your github project?

bubonic commented 2 years ago

> @bubonic I'm trying to use your fork but getting
>
> $ vobsub2srt --tesseract-oem 0 video
> Error: Tesseract (legacy) engine requested, but components are not present in /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata!!
> Failed loading language 'eng'
> Tesseract couldn't load any languages!
> Failed to initialize tesseract (OCR).
>
> Btw, could you open the Issues page for your github project?

You need traineddata files that include the legacy engine components installed; see: https://tesseract-ocr.github.io/tessdoc/Data-Files

Also, the Issues page is now open.