ruediger / VobSub2SRT

Converts VobSub subtitles (.idx/.sub format) into .srt subtitles.
GNU General Public License v3.0
294 stars, 67 forks

Chinese character error #31

Closed: mwestphal closed this issue 11 years ago

mwestphal commented 11 years ago

Hello, I've been trying to use vobsub2srt to convert Chinese subs to SRT, using the following command: vobsub2srt --lang zh --tesseract-lang chi_sim subtitles

However, the conversion is not working well: a lot of characters are not recognized correctly, even though the font used in the VobSub is perfectly readable.

Here is the VobSub screenshot: http://img546.imageshack.us/img546/4601/u1xu.jpg

Here is the converted sub: http://img571.imageshack.us/img571/5273/iyx2.jpg

We can easily see that some characters have been simplified. Here are the sub/idx files: http://www.2shared.com/file/ZvL3xukf/subtitles.html http://www.2shared.com/file/1u7L35fD/subtitles.html

Is this normal? Is there a workaround?

capiscuas commented 11 years ago

I guess it's more of a tesseract problem than a VobSub2SRT one, since VobSub2SRT just automates the different tools. I suggest you also file a bug with the tesseract project to see what they tell you.

ruediger commented 11 years ago

The characters look like traditional Chinese writing, and you are using chi_sim, which is for simplified Chinese characters. Try chi_tra instead (you may need to install it first, e.g., tesseract-ocr-chi-tra on Debian/Ubuntu).
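Before switching language packs, it can help to check which ones tesseract actually has installed. The following is only a rough sketch in Python, assuming a tesseract build that supports the --list-langs option and prints one language code per line after a "List of available languages" header; the helper names parse_langs and installed_langs are made up for illustration.

```python
import shutil
import subprocess


def parse_langs(listing):
    """Extract language codes from `tesseract --list-langs` style output."""
    langs = []
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_langs():
    """Return the codes tesseract reports, or [] if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some versions print the list on stderr
        universal_newlines=True,
    )
    return parse_langs(result.stdout)
```

Calling installed_langs() and checking whether "chi_sim" or "chi_tra" is in the returned list shows whether the pack passed to --tesseract-lang is available at all. Because stderr is merged in, any warning lines tesseract prints would also end up in the list, so treat the result as a hint only.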

mwestphal commented 11 years ago

@ruediger thanks for the suggestion, but it is simplified Chinese. I've tried with chi_tra and there is more rubbish.

I will open an issue in the tesseract project.

Anyway, the easy workaround here is to hardcode the VobSub into the video, but that's not the point.

ruediger commented 11 years ago

Oh, my Chinese is a bit too rusty, I guess.

Please provide a link to the tesseract bug report. You can also dump images with the --dump-images flag if the tesseract devs ask for a sample.

mwestphal commented 11 years ago

http://code.google.com/p/tesseract-ocr/issues/detail?id=1002&q=chinese

mwestphal commented 11 years ago

Looks like I cannot provide the information they need, nor do the tests they suggest. If anyone from vobsub2srt wants to take the lead on this issue, please do.

ruediger commented 11 years ago

As I said, you can extract the subtitle images with --dump-images. Pick one and run the tests using the tesseract command-line program.
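Roughly, that two-step workflow could be scripted as in the Python sketch below. It only builds the two command lines from the flags mentioned in this thread plus tesseract's usual "image, output base, -l language" form. The function names and the image filename in the usage note are hypothetical; the actual names of the dumped images depend on what --dump-images writes on your system.

```python
def build_dump_cmd(basename, lang, tess_lang):
    """Command that reruns vobsub2srt and also dumps each subtitle image."""
    return [
        "vobsub2srt",
        "--dump-images",
        "--lang", lang,
        "--tesseract-lang", tess_lang,
        basename,  # basename of the .idx/.sub pair, without extension
    ]


def build_ocr_cmd(image, out_base, tess_lang):
    """Command that runs tesseract on one dumped image.

    tesseract writes the recognized text to out_base + ".txt".
    """
    return ["tesseract", image, out_base, "-l", tess_lang]
```

For example, subprocess.run(build_dump_cmd("subtitles", "zh", "chi_sim")) followed by subprocess.run(build_ocr_cmd("some-dumped-image.pgm", "sample", "chi_sim")) would reproduce the OCR step on a single image (the image name here is a placeholder). The image and the resulting sample.txt are the kind of sample the tesseract developers can work with.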