feature request: dump only failed images.

fundies commented 10 years ago

ERROR: OCR failed for 1
ERROR: OCR failed for 23
ERROR: OCR failed for 133
ERROR: OCR failed for 367
ERROR: OCR failed for 367
ERROR: OCR failed for 386

Can you add an argument to dump only the images that failed to OCR? And, if possible, allow them to be opened in an external image editor so that I can be prompted on the CLI for a fix?

ruediger commented 10 years ago

I implemented this option in a branch for now. I'm not sure if it is really needed. I think opening them in an external image editor should be done in a GUI which could also use dump-images and simply parse the error output to get to the images.

Eventually I want to figure out how to extract more information from tesseract about the OCR process. It should provide some kind of confidence or error estimate. That would probably be even more useful than simply looking for images with complete OCR failure.

fundies commented 10 years ago

Cloning into 'VobSub2SRT'...
done.
==> Starting pkgver()...
==> Updated version: vobsub2srt-git v1.0pre6.36.gde90184-1
==> Starting build()...
Switched to a new branch 'origin/dump-error-images'
-- The C compiler identification is GNU 4.8.2

Maybe I did something wrong, but:

[greg@greg-desktop test]$ vobsub2srt --dump-error-images vobsub
ERROR: OCR failed for 1
ERROR: OCR failed for 23
ERROR: OCR failed for 133
ERROR: OCR failed for 367
ERROR: OCR failed for 367
ERROR: OCR failed for 386
Wrote Subtitles to 'vobsub.srt'
[greg@greg-desktop test]$ ls
vobsub.idx vobsub.srt vobsub.sub

As you can see, there are no images. Also, I wanted a prompt because vobsub2srt deletes the line it can't OCR and then shifts all the timecodes. So not only do I have to figure out the OCR, I also have to manually open the idx, get the timecodes, and then edit the line in. It's kinda annoying :P

ruediger commented 10 years ago

Could you provide me with a sample file? (e.g., via e-mail ruediger@c-plusplus.de) I have several VobSub samples but none for which OCR fails.

> Also I wanted a prompt because vobsub2srt deletes the line it can't ocr then shifts all the timecodes.

Shifts all the timecodes? That's strange. I'll have to test it. I guess the best way would be to write the error message to the SRT as well. That way a GUI tool could easily point to the part of the SRT that needs fixing.

fundies commented 10 years ago

I sent them. By "shifts timecodes" I mean that it completely deletes the empty line, i.e. if the failed line were line 21, it would make line 22 become line 21.

ruediger commented 10 years ago

Hmm, works for me.

$ ../build/bin/vobsub2srt --dump-error-images error-vobsub
ERROR: OCR failed for 1
ERROR: OCR failed for 23
ERROR: OCR failed for 133
ERROR: OCR failed for 367
ERROR: OCR failed for 367
ERROR: OCR failed for 386
Wrote Subtitles to 'error-vobsub.srt'
$ ls error-vobsub*
error-vobsub-001.pgm  error-vobsub-023.pgm  error-vobsub-133.pgm  error-vobsub-367.pgm
error-vobsub-386.pgm  error-vobsub.idx  error-vobsub.srt  error-vobsub.sub

Maybe you are calling an old version of vobsub2srt or haven't rebuilt it properly.
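For anyone who wants to script the "parse the error output to get to the images" idea, here is a minimal sketch. It assumes the dumped files follow the <basename>-<index>.pgm pattern with three-digit zero padding, which is inferred only from the listing above and not from the source code. The function name failed_image_names is made up for illustration.

```python
import re

# Matches the error lines shown in this thread, e.g. "ERROR: OCR failed for 23".
ERROR_RE = re.compile(r"^ERROR: OCR failed for (\d+)$")


def failed_image_names(error_output, basename):
    """Map each failed subtitle index to its assumed image file name.

    Indexes are returned in order of first appearance; repeated error
    lines (such as the doubled 367 above) are reported only once.
    """
    names = []
    seen = set()
    for line in error_output.splitlines():
        match = ERROR_RE.match(line.strip())
        if match is None:
            continue  # not an OCR error line, e.g. "Wrote Subtitles to ..."
        index = int(match.group(1))
        if index in seen:
            continue
        seen.add(index)
        # Assumption: <basename>-<3-digit index>.pgm, as in the ls output above.
        names.append(f"{basename}-{index:03d}.pgm")
    return names
```

A wrapper could capture the error stream of vobsub2srt, pass it to this function, and hand each resulting file name to an image viewer.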

ruediger commented 10 years ago

b70b6f584e8151f70f9d90df054af0911ea7475e should fix the shifting problem. It also writes an error message to the SRT in case of an OCR error.

Thanks for reporting that issue and providing me with the sample subtitles.

fundies commented 10 years ago

I got it, but I can't for the life of me figure out what's needed to open a pgm... Nothing I try can view it.

ruediger commented 10 years ago

PGM is a rather simple format. What operating system are you using? On Linux you should be able to enter xdg-open filename.pgm, which opens an appropriate image viewer if one is installed (e.g., KDE's Gwenview). If you have ImageMagick installed, you can also convert the file into a different format: convert filename.pgm filename.png converts it to PNG.

https://en.wikipedia.org/wiki/Portable_pixmap
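To illustrate how simple the format is, here is a rough sketch of a PGM header reader. It follows the general Netpbm layout (magic number, width, height, maximum gray value, separated by whitespace, with # comments allowed between them). Whether vobsub2srt writes the plain (P2) or binary (P5) variant is not stated in this thread, so the sketch accepts both. The function name read_pgm_header is hypothetical.

```python
def read_pgm_header(data):
    """Parse a PGM header from bytes and return (magic, width, height, maxval)."""
    tokens = []
    pos = 0
    while len(tokens) < 4:
        # Skip whitespace between header fields.
        while pos < len(data) and data[pos:pos + 1].isspace():
            pos += 1
        if pos >= len(data):
            raise ValueError("truncated PGM header")
        if data[pos:pos + 1] == b"#":
            # A comment runs to the end of the line.
            while pos < len(data) and data[pos:pos + 1] != b"\n":
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos].decode("ascii"))
    magic = tokens[0]
    if magic not in ("P2", "P5"):
        raise ValueError("not a PGM file: magic is %r" % magic)
    width, height, maxval = (int(token) for token in tokens[1:])
    return magic, width, height, maxval
```

Reading the first few hundred bytes of a dumped file and passing them to this function is enough to check the declared dimensions; a width or height of 0 would explain why a viewer refuses to open the file.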

fundies commented 10 years ago

Hmm, it appears the images it fails to OCR are corrupt? I can open the rest just fine :/

ruediger commented 10 years ago

Ah, ok. I was surprised that tesseract would simply return NULL for an OCR error but in fact it seems to be an error with the bitmap data. It seems the subtitle has a height of 0. Are those subtitles displayed when you watch them with MPlayer? Do they contain actual text?

fundies commented 10 years ago

Most of them are nothing, but occasionally it's a line :/. Watching in mplayer, everything displays fine.

ruediger commented 10 years ago

Ah, that's bad, because it means the problem is not in the mplayer code but in how I call the mplayer code. It will probably take me a while to figure this out. Are these the only subtitles you have with errors? There are only 6 frames with errors, so I guess you can work around that for now.

Sorry about that.

julien-nc commented 9 years ago

Hi, thanks for the work. Amazing tool to get rid of vobsub.

I had two or three missing lines in the sub I processed. I spotted them when I watched the movie, and I compared with the vobsub to make sure there really was a miss.

My problem is that those misses are not detected or reported during processing, even with the --dump-error-images option. The missing lines leave no trace in the srt file. There is apparently no way to detect those errors except watching the whole movie. Am I missing something?

If one day you feel like tackling this issue, here are my files.

vobsub:
http://pluton.cassio.pe/~demo/manhunter.idx
http://pluton.cassio.pe/~demo/manhunter.sub

result:
http://pluton.cassio.pe/~demo/manhunter.srt

You'll need the French tesseract data (the tesseract-ocr-fra package on Ubuntu).

One miss is between entries 753 and 754, at 01:04:42.
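Until such misses are reported by the tool itself, one possible way to look for them without watching the whole movie is to compare the timestamps in the .idx file with the start times in the .srt file. This is only a sketch under several assumptions: that each subtitle appears in the .idx as a "timestamp: HH:MM:SS:mmm, filepos: ..." line, that the SRT start time of a converted subtitle equals that timestamp within a small tolerance, and that the .idx holds a single language stream. None of this is confirmed for vobsub2srt in this thread, and all function names are made up.

```python
import re

# Assumed .idx entry format: "timestamp: 01:04:42:500, filepos: 000001000"
IDX_RE = re.compile(r"^timestamp:\s*(\d+):(\d{2}):(\d{2}):(\d{3})")
# SRT timing line: "01:04:42,500 --> 01:04:45,000"
SRT_RE = re.compile(r"^(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->")


def _start_times(text, pattern):
    """Return every start time in milliseconds matched by pattern, line by line."""
    times = []
    for line in text.splitlines():
        match = pattern.match(line.strip())
        if match:
            h, m, s, ms = (int(part) for part in match.groups())
            times.append(((h * 60 + m) * 60 + s) * 1000 + ms)
    return times


def missing_from_srt(idx_text, srt_text, tolerance_ms=50):
    """List .idx timestamps (in ms) that have no SRT entry starting nearby."""
    srt_times = _start_times(srt_text, SRT_RE)
    return [
        t
        for t in _start_times(idx_text, IDX_RE)
        if not any(abs(t - s) <= tolerance_ms for s in srt_times)
    ]


def format_ms(total_ms):
    """Format milliseconds as an SRT-style HH:MM:SS,mmm string."""
    seconds, ms = divmod(total_ms, 1000)
    minutes, s = divmod(seconds, 60)
    h, m = divmod(minutes, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
```

Printing format_ms for each value returned by missing_from_srt gives a list of positions to check by hand. Entries in the .idx that carry no visible text would also show up here, so the list is a set of candidates rather than confirmed misses.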

ruediger / VobSub2SRT

feature request: dump only failed images. #34