lisamelton / video_transcoding

Tools to transcode, inspect and convert videos.
MIT License
2.39k stars 160 forks source link

popen handbrake_commend in text mode #263

Closed joshstaiger closed 5 years ago

joshstaiger commented 5 years ago

This fixes an issue with printing multi-byte UTF-8 characters to the console on Windows.

Basically, printing a bare UTF-8 continuation byte ("\xC3"), without any following bytes, to the Windows console raises an error in ruby:

ruby -e 's="\xc3"; print s'

Traceback (most recent call last):
        2: from -e:1:in `<main>'
        1: from -e:1:in `print'
-e:1:in `write': Input/output error @ io_write - <STDOUT> (Errno::EIO)

While this works:

ruby -e 's="\xc3\xb1"; print s' ñ

(Note this doesn't seem to be a problem on Unix platforms — at least not Linux/Mac).

In the transcoding script, we're currently reading output from HandbrakeCLI in binary mode one byte at a time and printing it to the console one byte at a time. Therefore if we encounter a multi-byte UTF-8 character, we have this problem.

By popen-ing the HandbrakeCLI in text mode, each_char iterates over multi-byte characters atomically, rather than iterating over each byte in the multi-byte sequence, thus avoiding the problem.

I'm not sure if you had a good reason for getting the HandbrakeCLI output in binary mode to begin with, so feel free to reject / go with an alternative fix as appropriate :)

lisamelton commented 5 years ago

@joshstaiger Thanks for submitting this patch! Nice detective work, sir. I'm glad it fixes the problem you were seeing and I don't dispute your logic. But... :)

...Two years ago @samhutchins changed that same code to use binary file mode when reading and writing console and log output from HandBrakeCLI. This eliminated redundant information and "console spew" on Windows by suppressing the EOL <-> CRLF conversion.

You can see that change here: https://github.com/donmelton/video_transcoding/commit/2859a278ed5fab2c2abd2b48a8210a14d9effef8

So, I want @samhutchins to review your patch to make sure that taking this won't cause a regression.

Also, we're using binary file mode now with all the other tools and all over the code so we need to think about whether we need to make changes elsewhere.

Sam?

joshstaiger commented 5 years ago

Hmm, interesting. I'm running the transcode script under wsl in Windows. In that case I'd assume that no EOL conversion is necessary - which might explain why I'm not seeing any "console spew". I don't have a regular Windows install outside wsl handy to test otherwise.

I had thought that maybe setting STDOUT to binary mode would fix the problem, but I'm still seeing the issue while writing to the console even if I do that (at least in ruby 2.5.1).

Alternatively, you could keep binary mode and specifically check for UTF-8 multi-byte prefix codes, buffering print output accordingly. eg: throw bytes into a buffer while their high-order bit is set before printing.

samhutchins commented 5 years ago

I'll test on a non WSL installation soon

samhutchins commented 5 years ago

Yeah, this re-introduces the console spew

I think the failure on multi-byte characters is worse than the spew, so I'd be fine with reverting my original change. In an ideal world, we'd buffer the output to get the best of all worlds

samhutchins commented 5 years ago

Also worth noting, there were several commits to change the read mode that would need reverting

lisamelton commented 5 years ago

@samhutchins Thanks for testing so quickly, Sam!

Yeah, spew is preferable to a crash. I wish we could avoid both but buffering, as @joshstaiger suggests, is lot harder than it looks because of the way HandBrakeCLI overwrites lines to update progress.

Of course, if anyone wants to submit a working patch for that, I will gladly review it! :)

lisamelton commented 5 years ago

@samhutchins ...

Also worth noting, there were several commits to change the read mode that would need reverting

Yeah, the changes to binary file mode are all over the code with several tools. Should we change all of them? And do you want me to work up a patch for that after I merge @joshstaiger's pull request?

samhutchins commented 5 years ago

We should be consistent, so I think we should undo the binary mode everywhere. I’ll have a go at a buffering solution sometime soon

samhutchins commented 5 years ago

Or @joshstaiger could do it, lol

joshstaiger commented 5 years ago

I just threw together https://github.com/donmelton/video_transcoding/pull/264, which seems to fix the crash without having to revert the binary mode.

There are probably some edge cases where it might drop a multi byte character if it occurs at the very end of the stream, but I'm not sure that's worth worrying about.

lisamelton commented 5 years ago

@joshstaiger Actually, it's the end-of-stream characters that were the problem for me dealing with the progress overwrite issue. But I'll try your patch out later this evening when I'm back at my development machine.

lisamelton commented 5 years ago

@joshstaiger OK, since I just took your patch from #264, I'm obviously going to reject this patch. But, boy, am I glad you submitted this in the first place! :)

Thanks, again!