Closed joshstaiger closed 5 years ago
@joshstaiger Thanks for submitting this patch! Nice detective work, sir. I'm glad it fixes the problem you were seeing and I don't dispute your logic. But... :)
...Two years ago @samhutchins changed that same code to use binary file mode when reading and writing console and log output from HandBrakeCLI
. This eliminated redundant information and "console spew" on Windows by suppressing the EOL <-> CRLF conversion.
You can see that change here: https://github.com/donmelton/video_transcoding/commit/2859a278ed5fab2c2abd2b48a8210a14d9effef8
So, I want @samhutchins to review your patch to make sure that taking this won't cause a regression.
Also, we're using binary file mode now with all the other tools and all over the code so we need to think about whether we need to make changes elsewhere.
Sam?
Hmm, interesting. I'm running the transcode script under wsl in Windows. In that case I'd assume that no EOL conversion is necessary - which might explain why I'm not seeing any "console spew". I don't have a regular Windows install outside wsl handy to test otherwise.
I had thought that maybe setting STDOUT to binary mode would fix the problem, but I'm still seeing the issue while writing to the console even if I do that (at least in ruby 2.5.1).
Alternatively, you could keep binary mode and specifically check for UTF-8 multi-byte prefix codes, buffering print output accordingly. eg: throw bytes into a buffer while their high-order bit is set before printing.
I'll test on a non WSL installation soon
Yeah, this re-introduces the console spew
I think the failure on multi-byte characters is worse than the spew, so I'd be fine with reverting my original change. In an ideal world, we'd buffer the output to get the best of all worlds
Also worth noting, there were several commits to change the read mode that would need reverting
@samhutchins Thanks for testing so quickly, Sam!
Yeah, spew is preferable to a crash. I wish we could avoid both but buffering, as @joshstaiger suggests, is lot harder than it looks because of the way HandBrakeCLI
overwrites lines to update progress.
Of course, if anyone wants to submit a working patch for that, I will gladly review it! :)
@samhutchins ...
Also worth noting, there were several commits to change the read mode that would need reverting
Yeah, the changes to binary file mode are all over the code with several tools. Should we change all of them? And do you want me to work up a patch for that after I merge @joshstaiger's pull request?
We should be consistent, so I think we should undo the binary mode everywhere. I’ll have a go at a buffering solution sometime soon
Or @joshstaiger could do it, lol
I just threw together https://github.com/donmelton/video_transcoding/pull/264, which seems to fix the crash without having to revert the binary mode.
There are probably some edge cases where it might drop a multi byte character if it occurs at the very end of the stream, but I'm not sure that's worth worrying about.
@joshstaiger Actually, it's the end-of-stream characters that were the problem for me dealing with the progress overwrite issue. But I'll try your patch out later this evening when I'm back at my development machine.
@joshstaiger OK, since I just took your patch from #264, I'm obviously going to reject this patch. But, boy, am I glad you submitted this in the first place! :)
Thanks, again!
This fixes an issue with printing multi-byte UTF-8 characters to the console on Windows.
Basically, printing a bare UTF-8 continuation byte ("\xC3"), without any following bytes, to the Windows console raises an error in ruby:
ruby -e 's="\xc3"; print s'
While this works:
ruby -e 's="\xc3\xb1"; print s'
ñ
(Note this doesn't seem to be a problem on Unix platforms — at least not Linux/Mac).
In the transcoding script, we're currently reading output from HandbrakeCLI in binary mode one byte at a time and printing it to the console one byte at a time. Therefore if we encounter a multi-byte UTF-8 character, we have this problem.
By popen-ing the HandbrakeCLI in text mode, each_char iterates over multi-byte characters atomically, rather than iterating over each byte in the multi-byte sequence, thus avoiding the problem.
I'm not sure if you had a good reason for getting the HandbrakeCLI output in binary mode to begin with, so feel free to reject / go with an alternative fix as appropriate :)