Support multi-byte chars piped via stdin

The issue

This PR fixes an issue where rmate sometimes cuts off a few characters at the tail end of the standard input stream.

The issue comes up when the shell environment uses a multi-byte character encoding, e.g. LC_ALL=en_US.UTF-8, and a stream containing multi-byte characters is piped into rmate.

Steps to reproduce

Take the 12-character string:

¡Hola mundo!

In a UTF-8 environment, that 12-character string is 13 bytes long:

$ LC_ALL=en_US.UTF-8
$ printf '¡Hola mundo!' | wc -cm
     12      13

The issue comes up when you pipe ¡Hola mundo! into rmate using a shell which supports multi-byte characters:

$ LC_ALL=en_US.UTF-8
$ printf '¡Hola mundo!' | rmate

This opens a new TextMate window, which contains the following text:

¡Hola mundo

Note the missing ! character.

Cause

The root cause lies in how rmate handles its socket protocol internally. Both client and server code care only about raw bytes, which is perfectly fine as long as both sides do so consistently.

The same protocol also relies on a data: keyword, whose purpose is to convey the byte length of the payload:

[…]
if(key == "data")
{
      bytesLeft = strtol(value.c_str(), NULL, 10);
      size_t dataLen = std::min((ssize_t)line.size(), bytesLeft);
      D(DBF_RMateServer, bug("Got data of size %zd (%zu in this packet)\n", bytesLeft, dataLen););
      records.back().accept_data(line.data(), line.data() + dataLen);
      line.erase(line.begin(), line.begin() + dataLen);
      bytesLeft -= dataLen;

      state = bytesLeft == 0 ? arguments : data;
}
[…]

This is where the inconsistency comes in: the rmate client is supposed to send the byte length of the payload at this point, but it sends the character length instead. Depending on which external character encoding the Ruby runtime assumes, the character length is not necessarily equal to the byte length.
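The effect of the mismatch can be reproduced without a socket. The following is a minimal sketch, not code from rmate or TextMate: read_payload is a hypothetical stand-in for the receiving side, which keeps exactly as many bytes as the data: header announced.

```ruby
# Hypothetical stand-in for the receiving side (not TextMate's actual code):
# it keeps exactly as many bytes as the data: header announced.
def read_payload(received, announced_length)
  received.byteslice(0, announced_length)
end

# "\u00A1" is the inverted exclamation mark, which takes 2 bytes in UTF-8.
payload = "\u00A1Hola mundo!"

truncated = read_payload(payload, payload.size)      # announces the character count
complete  = read_payload(payload, payload.bytesize)  # announces the byte count

puts truncated  # the trailing "!" is missing
puts complete   # the full string
```

Announcing 12 while sending 13 bytes leaves the last byte unread, which matches the missing ! in the steps to reproduce.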

Ruby usually chooses an external character encoding that matches the locale setting of the shell environment. With LC_ALL=C, Ruby uses Encoding::US_ASCII, where every byte counts as one character, so the two lengths agree. With most other locales, Ruby chooses a multi-byte encoding as its external encoding, which triggers the bug.
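To illustrate why the locale matters, here is a small sketch that tags the same bytes with two different encodings. It uses force_encoding to imitate what the external encoding does to data read from stdin; that is an assumption made for illustration, not how rmate itself reads its input.

```ruby
# The same 13 bytes, tagged with two different encodings.
raw = "\u00A1Hola mundo!"
as_utf8  = raw.dup.force_encoding(Encoding::UTF_8)
as_ascii = raw.dup.force_encoding(Encoding::US_ASCII)

puts Encoding.default_external  # varies with the locale of the shell
puts as_utf8.size       # character count under UTF-8
puts as_ascii.size      # every byte counts as one character
puts as_utf8.bytesize   # byte count, independent of the encoding tag
```

Under UTF-8 the string is 12 characters long, under US-ASCII it is 13, and the byte size is 13 in both cases. This is why the truncation does not show up with LC_ALL=C.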

Fix

The fix is to use String#bytesize instead of String#size.

String#size returns the number of characters, which, in the above scenario, would not match what the protocol expects.

String#bytesize returns the number of bytes, which is exactly what the data: field is meant to carry.
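As a rough sketch of the idea behind the fix: data_header below is a hypothetical helper, not taken from rmate, and it assumes that the header is written as data:, a space, the length, and a newline.

```ruby
# Hypothetical helper for illustration only, not taken from rmate.
# Assumes the header is written as "data: <length>" followed by a newline.
def data_header(payload)
  "data: #{payload.bytesize}\n"
end

multi_byte = data_header("\u00A1Hola mundo!")  # 12 characters, 13 bytes
ascii_only = data_header("Hello")              # 5 characters, 5 bytes

print multi_byte
print ascii_only
```

For ASCII-only input, size and bytesize are identical, so switching to bytesize does not change the behavior for input that already worked.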

Testing

I have double-checked the fix and can confirm that it works in the following environments:

Terminal.app using Ruby 1.9.3,
Terminal.app using Ruby 2.2.5,
Terminal.app using Ruby 2.3.1,
Terminal.app SSH’ed into a remote GNU bash 4.3.11,
the latter also with a 100 MB log file piped into rmate (bad idea in hindsight),
typing directly into a remote terminal on an Ubuntu 14.04 machine (GNU bash) over a TeamViewer connection,
typing directly into a remote terminal on a local Ubuntu VM using VMware Fusion,
all of the above with LC_ALL set to either C or en_US.UTF-8.
