The issue
This PR fixes an issue where rmate sometimes cuts off a few characters at the tail end of the standard input stream.
The issue comes up when the shell supports multi-byte characters, e.g. LC_ALL=en_US.UTF-8, and when a stream of such multi-byte characters is piped into rmate.
Steps to reproduce
Take the 12-character string:
¡Hola mundo!
In a UTF-8 environment, that 12-character string is 13 bytes long, because the leading ¡ is encoded as two bytes.
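The two lengths can be checked directly in Ruby. This is only a small sketch; it assumes the source file is read as UTF-8 (Ruby's default for source files since 2.0), and the variable names are illustrative.

```ruby
# Compare the character length and the byte length of the sample string.
# Assumption: this file is interpreted as UTF-8 (the default source encoding).
text = "¡Hola mundo!"

char_count = text.size      # counts characters
byte_count = text.bytesize  # counts bytes of the encoded string

puts "characters: #{char_count}"
puts "bytes:      #{byte_count}"

# The inverted exclamation mark is the only character that needs two bytes.
first_char_bytes = text[0].bytes
puts "bytes of first character: #{first_char_bytes.inspect}"
```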
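The issue comes up when you pipe ¡Hola mundo! into rmate using a shell which supports multi-byte characters. This opens a new TextMate window, which contains the following text: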
¡Hola mundo
Note the missing ! character.
Cause
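The root cause of the issue is related to how rmate handles its socket protocol internally. Both client and server code care only about raw bytes, which is perfectly fine as long as it happens consistently.
The same protocol also relies on a data: keyword, whose purpose is to convey the byte length of the payload: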
[…]
if(key == "data")
{
    bytesLeft = strtol(value.c_str(), NULL, 10);
    size_t dataLen = std::min((ssize_t)line.size(), bytesLeft);
    D(DBF_RMateServer, bug("Got data of size %zd (%zu in this packet)\n", bytesLeft, dataLen););
    records.back().accept_data(line.data(), line.data() + dataLen);
    line.erase(line.begin(), line.begin() + dataLen);
    bytesLeft -= dataLen;
    state = bytesLeft == 0 ? arguments : data;
}
[…]
This is where the inconsistency comes in: the rmate client is supposed to send the byte length of the payload at this point, but it actually sends the character length. Depending on which external character encoding the Ruby runtime assumes, the character length is not necessarily equal to the byte length. For ¡Hola mundo! read as UTF-8, the client announces 12 while the payload is 13 bytes long, so the server treats only the first 12 bytes as data and the trailing ! is lost.
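To make the effect of the mismatch concrete, here is a rough Ruby simulation of a receiver that trusts the announced length. It is not the TextMate server code; read_payload is a hypothetical helper that only mirrors the "take at most bytesLeft bytes" step from the excerpt above.

```ruby
# Hypothetical stand-in for the receiving side: consume exactly `announced`
# bytes from the incoming buffer, the way the excerpt uses bytesLeft.
def read_payload(buffer, announced)
  raw  = buffer.b                        # copy of the buffer, viewed as raw bytes
  take = [raw.bytesize, announced].min   # never read past the end of the buffer
  raw.byteslice(0, take).force_encoding(Encoding::UTF_8)
end

payload = "¡Hola mundo!"

# Announcing the character length (12) leaves the last byte unread.
truncated = read_payload(payload, payload.size)

# Announcing the byte length (13) delivers the whole payload.
complete = read_payload(payload, payload.bytesize)

puts truncated
puts complete
```

With the sample string, the first call loses exactly one byte, which happens to be the final single-byte character.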
Ruby usually chooses an external character encoding that fits the locale setting of the shell environment. If it's LC_ALL=C, Ruby maps it to Encoding::US_ASCII; in other cases, Ruby tends to choose a multi-byte encoding as its external encoding, which triggers the bug.
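The dependence on the assumed encoding can be illustrated without changing the locale. In the sketch below, force_encoding merely relabels the same 13 bytes; it stands in for whatever external encoding Ruby would pick when reading standard input, which is an assumption made for illustration.

```ruby
# The same 13 raw bytes, labelled with two different encodings.
raw = "¡Hola mundo!".b   # binary copy, as the bytes would arrive on stdin

as_ascii = raw.dup.force_encoding(Encoding::US_ASCII)  # what LC_ALL=C leads to
as_utf8  = raw.dup.force_encoding(Encoding::UTF_8)     # what en_US.UTF-8 leads to

puts "external encoding of this process: #{Encoding.default_external}"
puts "US-ASCII: size=#{as_ascii.size} bytesize=#{as_ascii.bytesize}"
puts "UTF-8:    size=#{as_utf8.size} bytesize=#{as_utf8.bytesize}"
```

Under a single-byte encoding, size and bytesize agree, which is why the bug stays hidden with LC_ALL=C.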
Fix
The fix is to use String#bytesize instead of String#size.
String#size returns the number of characters, which, in the above scenario, would not match what the protocol expects. String#bytesize returns exactly what it says on the tin.
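As a minimal sketch of the idea, and not the actual rmate source, a client could build the data: line as follows. The helper name data_header and the exact line format are assumptions made for illustration.

```ruby
# Hypothetical helper: builds the "data:" line that precedes the payload.
# Assumption: the header is the keyword, a space, the length, and a newline.
def data_header(payload)
  "data: #{payload.bytesize}\n"   # byte length, independent of the string's encoding
end

payload = "¡Hola mundo!"

fixed_header = data_header(payload)
buggy_header = "data: #{payload.size}\n"   # what the character length would announce

puts fixed_header
puts buggy_header
```

Because bytesize ignores how the string is labelled, the announced length stays correct whichever external encoding Ruby picked.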
Testing
I have double-checked the fix and can confirm that it works in the following environments:
rmate (bad idea in hindsight), LC_ALL set to either C or en_US.UTF-8.