textmate / rmate

Edit files from an ssh session in TextMate

Support multi-byte chars piped via stdin #50

Closed claui closed 7 years ago

claui commented 7 years ago

The issue

This PR fixes an issue where rmate sometimes cuts off a few characters at the tail end of the standard input stream.

The issue comes up when the shell's locale uses a multi-byte character encoding, e.g. LC_ALL=en_US.UTF-8, and a stream containing multi-byte characters is piped into rmate.

Steps to reproduce

Take the 12-character string:

¡Hola mundo!

In a UTF-8 environment, that 12-character string is 13 bytes long:

$ export LC_ALL=en_US.UTF-8
$ printf '¡Hola mundo!' | wc -cm
     12      13

The issue comes up when you pipe ¡Hola mundo! into rmate in a shell whose locale uses a multi-byte encoding:

$ export LC_ALL=en_US.UTF-8
$ printf '¡Hola mundo!' | rmate

This opens a new TextMate window, which contains the following text:

¡Hola mundo

Note the missing ! character.


The root cause of the issue is related to how rmate handles its socket protocol internally. Both client and server code care only about raw bytes, which is perfectly fine – as long as it happens consistently.

The same protocol also relies on a data: keyword, whose purpose is to convey the byte length of the payload:

if(key == "data")
{
      // The value of the data: key is the payload length in bytes.
      bytesLeft = strtol(value.c_str(), NULL, 10);
      size_t dataLen = std::min((ssize_t)line.size(), bytesLeft);
      D(DBF_RMateServer, bug("Got data of size %zd (%zu in this packet)\n", bytesLeft, dataLen););
      records.back().accept_data(line.data(), line.data() + dataLen);
      line.erase(line.begin(), line.begin() + dataLen);
      bytesLeft -= dataLen;

      // Keep reading payload bytes until the announced length is consumed.
      state = bytesLeft == 0 ? arguments : data;
}

This is where the inconsistency comes in: the rmate client is supposed to send the byte length of the payload at this point, but it sends the character length instead. Depending on which external character encoding the Ruby runtime assumes, the character length is not necessarily equal to the byte length. When it is smaller, the server stops reading early and the trailing bytes of the payload are lost.
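To illustrate the effect, here is a minimal Ruby sketch of a receiver that trusts the announced length. `receive_payload` is a hypothetical helper written for this example and is not part of rmate or TextMate; it only mimics the byte-counting behaviour of the server code quoted above.

```ruby
# Hypothetical helper (not part of rmate or TextMate): mimics a receiver
# that keeps exactly as many bytes as the sender announced.
def receive_payload(announced_length, raw_bytes)
  raw_bytes.byteslice(0, announced_length)
end

payload = "¡Hola mundo!" # 12 characters, 13 bytes in UTF-8

# Announcing the character count drops the last byte of the payload.
truncated = receive_payload(payload.size, payload)

# Announcing the byte count delivers the payload intact.
complete = receive_payload(payload.bytesize, payload)

puts truncated
puts complete
```

With a UTF-8 payload, the first call should lose the trailing `!`, which matches the symptom described in the steps to reproduce.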

Ruby usually chooses an external character encoding that fits the locale setting of the shell environment. If it's LC_ALL=C, Ruby maps it to Encoding::US_ASCII; in most other cases, Ruby chooses a multi-byte encoding as its external encoding, which triggers the bug.
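The dependency on the encoding can be checked in isolation. The sketch below tags the same 13 bytes with two encodings by hand, using `force_encoding`, instead of relying on the locale of the machine it runs on; the variable names are illustrative only.

```ruby
# The raw bytes of "¡Hola mundo!" in UTF-8, held as a binary string.
bytes = "\xC2\xA1Hola mundo!".b

# Tag copies of the same bytes with two different encodings.
as_utf8  = bytes.dup.force_encoding(Encoding::UTF_8)
as_ascii = bytes.dup.force_encoding(Encoding::US_ASCII)

# The encoding Ruby assigns to data read from stdin; depends on the locale.
puts Encoding.default_external

# Character counts differ, byte counts do not.
puts as_utf8.size
puts as_ascii.size
puts as_utf8.bytesize
puts as_ascii.bytesize
```

Under a single-byte encoding such as US-ASCII every byte counts as one character, so the two lengths agree and the bug stays hidden.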


The fix is to use String#bytesize instead of String#size.

String#size returns the number of characters, which, in the above scenario, would not match what the protocol expects.

String#bytesize returns the number of bytes regardless of the string's encoding, which is exactly what the protocol expects.
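As a rough sketch of the corrected sender side, assume a record that consists of a `data: <length>` line followed by the payload. `build_data_record` is a hypothetical helper and not rmate's actual code, and the exact wire format shown here is an assumption made for illustration.

```ruby
# Hypothetical helper (not rmate's actual code): builds a record made of a
# "data: <length>" line followed by the payload. The layout is an assumption.
def build_data_record(payload)
  "data: #{payload.bytesize}\n" + payload
end

record = build_data_record("¡Hola mundo!")

# Split the record back into its header line and body, as a receiver might.
header, body = record.split("\n", 2)
announced = header[/\d+/].to_i

puts header
puts announced == body.bytesize
```

Because the announced length is taken from `bytesize`, it matches the number of bytes that follow the header, whichever encoding the string is tagged with.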


I have double-checked the fix and can confirm that it works in the following environments:

sorbits commented 7 years ago

Thanks! I have merged and pushed a new version to https://rubygems.org/gems/rmate.