doesn't support unicode and calls with unicode strings cause stream to hang

thekvn commented 8 years ago

Example:

> onn.k('show', '你好', function(err, d) { console.log(err, d) })
TypeError: value is out of bounds
    at checkInt (buffer.js:716:11)
    at Buffer.writeInt8 (buffer.js:864:5)
    at wb (/Users/kevin/dev/yb/ybend/node_modules/node-q/lib/c.js:370:5)
    at w (/Users/kevin/dev/yb/ybend/node_modules/node-q/lib/c.js:433:7)
    at w (/Users/kevin/dev/yb/ybend/node_modules/node-q/lib/c.js:450:7)
    at Object.serialize (/Users/kevin/dev/yb/ybend/node_modules/node-q/lib/c.js:473:2)
    at Connection.k (/Users/kevin/dev/yb/ybend/node_modules/node-q/index.js:125:11)
    at repl:1:9
    at REPLServer.defaultEval (repl.js:132:27)
    at bound (domain.js:254:14)
    at REPLServer.runBound [as eval] (domain.js:267:12)
    at REPLServer.<anonymous> (repl.js:279:12)
    at REPLServer.emit (events.js:107:17)
    at REPLServer.Interface._onLine (readline.js:214:10)
    at REPLServer.Interface._line (readline.js:553:8)
    at REPLServer.Interface._ttyWrite (readline.js:830:14)```

michaelwittig commented 8 years ago

To be honest, I was not even aware that supports non ASCII characters. But http://code.kx.com/wiki/Cookbook/Unicode knows better. So I have to look into this.

thekvn commented 8 years ago

Yeah it sure does.

Thanks for looking into this.

michaelwittig commented 8 years ago

I looked into it. The change is not simple.

daveonhols commented 8 years ago

Kdb doesn't really support unicode properly, depending on what you want to do this might not be worth bothering with. Strings in unicode are processed as raw bytes so count "你好" will return 6 (chinese chars are three bytes each). Furthermore, functions such as inter operate at the byte level, so something like "你好我叫德伟" inter "德伟" will probably not give what you might expect because of the actual raw bytes making up the characters.

michaelwittig commented 8 years ago

I'm als struggling with that because I now longer know the length of a string when counting bytes. That makes it more complicated and harder to predict so I at least assume that it comes with a certain cost when dealing with non ascii characters in q but I'm not an expert on that.

daveonhols commented 8 years ago

I think it ultimately comes down to the unicode spec from which you can infer how many bytes each character in a string requires by looking at the high order bits of each bytes (or something like that). So if you only cared about ASCII and Chinese, there are only two scenarios. From memory, ASCII is one byte and the highest order bit is zero, so the value of the byte is <128. For Chinese, the highest order bit is set, the char will be three bytes, you can gobble them all up and then move onto the next char... that is what I remember having done some time ago at least.

iuhh commented 8 years ago

Created a patch for this in our own use of node-q, as we needed unicode support.

BTW q supports unicode fine, see UTF8 spec. The complication comes from the fact JS uses UTF16 (or UCS2) internally so we need to convert from double byte to variable length. But since we are not dealing with the issue of whether counting in utf term or in human term, the approach below should work. Even for characters like 𝌆

btw Thanks Michael for your good work.

--- a/lib/c.js
+++ b/lib/c.js
@@ -51,7 +51,7 @@ function deserialize(b, nanos2date, flipTables, emptyChar2null) {
        return rInt8() === 1;
    }
    function rChar() {
-       var val = rInt8();
+       var val = rUInt8();
        if (val === 32 && emptyChar2null !== false) {
            return null;
        } else {
@@ -291,7 +291,7 @@ function deserialize(b, nanos2date, flipTables, emptyChar2null) {
            while (pos < n) {
                s += rChar();
            }
-           return s;
+           return decodeURIComponent(escape (s));
        }
        var A = new Array(n);
        var f = fns[t];
@@ -360,7 +360,11 @@ function serialize(x) {
                    return n;
                }
            case 'string':
-               return x.length + (x[0] === '`' ? 1 : 6);
+               {
+                   var encx = unescape(encodeURIComponent(x));
+                   var n=encx.length + (encx[0] === '`' ? 1 : 6);
+                   return n;
+               }
            case 'date':
                return 9;
            case 'symbol':
@@ -372,6 +376,10 @@ function serialize(x) {
        b.writeInt8(i, pos);
        pos += 1;
    }
+   function wub(i) {
+       b.writeUInt8(i, pos);
+       pos += 1;
+   }
    function wf(i) {
        b.writeDoubleLE(i, pos);
        pos += 8;
@@ -425,6 +433,7 @@ function serialize(x) {
                }
                break;
            case 'string':
+               x=unescape(encodeURIComponent(x));
                if (x[0] === '`') {
                    w(x.substr(1), 'symbol');
                } else {
@@ -432,7 +441,7 @@ function serialize(x) {
                    wb(0);
                    wn(4, x.length);
                    for (var i = 0; i < x.length; i++) {
-                       wb(x[i].charCodeAt());
+                       wub(x[i].charCodeAt());
                    }
                }
                break;

michaelwittig commented 8 years ago

@iuhh Thanks for your patch. Give me a few days and I will include your work till the end of the week

michaelwittig commented 8 years ago

Hi @iuhh I applied your patch and added an integration test. If you have time to look at branch https://github.com/michaelwittig/node-q/tree/issue17_2

start a q process listening on port 5000

$ q -p 5000

switch to the node-q directory and run the test

$ TZ=UTC ./node_modules/.bin/mocha -t 5000 itest/unicode.js

iuhh commented 8 years ago

Hi Michael,

The test did not run not successful for me


  unicode
    string
      1) count
      ✓ value
    symbol
      ✓ count
      2) value
    symbols
      ✓ count
      3) values

  3 passing (19ms)
  3 failing

  1) unicode string count:

      Uncaught AssertionError: res
      + expected - actual

      -6
      +2

      at itest/unicode.js:25:16
      at Connection.<anonymous> (index.js:129:4)
      at Socket.<anonymous> (index.js:69:11)
      at Socket.<anonymous> (_stream_readable.js:765:14)
      at emitReadable_ (_stream_readable.js:427:10)
      at emitReadable (_stream_readable.js:423:5)
      at readableAddChunk (_stream_readable.js:166:9)
      at Socket.Readable.push (_stream_readable.js:128:10)
      at TCP.onread (net.js:529:21)

  2) unicode symbol value:
     Uncaught Error: char
      at Socket.<anonymous> (index.js:56:12)
      at Socket.<anonymous> (_stream_readable.js:765:14)
      at emitReadable_ (_stream_readable.js:427:10)
      at emitReadable (_stream_readable.js:423:5)
      at readableAddChunk (_stream_readable.js:166:9)
      at Socket.Readable.push (_stream_readable.js:128:10)
      at TCP.onread (net.js:529:21)

  3) unicode symbols values:
     Uncaught Error: char
      at Socket.<anonymous> (index.js:56:12)
      at Socket.<anonymous> (_stream_readable.js:765:14)
      at emitReadable_ (_stream_readable.js:427:10)
      at emitReadable (_stream_readable.js:423:5)
      at readableAddChunk (_stream_readable.js:166:9)
      at Socket.Readable.push (_stream_readable.js:128:10)
      at TCP.onread (net.js:529:21)

michaelwittig commented 8 years ago

@iuhh thanks for your contribution. version 1.3 includes your patch. I made small changes to add unicode support for symbols as well.

michaelwittig / node-q

doesn't support unicode and calls with unicode strings cause stream to hang #17