Closed debanjandhar12 closed 2 years ago
start_pos and end_pos here are byte offsets, not character offsets.
e = new TextEncoder("utf-8")
// TextEncoder {encoding: 'utf-8'}
e.encode("我能做的任何我想要做到的事情")
// Uint8Array(42) [230, 136, 145, 232, 131, 189, 229, 129, 154, 231, 154, 132, 228, 187, 187, 228, 189, 149, 230, 136, 145, 230, 131, 179, 232, 166, 129, 229, 129, 154, 229, 136, 176, 231, 154, 132, 228, 186, 139, 230, 131, 133, buffer: ArrayBuffer(42), byteLength: 42, byteOffset: 0, length: 42, Symbol(Symbol.toStringTag): 'Uint8Array']
e.encode("我能做的任何我想要做到的事情").length
// 42
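If character positions are needed on the JavaScript side, one option is to convert the byte offsets after parsing. The following is only a sketch: byteOffsetToStringIndex is a hypothetical helper, not something the library provides, and it assumes the positions are UTF-8 byte offsets as described above. It encodes the text, decodes just the prefix up to the offset, and uses the length of that prefix as the string index.

```javascript
// Hypothetical helper (not part of the parser): map a UTF-8 byte offset
// to a JavaScript string index (counted in UTF-16 code units).
function byteOffsetToStringIndex(text, byteOffset) {
  const bytes = new TextEncoder().encode(text);
  if (byteOffset < 0 || byteOffset > bytes.length) {
    throw new RangeError("byte offset out of range");
  }
  // fatal: true makes decode() throw if the offset splits a multi-byte
  // character; ignoreBOM: true keeps a leading BOM so it is still counted.
  const decoder = new TextDecoder("utf-8", { fatal: true, ignoreBOM: true });
  // Decode only the bytes before the offset; the decoded length is the index.
  return decoder.decode(bytes.subarray(0, byteOffset)).length;
}

const text = "我能做的任何我想要做到的事情";
console.log(byteOffsetToStringIndex(text, 42)); // 14
```

Each character in this string takes 3 bytes in UTF-8, so the byte-based end_pos of 42 corresponds to string index 14. An offset that lands in the middle of a character (for example 1) throws instead of returning a misleading index.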
I see. Thanks a lot for the help. I looked into OCaml after posting the issue, and it seems its strings are arrays of 8-bit characters. So I guess this makes sense.
Description:
The end_pos value is calculated incorrectly when the input contains Chinese characters. Example:
Code:
Output:
The expected output was
[[["Plain","我能做的任何我想要做到的事情"],{"start_pos":0,"end_pos":14}]]
since the string "我能做的任何我想要做到的事情" has a length of 14 characters.