Wrong end_pos for Chinese characters

debanjandhar12 commented 2 years ago

Description:

The end_pos value is calculated incorrectly when the input contains Chinese characters.

Example:

Code:

let mldocsOptions = {
    "toc": false,
    "heading_number": false,
    "keep_line_break": false,
    "format": "Org",
    "heading_to_list": false,
    "exporting_keep_properties": false,
    "inline_type_with_pos": true,
    "export_md_remove_options": [],
    "hiccup_in_block": true
};

Mldoc.parseJson("我能做的任何我想要做到的事情",
    JSON.stringify(mldocsOptions),
    JSON.stringify({})
);

Output:

[[["Plain","我能做的任何我想要做到的事情"],{"start_pos":0,"end_pos":42}]]

The expected output is [[["Plain","我能做的任何我想要做到的事情"],{"start_pos":0,"end_pos":14}]], since the string "我能做的任何我想要做到的事情" has a length of 14.

RCmerci commented 2 years ago

start_pos and end_pos here are byte offsets into the UTF-8 encoded input, not character counts.

const e = new TextEncoder() // takes no encoding argument; the output is always UTF-8
// TextEncoder {encoding: 'utf-8'}
e.encode("我能做的任何我想要做到的事情")
// Uint8Array(42) [230, 136, 145, 232, 131, 189, 229, 129, 154, 231, 154, 132, 228, 187, 187, 228, 189, 149, 230, 136, 145, 230, 131, 179, 232, 166, 129, 229, 129, 154, 229, 136, 176, 231, 154, 132, 228, 186, 139, 230, 131, 133, buffer: ArrayBuffer(42), byteLength: 42, byteOffset: 0, length: 42, Symbol(Symbol.toStringTag): 'Uint8Array']
e.encode("我能做的任何我想要做到的事情").length
// 42
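
To make the byte-based interpretation concrete, here is a minimal sketch that slices the UTF-8 bytes with the reported positions and decodes the result. The helper name sliceByBytePos is hypothetical and not part of mldoc; the positions 0 and 42 are taken from the output in the report above.

```javascript
// Sketch only: assumes start_pos/end_pos are UTF-8 byte offsets,
// as described in this comment.
const text = "我能做的任何我想要做到的事情";
const bytes = new TextEncoder().encode(text);

// Hypothetical helper (not an mldoc API): decode the bytes between two byte offsets.
function sliceByBytePos(utf8Bytes, startPos, endPos) {
  return new TextDecoder().decode(utf8Bytes.subarray(startPos, endPos));
}

// The reported range 0..42 covers the whole string.
const plain = sliceByBytePos(bytes, 0, 42);

// Each of these characters takes 3 bytes in UTF-8, so bytes 0..3 hold the first one.
const firstChar = sliceByBytePos(bytes, 0, 3);
```

Slicing at an offset that falls in the middle of a multi-byte character would decode to replacement characters, so the offsets should only be used as reported.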

debanjandhar12 commented 2 years ago

I see. Thanks a lot for the help. I looked into OCaml after posting the issue, and it seems its strings are arrays of 8-bit characters (bytes), so byte-based positions make sense.
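
For anyone who needs JavaScript string indices rather than byte offsets, a conversion along these lines might work. This is a rough sketch, assuming the positions are UTF-8 byte offsets that fall on character boundaries; byteOffsetToStringIndex is a hypothetical helper, not something mldoc provides.

```javascript
// Hypothetical helper: map a UTF-8 byte offset to a JavaScript (UTF-16) string index.
// Assumes byteOffset lies on a character boundary.
function byteOffsetToStringIndex(str, byteOffset) {
  let bytesSeen = 0;
  let index = 0;
  for (const ch of str) { // for...of iterates by code point
    if (bytesSeen >= byteOffset) break;
    const cp = ch.codePointAt(0);
    // Number of bytes this code point occupies in UTF-8.
    bytesSeen += cp <= 0x7f ? 1 : cp <= 0x7ff ? 2 : cp <= 0xffff ? 3 : 4;
    // Number of UTF-16 code units this code point occupies in the string.
    index += ch.length;
  }
  return index;
}

const sample = "我能做的任何我想要做到的事情";
const endIndex = byteOffsetToStringIndex(sample, 42);
```

With the end_pos of 42 from the report, endIndex comes out as 14, which matches the length the original report expected.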

logseq / mldoc

Wrong end_pos for Chinese characters #120
