rasendubi / uniorg

An accurate Org-mode parser for JavaScript/TypeScript
https://oleksii.shmalko.com/uniorg
GNU General Public License v3.0

`ZERO WIDTH SPACE` character is not recognized correctly #98

Closed RangHo closed 3 months ago

RangHo commented 8 months ago

As per the official documentation, I am using U+200B (ZERO WIDTH SPACE) to break up Org markup in the middle of a "word".

However, the current implementation seems to ignore its role as an escape character (or separator), as shown below.
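One possible explanation, offered only as a guess since I have not checked uniorg's source: JavaScript's `\s` character class does not include U+200B, so any boundary check that relies on `\s` alone would not see the zero width space as a separator. The sketch below demonstrates that behavior and shows a hypothetical boundary check (`POST_BOUNDARY` and `canCloseEmphasis` are illustrative names, not part of uniorg) that approximates Org's default "post" characters and lists U+200B explicitly.

```typescript
// Zero width space (U+200B), written as "X" in this report.
const ZWSP = "\u200B";

// JavaScript's \s class does not cover U+200B, so this is false.
const matchesWhitespaceClass: boolean = /\s/.test(ZWSP);

// Hypothetical check, NOT uniorg's actual code: a simplified approximation
// of the characters Org allows right after a closing emphasis marker
// (whitespace, - . , ; : ! ? ' " ) } [ \), with U+200B added explicitly.
const POST_BOUNDARY = /^[\s\u200B\-.,;:!?'")}\[\\]/;

// `rest` is the text that follows the closing marker on the same line.
function canCloseEmphasis(rest: string): boolean {
  // End of line also terminates emphasis.
  return rest.length === 0 || POST_BOUNDARY.test(rest);
}

console.log(matchesWhitespaceClass);            // false
console.log(canCloseEmphasis(ZWSP + "hello"));  // true
console.log(canCloseEmphasis("hello"));         // false
```

If the parser's check is built this way, adding U+200B to the boundary set would be the smallest change; whether that matches the real implementation is an assumption.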

The following Org markup, with literal X representing a zero width space character for clarity...

~code~Xhello

=verbatim=Xhello

**bold**Xhello

/italic/Xhello

...should be parsed as...

{
  "type": "org-data",
  "contentsBegin": 0,
  "contentsEnd": 62,
  "children": [
    {
      "type": "paragraph",
      "affiliated": {},
      "contentsBegin": 0,
      "contentsEnd": 13,
      "children": [
        {
          "type": "code",
          "value": "code",
          "children": []
        },
        {
          "type": "text",
          "value": "Xhello\n"
        }
      ]
    },
    {
      "type": "paragraph",
      "affiliated": {},
      "contentsBegin": 14,
      "contentsEnd": 31,
      "children": [
        {
          "type": "verbatim",
          "value": "verbatim",
          "children": []
        },
        {
          "type": "text",
          "value": "Xhello\n"
        }
      ]
    },
    {
      "type": "paragraph",
      "affiliated": {},
      "contentsBegin": 32,
      "contentsEnd": 47,
      "children": [
        {
          "type": "bold",
          "contentsBegin": 33,
          "contentsEnd": 39,
          "children": [
            {
              "type": "bold",
              "contentsBegin": 34,
              "contentsEnd": 38,
              "children": [
                {
                  "type": "text",
                  "value": "bold"
                }
              ]
            }
          ]
        },
        {
          "type": "text",
          "value": "Xhello\n"
        }
      ]
    },
    {
      "type": "paragraph",
      "affiliated": {},
      "contentsBegin": 48,
      "contentsEnd": 62,
      "children": [
        {
          "type": "italic",
          "contentsBegin": 49,
          "contentsEnd": 55,
          "children": [
            {
              "type": "text",
              "value": "italic"
            }
          ]
        },
        {
          "type": "text",
          "value": "Xhello"
        }
      ]
    }
  ]
}

...but the current implementation parses it as...

{
  "type": "org-data",
  "contentsBegin": 0,
  "contentsEnd": 62,
  "children": [
    {
      "type": "paragraph",
      "affiliated": {},
      "contentsBegin": 0,
      "contentsEnd": 13,
      "children": [
        {
          "type": "text",
          "value": "~code~Xhello\n"
        }
      ]
    },
    {
      "type": "paragraph",
      "affiliated": {},
      "contentsBegin": 14,
      "contentsEnd": 31,
      "children": [
        {
          "type": "text",
          "value": "=verbatim=Xhello\n"
        }
      ]
    },
    {
      "type": "paragraph",
      "affiliated": {},
      "contentsBegin": 32,
      "contentsEnd": 47,
      "children": [
        {
          "type": "text",
          "value": "**bold**Xhello\n"
        }
      ]
    },
    {
      "type": "paragraph",
      "affiliated": {},
      "contentsBegin": 48,
      "contentsEnd": 62,
      "children": [
        {
          "type": "text",
          "value": "/italic/Xhello"
        }
      ]
    }
  ]
}
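
For anyone reproducing this, here is a minimal sketch that builds the exact input with real U+200B characters in place of the X placeholders and checks it against the offsets shown in both trees. It assumes the four lines are separated by one blank line each and that there is no trailing newline, which is what the `contentsBegin`/`contentsEnd` values suggest. The helper name `paragraphStarts` is illustrative only.

```typescript
// Zero width space (U+200B), shown as "X" in the report.
const ZERO_WIDTH_SPACE = "\u200B";

// The four test lines, separated by blank lines, with X as a placeholder.
const template: string = [
  "~code~Xhello",
  "=verbatim=Xhello",
  "**bold**Xhello",
  "/italic/Xhello",
].join("\n\n");

// Replace every placeholder with the real character.
const input: string = template.replace(/X/g, ZERO_WIDTH_SPACE);

// Offsets at which each paragraph starts: 0, then after every blank line.
function paragraphStarts(text: string): number[] {
  const starts: number[] = [0];
  let idx = text.indexOf("\n\n");
  while (idx !== -1) {
    starts.push(idx + 2);
    idx = text.indexOf("\n\n", idx + 2);
  }
  return starts;
}

console.log(input.length);           // 62, the contentsEnd of org-data
console.log(paragraphStarts(input)); // [0, 14, 32, 48]
```

The resulting `input` string can then be passed to the parser to compare its output against the two trees above.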

ispringle commented 6 months ago

I suspect this is related to #51