Regular text in THAI identified as tag

cburatto commented 4 years ago

I have the following string in English:

Before performing this bulk merge operation, you must have a recipient group and template in place.<BR>All bulk email communication sent through {1:product name} must meet the requirements defined in the Mass Email Messaging <a href='xxxxx' target='_blank'>Terms of Service</a>.<BR>

and a corresponding Thai translation:

ก่อนที่จะดำเนินการในการส่งจดหมายถึงผู้รับหลายคนนี้ คุณต้องมีกลุ่มผู้รับและแม่แบบอยู่แล้ว<BR>การรับส่งอีเมลทั้งหมดในปริมาณมากซึ่งส่งผ่าน {1:product name} ต้องเป็นไปตามข้อกำหนดที่กำหนดไว้ใน<a href='xxxxx' target='_blank'>เงื่อนไขการให้บริการ</a>ด้านการส่งข้อความอีเมลจำนวนมาก<BR>

The English string parses correctly, but the parser incorrectly identifies the Thai text as a tag. For example:

{
  type: 'openTag',
  tagName: 'ก่อนที่จะดำเนินการในการส่งจดหมายถึงผู้รับหลายคนนี้',
  tagNameExpression: undefined,
  emptyTagName: undefined,
  argument: undefined,
  params: undefined,
  pos: 0,
  endPos: 313,
  tagNameEndPos: 50,
  openTagOnly: false,
  selfClosed: false,
  concise: true,
  attributes: [Array],
  setParseOptions: [Function]
}

Is there a known configuration for Thai or other Unicode text, or a workaround I could use to eliminate this false positive?

Thanks

cburatto commented 4 years ago

The issue occurs with any text, not just Thai, so I might be missing some configuration. Here is an example:

// Log every open-tag event the parser emits.
const parser = require('htmljs-parser').createParser({
  onOpenTag(event) {
    console.log(event);
  }
});

// Plain text with no markup.
parser.parse('This is a test');

In this case, the result will be:

{
  type: 'openTag',
  tagName: 'This',
  tagNameExpression: undefined,
  emptyTagName: undefined,
  argument: undefined,
  params: undefined,
  pos: 0,
  endPos: 14,
  tagNameEndPos: 4,
  openTagOnly: false,
  selfClosed: false,
  concise: true,
  attributes: [
    {
      name: 'is',
      value: undefined,
      pos: 4,
      endPos: 7,
      argument: undefined
    },
    {
      name: 'a',
      value: undefined,
      pos: 7,
      endPos: 9,
      argument: undefined
    },
    {
      name: 'test',
      value: undefined,
      pos: 9,
      endPos: 14,
      argument: undefined
    }
  ],
  setParseOptions: [Function]
}

Is there any way I can avoid regular text being parsed this way?

DylanPiercey commented 4 years ago

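This is because parsing starts in concise mode by default (see https://markojs.com/docs/concise/#root-level-text). In concise mode, the first word of a root-level line is read as a tag name and the remaining words as attributes, which is why 'This is a test' produces a 'This' tag with the attributes 'is', 'a', and 'test'.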

I believe you can pass { concise: false } as a parse option to opt out of this.
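A minimal sketch of that suggestion follows. It assumes the option is accepted as a second argument to createParser; the thread does not confirm that placement. The helper name parseAsHtml is hypothetical, and the parser factory is passed in as a parameter so the sketch can be tried against htmljs-parser or a stand-in. The onText listener and its event.value field are also assumptions about the listener API.

```javascript
// Hypothetical helper: parse `source` with concise mode turned off and
// collect the text and open-tag events that the parser reports.
// ASSUMPTION: createParser(listeners, options) accepts { concise: false }.
function parseAsHtml(createParser, source) {
  const events = [];

  const listeners = {
    // ASSUMPTION: text events expose their content as `event.value`.
    onText(event) {
      events.push({ type: 'text', value: event.value });
    },
    onOpenTag(event) {
      events.push({ type: 'openTag', tagName: event.tagName });
    }
  };

  // Opt out of concise mode so root-level text is not read as a tag.
  const parser = createParser(listeners, { concise: false });
  parser.parse(source);

  return events;
}
```

With the real library this would be called as parseAsHtml(require('htmljs-parser').createParser, 'This is a test'). If the option behaves as described, the input should arrive through onText rather than onOpenTag.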

marko-js / htmljs-parser

Regular text in THAI identified as tag #75