marko-js / htmljs-parser

An HTML parser that recognizes content and string placeholders and allows JavaScript expressions as attribute values
MIT License

Regular text in THAI identified as tag #75

cburatto closed 4 years ago

cburatto commented 4 years ago

I have the following string in English:

Before performing this bulk merge operation, you must have a recipient group and template in place.<BR>All bulk email communication sent through {1:product name} must meet the requirements defined in the Mass Email Messaging <a href='xxxxx' target='_blank'>Terms of Service</a>.<BR>

and a corresponding Thai translation:

ก่อนที่จะดำเนินการในการส่งจดหมายถึงผู้รับหลายคนนี้ คุณต้องมีกลุ่มผู้รับและแม่แบบอยู่แล้ว<BR>การรับส่งอีเมลทั้งหมดในปริมาณมากซึ่งส่งผ่าน {1:product name} ต้องเป็นไปตามข้อกำหนดที่กำหนดไว้ใน<a href='xxxxx' target='_blank'>เงื่อนไขการให้บริการ</a>ด้านการส่งข้อความอีเมลจำนวนมาก<BR>

The parser handles the English string correctly, but it incorrectly identifies the Thai text as a tag name. For example:

{
  type: 'openTag',
  tagName: 'ก่อนที่จะดำเนินการในการส่งจดหมายถึงผู้รับหลายคนนี้',
  tagNameExpression: undefined,
  emptyTagName: undefined,
  argument: undefined,
  params: undefined,
  pos: 0,
  endPos: 313,
  tagNameEndPos: 50,
  openTagOnly: false,
  selfClosed: false,
  concise: true,
  attributes: [Array],
  setParseOptions: [Function]
}

Is there a known configuration for Thai or other Unicode text, or a workaround I could use to eliminate this false positive?

Thanks

cburatto commented 4 years ago

The issue occurs with any text, not just Thai, so I might be missing some configuration. Here is a minimal example:

let parser = require('htmljs-parser').createParser(
  {
    onOpenTag: function(event) {
      console.log(event);
    }
  }
);

parser.parse('This is a test');

In this case, the result will be:

{
  type: 'openTag',
  tagName: 'This',
  tagNameExpression: undefined,
  emptyTagName: undefined,
  argument: undefined,
  params: undefined,
  pos: 0,
  endPos: 14,
  tagNameEndPos: 4,
  openTagOnly: false,
  selfClosed: false,
  concise: true,
  attributes: [
    {
      name: 'is',
      value: undefined,
      pos: 4,
      endPos: 7,
      argument: undefined
    },
    {
      name: 'a',
      value: undefined,
      pos: 7,
      endPos: 9,
      argument: undefined
    },
    {
      name: 'test',
      value: undefined,
      pos: 9,
      endPos: 14,
      argument: undefined
    }
  ],
  setParseOptions: [Function]
}

Is there any way I can avoid regular text being parsed this way?

DylanPiercey commented 4 years ago

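This is because parsing starts in concise mode by default (see https://markojs.com/docs/concise/#root-level-text). In concise mode the first word of a root-level line is read as a tag name and the remaining words as attributes, which matches the output above.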

I believe you can pass { concise: false } as a parse option to opt out of this.
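
For reference, here is a minimal sketch of how that suggestion might be wired up. It is not confirmed against the library source. It assumes the options object is the second argument to createParser and that text is reported through an onText listener with the text in event.value; the helper name collectEvents is hypothetical and exists only for this example. The helper takes createParser as a parameter so that it can be tried against the real module or a stand-in.

```javascript
// Sketch only: `createParser` is expected to be
// require('htmljs-parser').createParser.
// Assumption: parser options (e.g. { concise: false }) are passed as the
// second argument to createParser.
function collectEvents(createParser, source, options) {
  const events = [];

  const parser = createParser(
    {
      // Open tags, as in the onOpenTag handler used earlier in this thread.
      onOpenTag(event) {
        events.push({ type: 'openTag', tagName: event.tagName });
      },
      // Assumption: plain text arrives here, with the text in event.value.
      onText(event) {
        events.push({ type: 'text', value: event.value });
      }
    },
    options
  );

  parser.parse(source);
  return events;
}

// Possible usage (requires htmljs-parser to be installed):
// const { createParser } = require('htmljs-parser');
// console.log(collectEvents(createParser, 'This is a test', { concise: false }));
```

If the option works as described, the string should come back as text events rather than as an openTag event with tagName 'This'. Running the same call without the options argument gives a quick side-by-side comparison.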