Currently, any HTML element that the tokenizer cannot recognize is tokenized as a COMMENT token. Because the catalog is inconsistent in its phrasing, the tokenizer misses some of these cases. This ticket mostly involves going through comments.json to see which cases the tokenizer can address, and writing tests for them to ensure backwards compatibility with older majors (especially the XOM phrasing).
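A minimal sketch of the fallback behavior and the kind of regression test this ticket asks for. Everything here other than the COMMENT token name is an assumption for illustration: the `Token` type, the `RULES` table, the `COURSE` and `XOM` patterns, and the sample phrases are hypothetical and do not reflect the real tokenizer or the actual contents of comments.json.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Hypothetical rules. The real tokenizer's patterns and token names
# (other than COMMENT) are not specified in this ticket.
RULES = [
    ("COURSE", re.compile(r"^[A-Z]{2,4}\s?\d{3}$")),
    # Placeholder pattern standing in for the XOM phrasing.
    ("XOM", re.compile(r"^(?:choose|take|complete)\s+(?:\d+|one|two|three)\s+of\b",
                       re.IGNORECASE)),
]


def tokenize(element_text: str) -> Token:
    """Return the first matching token kind, falling back to COMMENT."""
    text = " ".join(element_text.split())  # collapse runs of whitespace
    for kind, pattern in RULES:
        if pattern.search(text):
            return Token(kind, text)
    return Token("COMMENT", text)


def still_comments(phrases: list[str]) -> list[str]:
    """Return the phrases that still fall through to COMMENT."""
    return [p for p in phrases if tokenize(p).kind == "COMMENT"]


def test_older_phrasing_is_still_recognized() -> None:
    # Regression test: phrasing that worked before must keep working
    # after new rules are added.
    assert tokenize("Choose  2 of the following").kind == "XOM"
    assert tokenize("See advisor for approval").kind == "COMMENT"
```

Running `still_comments` over the entries collected in comments.json after each new rule would show which phrasings remain unhandled, while tests like the one above guard against a new rule breaking older majors.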