tapmodo / node-ldif

Nodejs LDIF (LDAP Data Interchange Format) parser based on RFC2849

Definition of whitespace is broken #11

Closed jasonk closed 5 years ago

jasonk commented 6 years ago

This started out as a bug report about node-ldif choking on LDIF input that had multiple comments separated by blank lines. It turns out the real root cause of this problem is the way the library defines whitespace:

whitespace "WHITESPACE"
  = comment_line
  / [\s]* SEP

pegjs doesn't support regex shorthand character classes such as \s (pegjs/pegjs#247). Inside a bracket class it reads \s as an escaped literal "s", so the second alternative does not mean "lines containing only whitespace"; it means "lines containing only zero or more of the letter 's'".

You can see what happens if you look at the parser code that gets generated from that definition:

        peg$c93 = { type: "other", description: "WHITESPACE" },
        peg$c94 = /^[s]/,
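
To make the difference concrete, here is a small sketch that compares the generated class with what the grammar presumably intended. The names `generated`, `intended`, `samples` and `results` are made up for this illustration and are not part of node-ldif.

```javascript
// The class pegjs generated for [\s] (copied from the generated parser above).
const generated = /^[s]/;
// What the grammar author presumably intended (assumption).
const intended = /^\s/;

// A space, a tab, and the letter "s".
const samples = [" ", "\t", "s"];

const results = samples.map((ch) => ({
  ch: JSON.stringify(ch),
  generated: generated.test(ch),
  intended: intended.test(ch),
}));

console.log(results);
```

The generated class accepts only the letter "s" and rejects the space and the tab, which is the reverse of the intended behavior.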

According to https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp the \s character class is equivalent to [ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff] (note that the first character of that class is a literal space). Since pegjs does understand literal character classes, you should be able to just use that instead of \s and fix the problem (although you might want to leave \n\r out since those are part of SEP).
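
As a rough sketch of that suggestion, the second alternative of the rule could become something like `[ \f\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]* SEP`. The JavaScript below checks the same explicit class (with \n and \r left out, on the assumption that SEP handles them) against a few inputs. `inlineSpace` and `isBlank` are hypothetical names for this illustration, not library code.

```javascript
// Explicit expansion of \s, minus \n and \r (assumed to be covered by SEP).
const inlineSpace =
  /^[ \f\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]*$/;

// Hypothetical helper: true when a line (without its line ending)
// consists only of zero or more non-newline whitespace characters.
function isBlank(line) {
  return inlineSpace.test(line);
}

console.log(isBlank(" \t "));      // spaces and a tab
console.log(isBlank(""));          // empty line
console.log(isBlank("sss"));       // what the broken rule accepted
console.log(isBlank("# comment")); // handled by comment_line instead
```

Leaving \n and \r out keeps the class from consuming the line ending that SEP is supposed to match.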