jgm / skylighting

A Haskell syntax highlighting library with tokenizers derived from KDE syntax highlighting descriptions

Regex we can't parse #118

Closed jgm closed 3 years ago

jgm commented 3 years ago
 \((?=(?:[^  \\'"|()`ugeP]*+(?:[ug][0123456789]+|[ugeP](?::(?:[^:'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?::|(?=[|()]))|\[(?:[^]'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:\](?=[|()]))|{(?:[^}'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:}(?=[|()]))|<(?:[^>'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:>|(?=[|()]))|([^  <>|&;(){}'"`\\])(?:(?:(?!\1)[^'`"\\\1|()])*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:\1|(?=[|()]))|(?=\|))|\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:\)(?:[^  }<>|&;)]|(?:}+(?:[^*?#^~[  <>|&;()${}'"`\\]|(?=[}$'"`\\]))))|[  |(]|$))
jgm commented 3 years ago

This regex (from zsh.xml) uses the possessive quantifiers ++ and *+, as do an increasing number of syntax definitions. See https://stackoverflow.com/questions/4489551/what-is-double-plus-in-regular-expressions

We should support these so we can use the up-to-date syntax definitions.
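
For readers unfamiliar with possessive quantifiers, here is a minimal sketch of how they differ from greedy ones in a backtracking matcher. It is not skylighting's regex engine; the names `Matcher`, `satisfy`, `andThen`, `greedyStar`, `possessiveStar` and `fullMatch` are made up for illustration. A greedy star keeps its shorter matches available for backtracking, while a possessive star commits to the longest one.

```haskell
-- A matcher returns every possible remainder of the input,
-- most-preferred first (list-of-successes backtracking).
type Matcher = String -> [String]

-- Match a single character satisfying the predicate.
satisfy :: (Char -> Bool) -> Matcher
satisfy p (c:cs) | p c = [cs]
satisfy _ _            = []

-- Run the first matcher, then the second on each remainder.
andThen :: Matcher -> Matcher -> Matcher
andThen m n s = concatMap n (m s)

-- Greedy star: longest match first, shorter ones kept for backtracking.
-- Assumes m always consumes at least one character.
greedyStar :: Matcher -> Matcher
greedyStar m s = concatMap (greedyStar m) (m s) ++ [s]

-- Possessive star: commit to the longest match, discard the alternatives.
possessiveStar :: Matcher -> Matcher
possessiveStar m = take 1 . greedyStar m

-- True when some way of matching consumes the whole input.
fullMatch :: Matcher -> String -> Bool
fullMatch m = any null . m

main :: IO ()
main = do
  let a = satisfy (== 'a')
  print (fullMatch (greedyStar a `andThen` a) "aaa")     -- a*a  : True
  print (fullMatch (possessiveStar a `andThen` a) "aaa") -- a*+a : False
```

With `a*a` the star gives back one character so the final `a` can match; with `a*+a` it cannot, so the match fails.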

jgm commented 3 years ago

Wait, we DO supposedly support possessive quantifiers. So maybe that is not the issue here.

jgm commented 3 years ago

More regex parse failures:

in zsh.xml

\$\((?=\(((?:[^`'"()$]++|\$\{[^`'"(){}$]+\}|\$(?=[^{`'"()])|`[^`]*+`|\((?1)(?:[)]|(?=['"])))++)(?:[)](?=$|[^)])|["']))| 

\((?=(?:[^  \\'"|()`ugeP]*+(?:[ug][0123456789]+|[ugeP](?::(?:[^:'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?::|(?=[|()]))|\[(?:[^]'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:\](?=[|()]))|{(?:[^}'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:}(?=[|()]))|<(?:[^>'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:>|(?=[|()]))|([^    <>|&;(){}'"`\\])(?:(?:(?!\1)[^'`"\\\1|()])*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:\1|(?=[|()]))|(?=\|))|\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:\)(?:[^  }<>|&;)]|(?:}+(?:[^*?#^~[   <>|&;()${}'"`\\]|(?=[}$'"`\\]))))|[     |(]|$))

\((?=(?:[^  \\'"|()`ugeP]*+(?:[ug][0123456789]+|[ugeP](?::(?:[^:'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?::|(?=[|()]))|\[(?:[^]'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:\](?=[|()]))|{(?:[^}'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:}(?=[|()]))|<(?:[^>'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:>|(?=[|()]))|([^    <>|&;(){}'"`\\])(?:(?:(?!\1)[^'`"\\\1|()])*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:\1|(?=[|()]))|(?=\|))|\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:\)(?:[^  }<>|&;)]|(?:}+(?:[^*?#^~[   <>|&;()${}'"`\\]|(?=[}$'"`\\]))))|[     |(]|$))

\((?=\(((?:[^`'"()$]++|\$\{[^`'"(){}$]+\}|\$(?=[^{`'"()])|`[^`]*+`|\((?1)(?:[)]|(?=['"])))++)(?:[)](?=$|[^)])|["']))|

in python.xml

\w++(?=\s+(?!(?:if|for)\b)[\w'"~]|\s*+(?![.:=\]),]|(?:if|for)\b)((?:(?:ru|u?r|)(?:'(?:[^']++|\\')*+'|"(?:[^"]++|\\")*+")|(?:r?f|fr?)(?:'(?:[^'{]++|\\'|\{\{|\{[^}]++\})*+'|"(?:[^"]++|\\"|\{\{|\{[^}]*+\})*+")|[^#;(){}]|\(\)|\((?1)\)|\{\}|\{(\s*+(?:(?:(?:ru|u?r|)(?:'(?:[^']++|\\')*+'|"(?:[^"]++|\\")*+")|(?:r?f|fr?)(?:'(?:[^'{]++|\\'|\{\{|\{[^}]++\})*+'|"(?:[^"]++|\\"|\{\{|\{[^}]*+\})*+")|[a-zA-Z0-9.]++)\s*+:\s*+(?:(?:ru|u?r|)(?:'(?:[^']++|\\')*+'|"(?:[^"]++|\\")*+")|(?:r?f|fr?)(?:'(?:[^'{]++|\\'|\{\{|\{[^}]++\})*+'|"(?:[^"]++|\\"|\{\{|\{[^}]*+\})*+")|[^#;(){},]|\(\)|\((?1)\)|\{\}|\{(?2)\})++,?)*+)\})+?):)|

in bash.xml

\$\((?=\(((?:[^`'"()$]++|\$\{[^`'"(){}$]+\}|\$(?=[^{`'"()])|`[^`]*+`|\((?1)(?:[)]|(?=['"])))++)(?:[)](?=$|[^)])|["']))|

\((?=\(((?:[^`'"()$]++|\$\{[^`'"(){}$]+\}|\$(?=[^{`'"()])|`[^`]*+`|\((?1)(?:[)]|(?=['"])))++)(?:[)](?=$|[^)])|["']))|

\((?=\(((?:[^`'"()$]++|\$\{[^`'"(){}$]+\}|\$(?=[^{`'"()])|`[^`]*+`|\((?1)(?:[)]|(?=['"])))++)(?:[)](?=$|[^)])|["']))| 

It looks like the trailing pipe (an empty final alternative) is a feature of many of these...

jgm commented 3 years ago

(?1) is what fails. It is a subroutine call, which is supposed to match the regex inside the first capturing group: https://www.regular-expressions.info/subroutine.html
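
A rough sketch of what a subroutine call means in a backtracking matcher, under the assumption of a tiny made-up AST (the `Regex` type, `findGroup`, `match`, `balanced` and `fullMatch` below are hypothetical and not skylighting's actual definitions). A `Subroutine n` node looks up the body of capturing group n and matches it at the current position, which allows recursion such as the `\((?1)...` fragments in the regexes above.

```haskell
import Control.Applicative ((<|>))
import Data.Foldable (asum)

-- A tiny regex AST, for illustration only.
data Regex
  = Lit Char
  | Seq [Regex]
  | Alt Regex Regex
  | Group Int Regex   -- numbered capturing group
  | Subroutine Int    -- (?N): re-run the pattern of group N

-- Find the body of capturing group n anywhere in the pattern.
findGroup :: Int -> Regex -> Maybe Regex
findGroup n r = case r of
  Group k body
    | k == n    -> Just body
    | otherwise -> findGroup n body
  Seq rs  -> asum (map (findGroup n) rs)
  Alt a b -> findGroup n a <|> findGroup n b
  _       -> Nothing

-- Backtracking matcher: returns every possible remainder of the input.
-- The first argument is the whole pattern, used to resolve subroutines.
match :: Regex -> Regex -> String -> [String]
match top r s = case r of
  Lit c -> case s of
    (x:xs) | x == c -> [xs]
    _               -> []
  Seq rs       -> foldl (\rems x -> concatMap (match top x) rems) [s] rs
  Alt a b      -> match top a s ++ match top b s
  Group _ body -> match top body s
  Subroutine n -> maybe [] (\body -> match top body s) (findGroup n top)

-- (\((?:(?1)|)\)) : one pair of parentheses, optionally nested.
balanced :: Regex
balanced = Group 1 (Seq [Lit '(', Alt (Subroutine 1) (Seq []), Lit ')'])

fullMatch :: Regex -> String -> Bool
fullMatch r = any null . match r r

main :: IO ()
main = do
  print (fullMatch balanced "(())")  -- True
  print (fullMatch balanced "(()")   -- False
```

This sketch ignores captures entirely; a real implementation also has to decide what a subroutine call does to the captured groups.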

jgm commented 3 years ago

I've implemented Subroutine and that takes care of everything but this regex from zsh.xml.

 \((?=(?:[^        \\'"|()`ugeP]*+(?:[ug][0123456789]+|[ugeP](?::(?:[^:'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?::|(?=[|()]))|\[(?:[^]'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:\](?=[|()]))|{(?:[^}'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:}(?=[|()]))|<(?:[^>'`"\\|()]*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:>|(?=[|()]))|([^     <>|&;(){}'"`\\])(?:(?:(?!\1)[^'`"\\\1|()])*+(?:\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:\1|(?=[|()]))|(?=\|))|\\.|'[^']*'|`[^`]`|"(?:[^"\\`]*+(?:`[^`]`|\\.)?)*")?)*(?:\)(?:[^  }<>|&;)]|(?:}+(?:[^*?#^~[   <>|&;()${}'"`\\]|(?=[}$'"`\\]))))|[     |(]|$))
jgm commented 3 years ago

Haskell escaped for convenience:

" \\((?=(?:[^        \\\\'\"|()`ugeP]*+(?:[ug][0123456789]+|[ugeP](?::(?:[^:'`\"\\\\|()]*+(?:\\\\.|'[^']*'|`[^`]`|\"(?:[^\"\\\\`]*+(?:`[^`]`|\\\\.)?)*\")?)*(?::|(?=[|()]))|\\[(?:[^]'`\"\\\\|()]*+(?:\\\\.|'[^']*'|`[^`]`|\"(?:[^\"\\\\`]*+(?:`[^`]`|\\\\.)?)*\")?)*(?:\\](?=[|()]))|{(?:[^}'`\"\\\\|()]*+(?:\\\\.|'[^']*'|`[^`]`|\"(?:[^\"\\\\`]*+(?:`[^`]`|\\\\.)?)*\")?)*(?:}(?=[|()]))|<(?:[^>'`\"\\\\|()]*+(?:\\\\.|'[^']*'|`[^`]`|\"(?:[^\"\\\\`]*+(?:`[^`]`|\\\\.)?)*\")?)*(?:>|(?=[|()]))|([^ \t<>|&;(){}'\"`\\\\])(?:(?:(?!\\1)[^'`\"\\\\\\1|()])*+(?:\\\\.|'[^']*'|`[^`]`|\"(?:[^\"\\\\`]*+(?:`[^`]`|\\\\.)?)*\")?)*(?:\\1|(?=[|()]))|(?=\\|))|\\\\.|'[^']*'|`[^`]`|\"(?:[^\"\\\\`]*+(?:`[^`]`|\\\\.)?)*\")?)*(?:\\)(?:[^ \t}<>|&;)]|(?:}+(?:[^*?#^~[ \t<>|&;()${}'\"`\\\\]|(?=[}$'\"`\\\\]))))|[ \t|(]|$))"
jgm commented 3 years ago

The parser says:

Left "Failed reading: parse error at byte position 3"
jgm commented 3 years ago

Isolated the problem to

[^'`"\\\1|()]
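
For context: in PCRE, a backslash followed by digits inside a character class cannot be a backreference and is read as an octal character escape, so `\1` in this class denotes the character with code 1. The sketch below shows one way a class-escape parser could handle that case. It is an assumption about a possible fix, not the change actually made in skylighting, and `classEscape` is a hypothetical name. Other escapes such as `\t` or `\d` are deliberately not modelled.

```haskell
import Data.Char (chr, digitToInt, isOctDigit)

-- Parse the text following a backslash inside [...].
-- Returns the character denoted and the remaining input.
classEscape :: String -> Maybe (Char, String)
classEscape s
  -- Up to three octal digits denote a character code (e.g. \1 is code 1).
  | not (null digits) = Just (chr (foldl step 0 digits), drop (length digits) s)
  | otherwise = case s of
      (c:rest) -> Just (c, rest)  -- simplification: escaped char stands for itself
      []       -> Nothing
  where
    digits   = takeWhile isOctDigit (take 3 s)
    step n d = n * 8 + digitToInt d

main :: IO ()
main = print (classEscape "1|()]" == Just ('\x01', "|()]"))  -- True
```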