purescript-contrib / purescript-parsing

A parser combinator library based on Parsec
https://pursuit.purescript.org/packages/purescript-parsing
BSD 2-Clause "Simplified" License
150 stars 52 forks source link

Regex for "match everything till eol" matches nothing #231

Open Hi-Angel opened 4 days ago

Hi-Angel commented 4 days ago

Describe the bug

Using a pretty simple ".*$" regexp results in No Regex pattern match. That is with noFlags flag that implies multiline: false, so $ should match something.

To Reproduce

Run this code:

module Main where

import Prelude

import Data.Either (Either(..))
import Data.String.Regex.Flags (noFlags)
import Effect (Effect)
import Effect.Console (logShow)
import Parsing (Parser, fail, runParser)
import Parsing.String (regex)

-- The regex from `Parsing` but with more convenient interface.
regexP :: String -> Parser String String
regexP re = case regex re noFlags of
  Left compileError -> fail $ "failed to compile regex: " <> compileError
  Right ret -> ret

s = """My
multiline
string
"""

main :: Effect Unit
main = do
  logShow $ runParser s $ regexP ".*$"

Expected behavior

The parser ought to return string My

Actual behavior

ParseError "No Regex pattern match" (Position { column: 1, index: 0, line: 1 }
garyb commented 4 days ago

You would need the multiline flag to be enabled for it to match "My" - but it should return "", because it'll be matching on $ but nothing else (. doesn't match newlines without dotAll - if you remove the trailing \n it should return "string").

I'm not really sure where the culprit is here, neither the parsing nor Data.String regex stuff looks wrong from an initial glance.

garyb commented 4 days ago

Oh actually, immediately after saying that I see it. The parser regex inserts a ^ at the start of the pattern so multiline or dotAll would be required for anything to match. I think it's correct that it only matches from the current parser position, but maybe the consequence of that needs emphasising in the comment for it (it does already mention it matches from the current position).

natefaubion commented 4 days ago

That is with noFlags flag that implies multiline: false, so $ should match something.

It seems like there's maybe a slight misunderstanding wrt JS RegExp semantics. Without the multiline flag $ always matches end of the string, not end of the line. Maybe you are expecting the inverse, where the multiline flag is where $ matches end of string, but it is in fact matching end of line? It's definitely confusing.

Hi-Angel commented 4 days ago

That is with noFlags flag that implies multiline: false, so $ should match something.

It seems like there's maybe a slight misunderstanding wrt JS RegExp semantics. Without the multiline flag $ always matches end of the string, not end of the line. Maybe you are expecting the inverse, where the multiline flag is where $ matches end of string, but it is in fact matching end of line? It's definitely confusing.

Oh, just as I was typing out, you clarified, thank you! Yes, I am not a web-programmer, and in all other languages "multline regex being true" means the opposite, i.e. when you consider whole string more or less as a single line. Okay, I see now, let me try that.

Hi-Angel commented 4 days ago

Oh actually, immediately after saying that I see it. The parser regex inserts a ^ at the start of the pattern so multiline or dotAll would be required for anything to match. I think it's correct that it only matches from the current parser position, but maybe the consequence of that needs emphasising in the comment for it (it does already mention it matches from the current position).

Sorry, I'm confused here 😅 So, the regex matches from current position, and for that it opaquely inserts a ^ to the beginning of the expression. This sounds like a bug on its own, because if a user have line1\n|line2 where the | marks the caret position, and then a user asks for ^line2, inserting the ^ will result in ^^line2.

natefaubion commented 4 days ago

It gets wrapped with ^(...). So If you add an extra ^, it's unnecessary, but benign. Matching from the current cursor position is the only thing that makes sense for a parser that needs to consume it's input. I'm not sure what it would mean for a regex combinator to match arbitrary locations. The JS API doesn't tell you where matches happened, so we need an anchor that assumes we can just jump based on the length of the match.

Hi-Angel commented 4 days ago

Please see my docs change proposal here

jamesdbrock commented 4 days ago

Thanks for the docs change proposal.

Also note that if you want “The regex from Parsing but with more convenient interface” then you can just use the Data.String.Regex module directly instead of using Parsing.

jamesdbrock commented 4 days ago

Also, I know this is just your test code, but note that in real code you shouldn't compile the regex during the parse for the reasons discussed in Parsing.String.regex.

-- The regex from `Parsing` but with more convenient interface.
regexP :: String -> Parser String String
regexP re = case regex re noFlags of
  Left compileError -> fail $ "failed to compile regex: " <> compileError
  Right ret -> ret
Hi-Angel commented 3 days ago

Thank you!

Also note that if you want “The regex from Parsing but with more convenient interface” then you can just use the Data.String.Regex module directly instead of using Parsing.

In the Parsing library context, the convenient interface is one that (in particular) returns a Parsing String String. The Data.String.Regex.regex returns Either, so still requires wrapping, and if I have to wrap, there's not much difference which one it's gonna be.

Also, I know this is just your test code, but note that in real code you shouldn't compile the regex during the parse for the reasons discussed in Parsing.String.regex.

Fair warning, because it's a snippet from an actual project. However, I've read the documentation paragraph referred (the discussion about constant vs dynamic generation in particular) but didn't grasp why should't I be compiling it in runtime.

Now that I'm experimenting with spy I see that it might be due to the fact that regexp doesn't get generated just once but on every call, perhaps that's what you were referring to? In that case though that's a problem, but not the one related to the code or the library, but one in the PureScript compiler. E.g. when I have this code:

-- a word at the beginning of line with optional whitespace
topLevelWord :: String -> Parser String String
topLevelWord s = regexP $ "^\\s*" <> s <> "\\b"
…
   topLevelWord "SomeFoo"

compiler should detect that it's a pure function-chain, so the calls can be replaced with a constant.

But it's part of the optimization process, and from what I understand the PureScript designers for some reason relegated optimizations to a separate project. So that means that if (or rather I'm hoping "when") my project comes to production, I'll have to use this separate project to produce the production code.

garyb commented 3 days ago

I see that it might be due to the fact that regexp doesn't get generated just once but on every call, perhaps that's what you were referring to?

That is what he meant, yeah.