Open jamesdbrock opened 6 days ago
More comments from @garyb https://github.com/purescript-contrib/purescript-parsing/pull/232#discussion_r1804753425
It's just plain broken with
multiline
currently. It's advancingconsumed
by 7 for the"various"
match, but should advance it an additional 5 as that's the position that"various"
starts (I made sure to vary the line lengths for the example to reinforce how weird the behaviour is and have the resulting state start mid-line, as tried it with"foo\nbar\nbaz"
at first and almost ended up confusing myself :smile:)It is fixable if we want to support it though, we'd have to do something like perform a
search
to find the offset of the match and include that when updating theconsumed
position of the parser. Or most optimally, offer another function inpurescript-strings
that can report the offset along with the match, since the info is there in the returned object in the underlying JS, and then use that in the implementation here.
It would be nice to have the following parsers in Parsing.String.Basic
-- | Zero-width parser succeeds at beginning of input or following `\n` or `\r\n`.
lineBegin :: forall m. ParserT String m Unit
-- | Zero-width parser succeeds at `eof` or before `\n` or `\r\n`.
lineEnd :: forall m. ParsetT String m Unit
My opinion on disallowing multiline
comes from this:
I think personally I just view this parser as serving a different purpose than for what enabling
multiline
would give it. I view it as similar tosatisfy
, it's just there to define a range or pattern of acceptable characters. If I wanted to be able to skip ahead over an arbitrary number of lines before finding something I'd be doing that explicitly using other parser combinators.It's already a compromised regex interface too - you can't really use it the way you normally would since the groups in the pattern are inaccessible.
(I'm not saying I'm right in this opinion, just giving context for why I wouldn't miss
multiline
at all).
And one more thought: when multiline
is enabled it allows the parser to both skip and consume at the same time, which is unlike any of the other basic parsers I think. But then again, we're essentially embedding another parsing language here, so I guess it's not that different than running a compound parser built with combinators that could have the same effect.
On the side of allowing it, I guess it's harmless enough since it's something you have to opt into - it won't affect me whether we allow and fix it or not, because I'm unlikely to use it either way. :smile:
I'd probably go further and suggest we might need two regex functions, a "simple" one which is basically what we have now, where it returns the whole match, and we'd disallow multiline
, global
, etc. for it. The new one would be the "full" regex where it allows whatever options and returns the match groups in the parser result (albeit still would have the implict ^(...)
around the provided pattern).
Since all multiline
cons got copied here but not pros, let me mention them as well:
multiline
is the behavior users would expect from regexes (for side-readers: "multiline" means "a $
will match a newline", and the suggestions in this thread is to basically remove such behavior).$
regex.regex
over other parsers, so people will frequently use regex
.is the behavior users would expect from regexes
I understand your point here, but for my background it's not true, so you can't claim it universally. :smile:
@Hi-Angel how do you feel about my suggestion of the two different regex parsers?
is the behavior users would expect from regexes
I understand your point here, but for my background it's not true, so you can't claim it universally. 😄
Out of curiosity, where do you usually use regular expressions and how? I struggle to come up with a prolonged workflow that would result in a different expectation. Text editors, terminal utilities, most programming languages, office suites… they all default to what JS calls a "multiline behavior".
@Hi-Angel how do you feel about my suggestion of the two different regex parsers?
Well, renaming regex
would be an API break, so I presume if you do that then most likely it would be a regex
and regexFull
. Let's presume that the distinction will be mentioned in the first paragraph, to exclude possibility that a user would give up reading everything because "oh, TL;DR, I know how to use regexes, so let's skip to the example to see how to use the args and try it".
This still leaves the problem that regexFull
will be buggy.
Hence, I'm not really clear what such change would solve. If you know that you don't have motivation or time to fix multiline
and you want to make very explicit that this option is broken, up to the point you're ready to change the API — in that case the easiest and IMO most productive way would be to rename multiline
to multilineBroken
, and of course add the documentation explaining the situation.
in that case the easiest and IMO most productive way would be to rename
multiline
tomultilineBroken
, and of course add the documentation explaining the situation.
I think in that case it's still worth mentioning somewhere at the beginning that the default behavior of $
is matching end of string and not end of a line. Because otherwise a user will spend time figuring out why matching doesn't work, before realizing they'd have to pass multilineBroken
, and then going to read the explanation of why it's "broken".
UPD: or maybe even better: adding an example with multilineBroken
, which a user would surely notice, and then would go reading why it is "broken".
-- | Matches `\n` or `\r\n`.
newline :: forall m. ParserT String m String
newline = string "\n" <|> string "\r\n"
This newline
parser is also one which we could consider adding. Maybe this is more useful than lineBegin
and lineEnd
. Certainly the implementation is simpler.
I mostly agree with @garyb, but I'd maybe go a bit further:
multiline
, global
, sticky
, and anchors (^
and $
) fall outside the scope of semantics purescript-parsing can support. We really need ^
to mean start of the string to make it work for the primary case. At the very least, I think we should document this note. Using them will result in undefined matching behavior. If we wanted a robust solution, we could actually validate the input regular expression, but that would be quite complicated and probably not worth it.^
) and should come with a big "DO NOT USE" warning. Why offer it then?And while I sympathize with the fact that JS regular expression semantics can be unintuitive, I don't think it's really the job of purescript-parsing to educate users about those semantics, or try to paper over those semantics in any way. It's FFI. It is what it is, and the note about undefined behavior should suffice. If anything, I think this is good evidence that baking in regular expression support (in any library) is problematic, and should really go in something like a purescript-parsing-js-regexp
library (same goes for Data.String.Regex
being in purescript-strings-js-regexp
).
Out of curiosity, where do you usually use regular expressions and how? I struggle to come up with a prolonged workflow that would result in a different expectation. Text editors, terminal utilities, most programming languages, office suites… they all default to what JS calls a "multiline behavior".
I think I just don't run into many situations where I need to match line boundaries. I think the last time I did was probably a year or two ago. :smile:
And the behaviour doesn't surprise me in here because JavaScript was my first programming language over 20 years ago and I've worked with it on and off ever since. :wink:
If you know that you don't have motivation or time to fix
multiline
and you want to make very explicit that this option is broken
That's not my point at all, the fix is trivial (see my note above that James copied over from the PR). It's more about the expectations of what regex
is intended for. Given that nobody has encountered this bug in the 2.5 years it's been defined this way, and that the current definition can't take advantage of regex groups, I don't imagine it's being used to do much heavy lifting in the wild (understandably, this is a parser combinator library after all), so the kind of thing your suggesting is essentially a new use case for regular expressions in the parser.
I think either way it's going to break stuff. If we disallow multiline
, global
, sticky
in the current regex
then it means we need to use a subset of the existing flags type. If we change it to be the fully featured version that can deal with groups then people are going to deal with a NonEmptyArray (Maybe String)
at every use site.
@natefaubion
There isn't a parser that can only be written with regular expressions.
Most combinators in the library are replaceable with a regex. In fact, in Emacs most parsers (except tree-sitter which Idk if it's using regexes) are indirectly or directly using regexes.
Emacs has its own combinators library with stuff like whitespace
, wordchar
, one-or-more
, that in compile-time gets turned to an IR, where the IR are optimized regexes. It's called rx
.
while I sympathize with the fact that JS regular expression semantics can be unintuitive, I don't think it's really the job of purescript-parsing to educate users about those semantics, or try to paper over those semantics in any way. It's FFI
That would've been fair if the library was just an FFI to a JS library, but it isn't. I don't see how FFI would've affected any of the design decisions.
@garyb
Given that nobody has encountered this bug in the 2.5 years it's been defined this way, and that the current definition can't take advantage of regex groups, I don't imagine it's being used to do much heavy lifting in the wild
I'm afraid the reason is different. Granted, I'm not a web developer (wasn't at least), so correct me if I'm wrong, but parsing is an exceptionally rare thing for a frontend to do. Instead it's usually done on backend. Now, before starting my fullstack project at work I've searched if someone written serious fullstack app in PureScript and IIRC I only found a few pet project stories. So I kind of took risk attempting that.
This point alone limits your auditory.
Furthermore, after digging into Parsing
I haven't found a way to get the all matches from the text while just advancing by one character on fail. You know, basic "grep
ping" stuff. Sure someone stumbled upon that, right? So I asked a question, which got participation from Parsing
maintainers in particular. Turned out, the best support the library has for this seemingly basic usecase is indirectly via splitCap
. Whereas within just a single day difference I got two different usecases requiring similar solution (the other case is the one briefly described with HashTable
), one of which was covered via the hack with splitCap
and the other one isn't.
And the splitCap
function isn't in .Combinators
where a user would search for a solution, and not even in .String
where they may find it accidentally. It's in .String.Replace
, which I never even looked into because why would I, I've got nothing to replace.
And note that this basic question "how do I grep with Parsing
" wasn't asked before, or at least google does not find it.
Furthermore, I've got a lot of runtime problems using Parsing
for something beyond basic stuff, which (though I'm not 100% sure, it's just difficult to analyze in retrospective and I've got a lot of events during development) I attribute for the most part to the odd design decision that "fail with consuming" and "fail without consuming" are different kinds of failure with apparently different level of support. I was told it was copied from Parsec
, but I never worked with Parsec
, so it's brand new for me here. But as a new library user I've got so much troubles with so much hacks in the code that α) I certainly would quit digging was it on my personal time and β) at this point I'm 80% sure if my project ± will get green light, I'll probably either look at another parsing lib or maybe write something ad-hoc (or maybe will replace PureScript on backend with something else, Idk, will see. I really like HTTPurple
though).
TL;DR of this wall is just that I doubt the library has enough users for an assumption "but nobody complained for 2.5 years" to work.
That would've been fair if the library was just an FFI to a JS library, but it isn't. I don't see how FFI would've affected any of the design decisions.
The regex
combinator is most certainly just FFI to a JS library in the end. That library is just the JS runtime, therefore it maps directly to JS RegExp semantics. There is no avoiding that, changing that, or papering over it. I sympathize because I think PureScript made an unfortunate (in hindsight) decision to treat Data.String.Regex as an apparently first class thing, when it really is just shelling out to FFI. I made the point above that this kind of functionality should really be in a JS
specific namespace to avoid that confusion, but that's difficult to change now.
Most combinators in the library are replaceable with a regex. In fact, in Emacs most parsers (except tree-sitter which Idk if it's using regexes) are indirectly or directly using regexes.
My point was the inverse. Parser combinators subsume regular expressions. You can implement all regular expression semantics in purescript-parsing without using the existing regex
combinator (which builds platform native regular expressions). You can then have any matching semantics you want! There would be no need for an odd little stringly DSL then. We don't do that, because we are assuming that users of the library are specifically wanting to use JS regular expressions for performance. I think it would be interesting to have a 100% PureScript regular expression library, but the performance would likely be orders of magnitude slower that platform native regular expressions which are all in native code, unfortunately.
My point was the inverse. Parser combinators subsume regular expressions. You can implement all regular expression semantics in purescript-parsing without using the existing
regex
combinator (which builds platform native regular expressions). You can then have any matching semantics you want! There would be no need for an odd little stringly DSL then.
Well, one basic parser that I haven't found in the library and had to implement with regexes (maybe I wasn't looking hard enough, but I just skimmed over all String-related parts and still don't see it) is a "word". A \bstring\b
or a \<string\>
regex.
To implement it I'd need a "word-boundary" parser, which I also do not see in the library.
To be clear, I don't mean to suggest that purescript-parsing exposes all primitive combinators to replace regular expression semantics. Then we would already have a non-RegExp-based regular expression library! I'm just saying that they can all be implemented since all parsers do is look at the string and consume something (or not). Again, I think it would be great to have combinators for all common regular expression tokens.
You're right. And on the second thought, a "word" can be implemented with takeWhile isAlphaNum
(unless I'm missing something).
Originally posted by @garyb in https://github.com/purescript-contrib/purescript-parsing/issues/232#issuecomment-2417400616
Should we forbid
multiline
? Should we make themultiline
flag work properly?