purescript-contrib / purescript-parsing

A parser combinator library based on Parsec
https://pursuit.purescript.org/packages/purescript-parsing
BSD 2-Clause "Simplified" License
150 stars 52 forks source link

multiline flag doesn't work #233

Open jamesdbrock opened 6 days ago

jamesdbrock commented 6 days ago

I think maybe we should change the options that can even be used with regex parsing to exclude multiline:

logShow $ runParser "some\nvarious\nlines" (regexP "various$" *> PS.rest)

(Right "rious\nlines")

This is because it advances the consumed parser position by the length of the first pattern match. It would need to be able to know the offset of that match as well as the length of it to update consumed correctly. We could get that through running search too perhaps, but I think it would be nice to disallow flag(s) that don't make sense for a parser style definition.

Originally posted by @garyb in https://github.com/purescript-contrib/purescript-parsing/issues/232#issuecomment-2417400616

Should we forbid multiline? Should we make the multiline flag work properly?

jamesdbrock commented 6 days ago

More comments from @garyb https://github.com/purescript-contrib/purescript-parsing/pull/232#discussion_r1804753425

It's just plain broken with multiline currently. It's advancing consumed by 7 for the "various" match, but should advance it an additional 5 as that's the position that "various" starts (I made sure to vary the line lengths for the example to reinforce how weird the behaviour is and have the resulting state start mid-line, as tried it with "foo\nbar\nbaz" at first and almost ended up confusing myself :smile:)

It is fixable if we want to support it though, we'd have to do something like perform a search to find the offset of the match and include that when updating the consumed position of the parser. Or most optimally, offer another function in purescript-strings that can report the offset along with the match, since the info is there in the returned object in the underlying JS, and then use that in the implementation here.

jamesdbrock commented 6 days ago

It would be nice to have the following parsers in Parsing.String.Basic

-- | Zero-width parser succeeds at beginning of input or following `\n` or `\r\n`.
lineBegin :: forall m. ParserT String m Unit
-- | Zero-width parser succeeds at `eof` or before `\n` or `\r\n`.
lineEnd :: forall m. ParsetT String m Unit
garyb commented 6 days ago

My opinion on disallowing multiline comes from this:

I think personally I just view this parser as serving a different purpose than for what enabling multiline would give it. I view it as similar to satisfy, it's just there to define a range or pattern of acceptable characters. If I wanted to be able to skip ahead over an arbitrary number of lines before finding something I'd be doing that explicitly using other parser combinators.

It's already a compromised regex interface too - you can't really use it the way you normally would since the groups in the pattern are inaccessible.

(I'm not saying I'm right in this opinion, just giving context for why I wouldn't miss multiline at all).

And one more thought: when multiline is enabled it allows the parser to both skip and consume at the same time, which is unlike any of the other basic parsers I think. But then again, we're essentially embedding another parsing language here, so I guess it's not that different than running a compound parser built with combinators that could have the same effect.

On the side of allowing it, I guess it's harmless enough since it's something you have to opt into - it won't affect me whether we allow and fix it or not, because I'm unlikely to use it either way. :smile:

I'd probably go further and suggest we might need two regex functions, a "simple" one which is basically what we have now, where it returns the whole match, and we'd disallow multiline, global, etc. for it. The new one would be the "full" regex where it allows whatever options and returns the match groups in the parser result (albeit still would have the implict ^(...) around the provided pattern).

Hi-Angel commented 6 days ago

Since all multiline cons got copied here but not pros, let me mention them as well:

  1. multiline is the behavior users would expect from regexes (for side-readers: "multiline" means "a $ will match a newline", and the suggestions in this thread is to basically remove such behavior).
  2. In parsing in general, matching newline characters manually is frowned upon in preference of some "newline combinator", and as mentioned a few comments above there currently isn't a parser for that besides the $ regex.
  3. The documentation recommends to prefer regex over other parsers, so people will frequently use regex.
garyb commented 6 days ago

is the behavior users would expect from regexes

I understand your point here, but for my background it's not true, so you can't claim it universally. :smile:

garyb commented 6 days ago

@Hi-Angel how do you feel about my suggestion of the two different regex parsers?

Hi-Angel commented 6 days ago

is the behavior users would expect from regexes

I understand your point here, but for my background it's not true, so you can't claim it universally. 😄

Out of curiosity, where do you usually use regular expressions and how? I struggle to come up with a prolonged workflow that would result in a different expectation. Text editors, terminal utilities, most programming languages, office suites… they all default to what JS calls a "multiline behavior".

@Hi-Angel how do you feel about my suggestion of the two different regex parsers?

Well, renaming regex would be an API break, so I presume if you do that then most likely it would be a regex and regexFull. Let's presume that the distinction will be mentioned in the first paragraph, to exclude possibility that a user would give up reading everything because "oh, TL;DR, I know how to use regexes, so let's skip to the example to see how to use the args and try it".

This still leaves the problem that regexFull will be buggy.

Hence, I'm not really clear what such change would solve. If you know that you don't have motivation or time to fix multiline and you want to make very explicit that this option is broken, up to the point you're ready to change the API — in that case the easiest and IMO most productive way would be to rename multiline to multilineBroken, and of course add the documentation explaining the situation.

Hi-Angel commented 6 days ago

in that case the easiest and IMO most productive way would be to rename multiline to multilineBroken, and of course add the documentation explaining the situation.

I think in that case it's still worth mentioning somewhere at the beginning that the default behavior of $ is matching end of string and not end of a line. Because otherwise a user will spend time figuring out why matching doesn't work, before realizing they'd have to pass multilineBroken, and then going to read the explanation of why it's "broken".

UPD: or maybe even better: adding an example with multilineBroken, which a user would surely notice, and then would go reading why it is "broken".

jamesdbrock commented 6 days ago
-- | Matches `\n` or `\r\n`.
newline :: forall m. ParserT String m String
newline = string "\n" <|> string "\r\n"

This newline parser is also one which we could consider adding. Maybe this is more useful than lineBegin and lineEnd. Certainly the implementation is simpler.

natefaubion commented 6 days ago

I mostly agree with @garyb, but I'd maybe go a bit further:

natefaubion commented 6 days ago

And while I sympathize with the fact that JS regular expression semantics can be unintuitive, I don't think it's really the job of purescript-parsing to educate users about those semantics, or try to paper over those semantics in any way. It's FFI. It is what it is, and the note about undefined behavior should suffice. If anything, I think this is good evidence that baking in regular expression support (in any library) is problematic, and should really go in something like a purescript-parsing-js-regexp library (same goes for Data.String.Regex being in purescript-strings-js-regexp).

garyb commented 6 days ago

Out of curiosity, where do you usually use regular expressions and how? I struggle to come up with a prolonged workflow that would result in a different expectation. Text editors, terminal utilities, most programming languages, office suites… they all default to what JS calls a "multiline behavior".

I think I just don't run into many situations where I need to match line boundaries. I think the last time I did was probably a year or two ago. :smile:

And the behaviour doesn't surprise me in here because JavaScript was my first programming language over 20 years ago and I've worked with it on and off ever since. :wink:

If you know that you don't have motivation or time to fix multiline and you want to make very explicit that this option is broken

That's not my point at all, the fix is trivial (see my note above that James copied over from the PR). It's more about the expectations of what regex is intended for. Given that nobody has encountered this bug in the 2.5 years it's been defined this way, and that the current definition can't take advantage of regex groups, I don't imagine it's being used to do much heavy lifting in the wild (understandably, this is a parser combinator library after all), so the kind of thing your suggesting is essentially a new use case for regular expressions in the parser.

I think either way it's going to break stuff. If we disallow multiline, global, sticky in the current regex then it means we need to use a subset of the existing flags type. If we change it to be the fully featured version that can deal with groups then people are going to deal with a NonEmptyArray (Maybe String) at every use site.

Hi-Angel commented 6 days ago

@natefaubion

There isn't a parser that can only be written with regular expressions.

Most combinators in the library are replaceable with a regex. In fact, in Emacs most parsers (except tree-sitter which Idk if it's using regexes) are indirectly or directly using regexes.

Emacs has its own combinators library with stuff like whitespace, wordchar, one-or-more, that in compile-time gets turned to an IR, where the IR are optimized regexes. It's called rx.

while I sympathize with the fact that JS regular expression semantics can be unintuitive, I don't think it's really the job of purescript-parsing to educate users about those semantics, or try to paper over those semantics in any way. It's FFI

That would've been fair if the library was just an FFI to a JS library, but it isn't. I don't see how FFI would've affected any of the design decisions.

Hi-Angel commented 6 days ago

@garyb

Given that nobody has encountered this bug in the 2.5 years it's been defined this way, and that the current definition can't take advantage of regex groups, I don't imagine it's being used to do much heavy lifting in the wild

I'm afraid the reason is different. Granted, I'm not a web developer (wasn't at least), so correct me if I'm wrong, but parsing is an exceptionally rare thing for a frontend to do. Instead it's usually done on backend. Now, before starting my fullstack project at work I've searched if someone written serious fullstack app in PureScript and IIRC I only found a few pet project stories. So I kind of took risk attempting that.

This point alone limits your auditory.

Furthermore, after digging into Parsing I haven't found a way to get the all matches from the text while just advancing by one character on fail. You know, basic "grepping" stuff. Sure someone stumbled upon that, right? So I asked a question, which got participation from Parsing maintainers in particular. Turned out, the best support the library has for this seemingly basic usecase is indirectly via splitCap. Whereas within just a single day difference I got two different usecases requiring similar solution (the other case is the one briefly described with HashTable), one of which was covered via the hack with splitCap and the other one isn't.

And the splitCap function isn't in .Combinators where a user would search for a solution, and not even in .String where they may find it accidentally. It's in .String.Replace, which I never even looked into because why would I, I've got nothing to replace.

And note that this basic question "how do I grep with Parsing" wasn't asked before, or at least google does not find it.

Furthermore, I've got a lot of runtime problems using Parsing for something beyond basic stuff, which (though I'm not 100% sure, it's just difficult to analyze in retrospective and I've got a lot of events during development) I attribute for the most part to the odd design decision that "fail with consuming" and "fail without consuming" are different kinds of failure with apparently different level of support. I was told it was copied from Parsec, but I never worked with Parsec, so it's brand new for me here. But as a new library user I've got so much troubles with so much hacks in the code that α) I certainly would quit digging was it on my personal time and β) at this point I'm 80% sure if my project ± will get green light, I'll probably either look at another parsing lib or maybe write something ad-hoc (or maybe will replace PureScript on backend with something else, Idk, will see. I really like HTTPurple though).


TL;DR of this wall is just that I doubt the library has enough users for an assumption "but nobody complained for 2.5 years" to work.

natefaubion commented 6 days ago

That would've been fair if the library was just an FFI to a JS library, but it isn't. I don't see how FFI would've affected any of the design decisions.

The regex combinator is most certainly just FFI to a JS library in the end. That library is just the JS runtime, therefore it maps directly to JS RegExp semantics. There is no avoiding that, changing that, or papering over it. I sympathize because I think PureScript made an unfortunate (in hindsight) decision to treat Data.String.Regex as an apparently first class thing, when it really is just shelling out to FFI. I made the point above that this kind of functionality should really be in a JS specific namespace to avoid that confusion, but that's difficult to change now.

Most combinators in the library are replaceable with a regex. In fact, in Emacs most parsers (except tree-sitter which Idk if it's using regexes) are indirectly or directly using regexes.

My point was the inverse. Parser combinators subsume regular expressions. You can implement all regular expression semantics in purescript-parsing without using the existing regex combinator (which builds platform native regular expressions). You can then have any matching semantics you want! There would be no need for an odd little stringly DSL then. We don't do that, because we are assuming that users of the library are specifically wanting to use JS regular expressions for performance. I think it would be interesting to have a 100% PureScript regular expression library, but the performance would likely be orders of magnitude slower that platform native regular expressions which are all in native code, unfortunately.

Hi-Angel commented 6 days ago

My point was the inverse. Parser combinators subsume regular expressions. You can implement all regular expression semantics in purescript-parsing without using the existing regex combinator (which builds platform native regular expressions). You can then have any matching semantics you want! There would be no need for an odd little stringly DSL then.

Well, one basic parser that I haven't found in the library and had to implement with regexes (maybe I wasn't looking hard enough, but I just skimmed over all String-related parts and still don't see it) is a "word". A \bstring\b or a \<string\> regex.

To implement it I'd need a "word-boundary" parser, which I also do not see in the library.

natefaubion commented 6 days ago

To be clear, I don't mean to suggest that purescript-parsing exposes all primitive combinators to replace regular expression semantics. Then we would already have a non-RegExp-based regular expression library! I'm just saying that they can all be implemented since all parsers do is look at the string and consume something (or not). Again, I think it would be great to have combinators for all common regular expression tokens.

Hi-Angel commented 6 days ago

You're right. And on the second thought, a "word" can be implemented with takeWhile isAlphaNum (unless I'm missing something).