vindarel / cl-str

Modern, simple and consistent Common Lisp string manipulation library.
https://vindarel.github.io/cl-str/
MIT License

split string doesn't allow regular expression #63

Closed mdbergmann closed 11 months ago

mdbergmann commented 3 years ago

See:

CL-USER> (str:split "\\n\\n" (str:join "" '("foo" #\newline #\newline "bar")))
("foo

bar")
CL-USER> (ppcre:split "\\n\\n" (str:join "" '("foo" #\newline #\newline "bar")))
("foo" "bar")
CL-USER> (ppcre:split "\\s\\s" (str:join "" '("foo" #\newline #\newline "bar")))
("foo" "bar")

Since ppcre is already used under the hood, it would be great to support splitting by regex. Maybe a separate function, re-split?

vindarel commented 3 years ago

Hello, indeed, and that is a feature. str:split explicitly quotes meta characters to not allow regexps. The documentation and the docstring should state this explicitly.

And indeed, we can use ppcre:split for that (and we always can, because ppcre is a dependency). At first sight, I think adding re-split would not add value and is not worth duplicating a function. Enhancing the README and the docstring to refer to ppcre would have been enough in your case?
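As a rough illustration of the difference described above (not taken from the str sources), the sketch below splits the same string twice with ppcre: once treating the separator as a regex, and once with its meta characters quoted via `ppcre:quote-meta-chars`. It assumes Quicklisp is available to load cl-ppcre; the variable names are made up for the example.

```lisp
;; Assumption: Quicklisp is installed and can load cl-ppcre.
(ql:quickload "cl-ppcre" :silent t)

;; As a regex, \s+ matches any run of whitespace, so the string is split.
(defparameter *by-regex*
  (ppcre:split "\\s+" "foo  bar"))

;; With meta characters quoted, the separator only matches the literal
;; text \s+ (backslash, s, plus), which does not occur in the input.
(defparameter *by-literal*
  (ppcre:split (ppcre:quote-meta-chars "\\s+") "foo  bar"))
```

The second call leaves the string in one piece, which mirrors the literal-separator behavior reported for str:split at the top of this issue.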

mdbergmann commented 3 years ago

> explicitly quotes meta characters to not allow regexps

Yeah. I've seen that. Had a glimpse at the sources.

> Enhancing the README and the docstring to refer to ppcre would have been enough in your case?

Well. I guess it has to if you don't want to add it. I find it unfortunate however to fall back to ppcre directly to perform a split of a string, which forces me to mix namespaces of 'str' and 'ppcre' when only 'str' would suffice.

From an API perspective this could be controlled via key parameters, which would also let rsplit be folded into split, for instance:

(split "o" "foo" :reverse t)   ;; instead of `rsplit`

(split "o{2}" "foor" :regex t)

Manfred

vindarel commented 3 years ago

> if you don't want to add it.

I'm not ruling it out.

> I find it unfortunate however to fall back to ppcre directly to perform a split of a string, which forces me to mix namespaces of 'str' and 'ppcre' when only 'str' would suffice.

yeah I understand this too. But:

> this could be controlled via key parameters

yes +1, we do this for some functions but it could be generalized.

mdbergmann commented 3 years ago

> when we think "regexp", it might be best to turn to ppcre.

A regex is just a representation of an arbitrary string, the most flexible way to represent one. Regex is not necessarily bound to ppcre. It just happens that ppcre is the library that 'understands' them. However, ppcre is much more low-level than str is.

I don't care so much whether it is a regex to use for splitting as long as I can use an arbitrary string. (Insofar I would probably refrain from re-split, but just have a split). I.e. splitting a text file with Windows line endings I have to use this work around.

(str:split (str:join "" '(#\return #\newline)) 
           (str:join "" '("foo" #\return #\newline "bar")))

I see string splitting as essential for string parsing. It is a kind of lightweight alternative to capturing (which really is about regexes), but in order to be usable for parsing it must allow arbitrary strings for splitting.
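The Windows line-ending workaround above can also be written by building the CRLF separator once, or by handing ppcre a regex directly. This is only a sketch under the assumption that Quicklisp can load cl-ppcre; `*crlf*`, `*text*` and the result variables are illustrative names, not part of str.

```lisp
;; Assumption: Quicklisp is installed and can load cl-ppcre.
(ql:quickload "cl-ppcre" :silent t)

;; Build the two-character CRLF separator once.
(defparameter *crlf* (coerce '(#\Return #\Newline) 'string))

;; Sample input with a Windows line ending (illustrative data).
(defparameter *text* (concatenate 'string "foo" *crlf* "bar"))

;; Literal split: the parse tree (:sequence "...") matches the string as-is.
(defparameter *literal-lines*
  (ppcre:split `(:sequence ,*crlf*) *text*))

;; Regex split: \r?\n accepts both Windows and Unix line endings.
(defparameter *regex-lines*
  (ppcre:split "\\r?\\n" *text*))
```

Both calls return the two lines; the regex form has the advantage of also handling files with plain newlines.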

mdbergmann commented 3 years ago

> Or call starts-with-p but with a regexp?

That's a valid point. What about other functions like 'starts-with' or 'ends-with'? My take is that those are much less dependent on regular expressions than splitting is. Though it might still be necessary to supply a tab character to a 'starts-with' function. I'm not sure if there is any other way of encoding special characters in a string so that it can be applied in 'starts-with' or 'split' without using a regex.

vindarel commented 3 years ago

Thanks for detailing your use case and motivation.

> (Insofar I would probably refrain from re-split, but just have a split).

split with a :regex (:re? both?) key would be good for you? That looks good, we should do it.

> I.e. splitting a text file with Windows line endings I have to use this work around.

Here str should probably help and provide specific variables or function parameters. So you would not look for a regexp, but use a built-in explicitly.

> Though it might still be necessary to supply a tab character to a 'starts-with' function.

+1, we should be able to give a character to starts-with-p, as with other functions.

mdbergmann commented 3 years ago

Hi.

> split with a :regex (:re? both?) key would be good for you?

I would choose :regex

kilianmh commented 1 year ago

So then str:split simply needs the :regex keyword parameter and an if clause like this?

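(if regex
    ;; separator is interpreted as a regular expression
    (ppcre:split separator s :limit limit :start start :end end)
    ;; separator is matched literally via a parse tree
    (ppcre:split `(:sequence ,(string separator))
                 s
                 :limit limit :start start :end end))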

Or do we need more adjustments, such as support for the other ppcre:split parameters (with-registers-p, omit-unmatched-p, sharedp), or something else? @mdbergmann @vindarel
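To make the proposal concrete, here is a minimal, self-contained sketch of a split function with a `:regex` key. `split-sketch` is a hypothetical name, and this is not the code that ended up in str: it ignores `:start`/`:end`, the omit-nulls handling, and whatever str does about trailing empty strings. It assumes Quicklisp can load cl-ppcre.

```lisp
;; Assumption: Quicklisp is installed and can load cl-ppcre.
(ql:quickload "cl-ppcre" :silent t)

(defun split-sketch (separator s &key regex limit)
  "Split S on SEPARATOR. Hypothetical helper, not part of str.
When REGEX is true, SEPARATOR is used as a regular expression;
otherwise it is matched literally."
  (ppcre:split (if regex
                   separator
                   ;; A parse tree matches the string character for
                   ;; character, so meta characters are not interpreted.
                   `(:sequence ,(string separator)))
               s
               :limit limit))

;; Regex mode: digits act as the separator.
(defparameter *regex-parts*
  (split-sketch "[0-9]+" "some987stupid123string" :regex t))

;; Literal mode: the same separator only matches the text [0-9]+ itself.
(defparameter *literal-parts*
  (split-sketch "[0-9]+" "a[0-9]+b"))
```

Keeping the default literal means existing callers would see no change in behavior, which seems to be the intent of the discussion above.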

vindarel commented 1 year ago

Giving split, rsplit and split-omit-nulls a :regex key argument is probably useful, although I haven't encountered the need myself.

An example I can think of:

(str:split "[0-9]+" "some987stupid123string" :regex t) ;; => ("some" "stupid" "string")

ccqpein commented 11 months ago

I had the same idea about improving str:split today. Instead of using a regex, I thought the separator could be a list containing all the separators, like (str:split '(";" "," " ") "some;thing, stupid ")

But it looks like regex is the more general improvement. I am happy with adding the :regex keyword.

Update: Opened a PR for split regex. https://github.com/vindarel/cl-str/pull/110
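For comparison, the list-of-separators idea can be expressed on top of ppcre with an alternation parse tree. This is only a sketch: `split-on-any` is a hypothetical helper, not something str provides, and it assumes Quicklisp can load cl-ppcre.

```lisp
;; Assumption: Quicklisp is installed and can load cl-ppcre.
(ql:quickload "cl-ppcre" :silent t)

(defun split-on-any (separators s)
  "Split S wherever any of the literal strings in SEPARATORS occurs.
Hypothetical helper, not part of str."
  (ppcre:split `(:alternation ,@(mapcar #'string separators)) s))

;; Adjacent separators (", ") leave an empty string between them.
(defparameter *parts*
  (split-on-any '(";" "," " ") "some;thing, stupid "))

;; Dropping the empty strings gives the words only.
(defparameter *words*
  (remove "" *parts* :test #'string=))
```

Because the separators go into a parse tree rather than a regex string, characters such as "." or "+" in the list would still be matched literally.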

vindarel commented 11 months ago

Thanks for doing it!

May it serve you well for advent of code ;)

ccqpein commented 11 months ago

Ah! @vindarel, it is you who made this tool. I thought the ID looked familiar!