haskell / text

Haskell library for space- and time-efficient operations over Unicode text.
http://hackage.haskell.org/package/text
BSD 2-Clause "Simplified" License
408 stars 159 forks source link

Make `Data.Text.replace` total #425

Open sol opened 2 years ago

sol commented 2 years ago

Data.Text.replace errors if the needle is the empty string. Every other implementation I tried matches the empty string against every position in the string. I suggest that for consistency with other languages and to avoiding partiality we adopt that behavior.

Expected behavior:

ghci> replace "" "x" "foo"
"xfxoxox"

Actual behavior:

ghci> replace "" "x" "foo"
"*** Exception: Data.Text.replace: empty input
CallStack (from HasCallStack):
  error, called at libraries/text/src/Data/Text.hs:1862:18 in text-1.2.5.0:Data.Text
Rust:
fn main() {
    println!("{}", "foo".replace("", "x")); // prints "xfxoxox"
}
$ rustc main.rs && ./main
xfxoxox
Python:
>>> 'foo'.replace('', 'x')
'xfxoxox'
JavaScript:
> 'foo'.replaceAll('', 'x')
'xfxoxox'
Bodigrim commented 2 years ago

I'm sympathetic to the idea, but technically this is a breaking change.

Bodigrim commented 2 years ago

@sol could you please gather community feedback on this change?

sol commented 2 years ago

@Bodigrim I would love to help, however, even if I try hard, I'll not be able to free up time to mediate this issue.

If somebody else wants to pick it up, this might be useful:

The reason why I think this change is a net positive is that relying on the semantics of pure exceptions is inherently risky:

https://gitlab.haskell.org/ghc/ghc/-/issues/1171 https://gitlab.haskell.org/ghc/ghc/-/issues/2273 https://gitlab.haskell.org/ghc/ghc/-/issues/5561 https://gitlab.haskell.org/ghc/ghc/-/issues/5129

Read: If your program relies on the type of pure exceptions, it may produce different results depending on GHC version and optimization level. For that reason some people will advice that you should only ever use a partial function if you make sure that your input will not trigger an exception.

Observations:

  1. Programs that follow that advice will not be affected by the proposed change.
  2. Programs that did not follow that advice may have their semantics changed, from possibly non-deterministic behavior to behaving deterministic.
Lysxia commented 2 years ago

This is a very benign breaking change. People push back against changes that cause unnecessary churn, but nobody depends on replace "" _ _ being undefined. I will try to collect a few more such changes so we can ask/notify about them all in one post (there is a similar "nobody should care, but do you" conundrum in improving foldl).


I've confirmed that every occurrence of Text.replace on Hackage (~50 packages) is ostensibly applied or meant to be applied to a non-empty string. Almost all of them are a literal or some obvious construction like Text.singleton _ or "foo" <> _ <> "bar", so they're quick to go through. The few other uses are a little less obvious, but it is still clear that the empty string is not expected; listed below for exhaustiveness:

  1. nixpkgs-update: https://github.com/ryantm/nixpkgs-update/blob/934b497d8e32a42238a9525126614ad74142da78/src/File.hs#L48
  2. yesod-form: https://github.com/yesodweb/yesod/blob/99c1fd49a33cc3a87d07e898ab7b782f9f3bf679/yesod-form/Yesod/Form/Functions.hs#L686
  3. sdp4text defines a class method replaceBy = replace, which I couldn't find any use of on Hackage https://github.com/andreymulik/sdp4text/blob/12990eaa7b39ff1cad7bd52be26986fa0bd0a30d/src/SDP/Text.hs#L172
  4. rosmsg, a 6-year-old package, https://github.com/RoboticsHS/rosmsg/blob/9dd9e32fcf544ef75b1d2823f27e31076df396c9/src/Robotics/ROS/Msg/MD5.hs#L33
sergv commented 2 years ago

I've noticed that breakOn and breakOnEnd also throw exceptions when their pattern is an empty string. Surprisingly this is not mentioned in the documentation and can only be inferred from the presence of HasCallStack constraint.

I think it might be reasonable to make breakOn and breakOnEnd total. After all, breaking on empty string once can be treated as not doing anything - there's an empty string at the start and end of any string. So something like breakOn "" str = ("", str) and breakOnEnd "" str = (str, "") would make those functions total and probably shouldn't be too surprising.

There's also breakOnAll which is also partial, but at least this one mentions that the pattern to break on may not be empty in the docs. However I think it could be usefully defined on empty pattern as well - just explode original text into individual characters, breakOnAll "" str = map singleton (unpack str). At first glance this even seems to be consistent to what's written in the docs of breakOnAll, which is Find all non-overlapping instances of needle in haystack - all non-overlapping occurrences of an empty string in 'gaps' between characters got broken on.

tbidne commented 2 years ago

For comparison, splitOn (also partial) has this to say:

String to split on. If this string is empty, an error will occur.

An empty delimiter is invalid, and will cause an error to be raised.

So I would suggest either making all of these total (my preference) or at least copying splitOn's documentation.