metosin / malli

High-performance data-driven data specification library for Clojure/Script.
Eclipse Public License 2.0
1.46k stars 208 forks source link

re schema doesn't properly validate values with line breaks #862

Open lasiltan opened 1 year ago

lasiltan commented 1 year ago
(malli.core/validate [:re #"^thing$"] "thing\n")
=> true

Compare to

(re-matches #"^thing$" "thing\n")
=> nil
lasiltan commented 1 year ago

This ultimately boils down to the difference in java.util.regex.Matcher methods find and matches (clojure.core/re-findvs. clojure.core/re-matches).

(let [pattern (Pattern/compile "^thing$")
      matcher (.matcher pattern "thing\n")]
  (.matches matcher))
=> false

(let [pattern (Pattern/compile "^thing$")
      matcher (.matcher pattern "thing\n")]
  (.find matcher))
=> true

The javadoc on java.util.regex.Matcher/find states:

     * @return  {@code true} if, and only if, a subsequence of the input
     *          sequence matches this matcher's pattern

whereas the javadoc for java.util.regex.Matcher/matches states:

Attempts to match the entire region against the pattern.

To me, at least, the functionality of java.util.regex.Matcher/matches, i.e clojure.core/re-matches, is what I'm expecting.

ikitommi commented 1 year ago

Interesting. Tested with few alternatives:

Plumatic

(require '[schema.core :as schema])
(schema/validate #"thing" "thing\n")
; => "thing\n" (match)

JSON Schema Spec

{"type": "string", "pattern": "^thing$"}
// => null (match)

Javascript

RegExp('^thing$').exec('thing\n')
// => null (no match)

So, Malli works currently like Plumatic and JSON Schema do, but not like default JavaScript.

ikitommi commented 1 year ago

Do you have more external references (on languages / libraries etc) on what is the correct way to handle this? Options:

  1. keep the current way as others (JSON Schema + Plumatic) are doing it like this
  2. change the default as it's obviously less correct, need to handle JSON Schema mapping
  3. new option to m/-re-schema to override this, e.g. users can configure this
camsaul commented 1 year ago

I think the behavior for re-find is definitely a little sneaky with it allowing a newline at the end, but I think using re-find behavior is more intuitive than re-matches behavior.

For example if I have the schema

[:re #"\dcans"]

I would expect that to match any string that contained <num>cans anywhere in the string, e.g. one can, 2cans, 3cans or something like that. If I wanted my string to start with <num>cans, then I'd use ^. re-matches behavior is basically forcing ^ and $ semantics when sometimes that's not what you want. You can always opt in to that behavior by using ^ and $.

camsaul commented 1 year ago

I should also mention that you can always use \z instead of $ if you actually want the end of the input, $ is technically The end of a line and \z is the end of the input.

(re-find "^string\z" "string\n") => nil

See https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html

lasiltan commented 4 months ago

I didn't even know about \z. I think that's good enough for me.