jsonrainbow / json-schema

PHP implementation of JSON schema. Fork of the http://jsonschemaphpv.sourceforge.net/ project
MIT License

Two possible improvements to type coercion #379

Closed gtuccini closed 7 years ago

gtuccini commented 7 years ago

It would be useful to have a coercible string representation for null. "null" or '' would work just fine.

I think also that the coercion should happen only when the subject doesn't match any of the allowed types. Currently, given the following subject

{
    "value": "true"
}

and the following schema

{
    "properties": {
        "value": {
            "type": ["boolean", "string"]
        }
    },
    "type": "object"
}

"value" is coerced to boolean, even if its type -- string -- is among the allowed ones.

What do you think?

erayd commented 7 years ago

@shmax - type coercion is yours; what do you think of this? IMO both are excellent ideas.

shmax commented 7 years ago

Will definitely look into these issues this weekend. My first 10-second impression is that coercing to null for a string would violate the type (as null is not a string), so at best it would only be valid if you had some kind of mixed type in your schema. Not sure about the second point, but I can tell you that the current logic we have in place doesn't really have any opinion about the order, so adding some logic like you suggest probably would be no worse. More soon. Thanks for the contribution.

erayd commented 7 years ago

Re 10 second impression - isn't that the point of the coercion feature in the first place? If coercing "true" to a boolean is OK, that feels fairly equivalent to coercing "null" (or an empty string) to a null.

shmax commented 7 years ago

Hmm, well I don't think he really gave us enough detail about what he wants for the null case, but I hope we can agree that coercing string "null" to actual null for type string is a non-starter, because again, you would be coercing to something that is the wrong type. And further, I think you would be asking for a lot of bugs in your client app. Suppose you have a form for validating user names on your social site and a user happens to want to call himself "null". Perfectly legitimate string, but it would get converted to null by the coercion feature and cause a black hole somewhere.

The only coercion case involving strings and the type "null" that I can think of that would make sense would be to coerce empty string to type null. Our pals over at ajv seem to agree: http://take.ms/QYQEo

Anyway, I would be happy to add that feature if it would help. Will get back to you on the other point.

erayd commented 7 years ago

...I hope we can agree that coercing string "null" to actual null for type string is a non-starter...

Oh, I certainly wasn't suggesting that - we agree there completely! If the schema demands a string, and the document being validated supplies a string, then it should be left well alone.

What I meant (and what I hope the OP was asking for) was that if the schema demands a null (i.e. defines something with the explicit type null), but the document contains a string "null" (or an empty string), then type-coercing that string into an actual null to force compliance seems reasonable, given that type-coercion is what the user asked for when they enabled CHECK_MODE_COERCE_TYPES.

erayd commented 7 years ago

For the record, I disagree with ajv on not coercing "null" to null, on the grounds that "true" and "false" are already considered valid targets for coercion, and "null" is qualitatively exactly the same type of situation.

erayd commented 7 years ago

Do we care about coercion reversibility? That could be an argument against coercing "null".

shmax commented 7 years ago

Hmm, I just don't know. You could be right. One thing to consider is that the stringification of null is sort of language dependent, and it's only a keyword in certain languages. PHP doesn't seem too keen to do it:

echo http_build_query(array(
  'foo'=>'bar',
  'baz'=>'boom',
  'cow'=>null
));
// "foo=bar&baz=boom"

but if you want to force it to a string, it comes out in caps:

var_export(null); // "NULL"

Javascript does:

console.log(encodeURIComponent(null)); // "null"

Python (not an expert, apologies):

from urllib.parse import urlencode
url = urlencode({'param1': 'foo', 'param2': None})

print(url) # "param1=foo&param2=None"

So I guess I wonder if teaching json-schema about "null" might be pushing things a little too far away from language agnosticism...

erayd commented 7 years ago

I do feel that this one is your call - it's 'your' feature, you did all the initial research and work around it, and it's not something I use myself.

My personal opinion is that we should coerce when (and only when) the schema defines a type, validation would fail without coercion, and the mapping from the input data to the correct value of the schema type is obvious. But I'd like to defer to your preference here, whatever that may be.

...but if you want to force it to a string, it comes out in caps...

var_export isn't returning a stringified value, it's returning valid PHP code (you'll note that NULL is not quoted). Caps follows convention, but is not required - NULL is case-insensitive in PHP.

So I guess I wonder if teaching json-schema about "null" might be pushing things a little too far away from language agnosticism...

That's a really good point, although again I feel it falls into the same basket as coercing "true" / "false" to boolean. If we refuse to support "null", then why should we support those? The only argument I can think of to not support "null", given the boolean thing, is that the boolean coercion is directly reversible, whereas "null" is not. So it would depend whether reversibility matters here.

shmax commented 7 years ago

NULL is case-insensitive in PHP

That was exactly my point. "null" and "NULL" are equivalent, so if we support coercing "null" then sooner or later someone is going to ask us to coerce "NULL", and then after that someone will ask for "None", then "nil", and on and on. Now we have this wad of strings that we will consider to be "null", thus weakening the "null" type's ability to do its job (suppose you're on a Python box, and a request comes in from some PHP client with "null". Rather than catch what is clearly an invalid value in the Python environment, it passes coercion with flying colors).

That's a really good point, although again I feel it falls into the same basket as coercing "true" / "false" to boolean. If we refuse to support "null"

Because "true" and "false" are fairly universal. I don't know of any major (non-deliberately oddball) languages that don't have them as keywords. "null" is a keyword in some languages, but not all, and even the ones that do use it don't necessarily stringify it when building query params.

Rather than beat it to death, why don't we start with coercing empty string to null, and go from there?

gtuccini commented 7 years ago

I'll try to explain my use case a bit.

Suppose you have a list of entities whose attributes include the integer attribute "rank".

The list is accessible at entities.html and by default lists all the entities - i.e. there is no filter set on "rank".

You can optionally set or remove a filter on "rank" through a GET form, which generates the following urls:

I'd like to coerce and validate the query string using the following schema:

{
    "properties": {
        "rankisgreaterthan": {
            "default": null,
            "type": ["null", "integer"]
        }
    },
    "required": ["rankisgreaterthan"],
    "type": "object"
}

For this to work, an empty string has to be coerced to null. There is no ambiguity, however, because null is explicitly listed among the expected types.
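To make the use case concrete, here is a minimal Python sketch of the coercion being asked for. It is not this library's code: the function names and the two sample query strings in the usage note are hypothetical, and treating a missing parameter like an empty one is an assumption based on the schema's "default": null.

```python
import re
from urllib.parse import parse_qs


def coerce_null_or_integer(raw):
    """Coerce a query-string value against the type list ["null", "integer"]."""
    if raw == "":
        # Empty string becomes null, because "null" is an allowed type.
        return None
    if re.fullmatch(r"-?[0-9]+", raw):
        # A string of digits becomes an integer.
        return int(raw)
    raise ValueError(f"{raw!r} is neither null nor an integer")


def read_filter(query_string):
    """Parse a query string and coerce the hypothetical rank filter."""
    # keep_blank_values=True keeps "rankisgreaterthan=" as an empty string
    # instead of dropping the parameter.
    params = parse_qs(query_string, keep_blank_values=True)
    # Assumption: a missing parameter is treated like an empty one.
    raw = params.get("rankisgreaterthan", [""])[0]
    return {"rankisgreaterthan": coerce_null_or_integer(raw)}
```

With these assumptions, read_filter("rankisgreaterthan=") yields a null filter and read_filter("rankisgreaterthan=5") yields the integer 5, while a value such as "abc" is rejected.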

shmax commented 7 years ago

Okay, so that's coercing empty string to null, which should be fine, right?

gtuccini commented 7 years ago

I just read your last comments. I concur with your preference for '' :)

erayd commented 7 years ago

Rather than beat it to death, why don't we start with coercing empty string to null, and go from there?

This sounds like a great idea - you're making really good points re the various strings that might represent null, so maybe that particular case is best left for now, until it's had more thought / someone specifically asks for it.

shmax commented 7 years ago

I think also that the coercion should happen only when the subject doesn't match any of the allowed types.

@gtuccini I'm still mulling this one. I guess I'm a little on the fence. Let's say we have a form on a website called "age". You can either enter a number, like "45", or a string representing your birth date, such as "1/1/1971". So, in my schema, I have this:

{
    "properties": {
        "value": {
            "type": ["integer", "string"]
        }
    },
    "type": "object"
}

On my back end I turn the coercion option on and validate the form. Post-coercion, I'm expecting either an integer 45 or string "1/1/1971", but with your idea I would get string "45", which seems to go against the spirit of the thing.

Can you provide any realistic counter examples?

erayd commented 7 years ago

@shmax

Can you provide any realistic counter examples?

A user enters 1971 into the form. It's impossible to tell whether this is a year, or an age, without fully understanding the concept of age, which is outside the scope of JSON schema. Similarly, 45 may refer to someone being born in 1945, rather than being 45 years old.

Personally, I feel that coercing an already-valid type opens up far too much room for ambiguity and getting things wrong - there's a point where it's more appropriate to validate this kind of thing using "pattern", or have some business logic in the application handle it - I think this one is on the "this should not be our problem" side of the line.

If you feel strongly in favour of coercing already-valid types though, what about adding e.g. CHECK_MODE_AGGRESSIVE_COERCION, so that this kind of behavior doesn't surface unless the user explicitly asks for it?

shmax commented 7 years ago

Heh, okay, so you're shooting holes in my sample test case. Fair enough. Let's try a different one. We have a field called "month". You can enter a number, such as "1", or a name, such as "February". We have the same situation, and we don't have to bring pattern into it or get sidetracked into the different ways one can represent a date.

shmax commented 7 years ago

Another idea would be to stop validating once a type validates (it might already work this way, I'll have to check). Then the schema designer can sort of game the coercion rules to his taste by putting his list of types in the order that he would prefer them to have precedence. In other words, you could use a system like this to short-circuit in your first example by putting "string" first. Someone like me who would rather coerce boolean straightaway would put it first in the list.
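A rough Python sketch of that order-based idea, for illustration only. The helper names are hypothetical, only a few scalar types are handled, and this is not a description of how the library actually walks the type list.

```python
import re

NO_COERCION = object()  # sentinel: no obvious mapping exists


def matches(value, type_name):
    """True if value already has the given JSON Schema type."""
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if type_name == "string":
        return isinstance(value, str)
    return False


def try_coerce(value, type_name):
    """Return the coerced value, or NO_COERCION if the string has no obvious mapping."""
    if isinstance(value, str):
        if type_name == "boolean" and value in ("true", "false"):
            return value == "true"
        if type_name == "integer" and re.fullmatch(r"-?[0-9]+", value):
            return int(value)
    return NO_COERCION


def coerce_in_listed_order(value, types):
    """Walk the types in schema order; the first match or coercion wins."""
    for type_name in types:
        if matches(value, type_name):
            return value
        coerced = try_coerce(value, type_name)
        if coerced is not NO_COERCION:
            return coerced
    raise ValueError(f"{value!r} fits none of {types}")
```

Under this sketch, "true" checked against ["boolean", "string"] becomes the boolean true, while the same input against ["string", "boolean"] is left as the string "true", so the list order alone decides the outcome.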

erayd commented 7 years ago

That still requires an understanding of what a month is, which JSON schema does not have. How do you propose to distinguish between your month example and the age example? Because you cannot make a rational decision about what to do with the value unless you can clearly and consistently know what the user is expecting to receive.

In lieu of such understanding, assuming that schema-compliance is sufficient seems reasonable, and reduces the potential for confusion.

Another idea would be to stop validating once a type validates (it might already work this way, I'll have to check). Then the schema designer can sort of game the coercion rules to his taste by putting his list of types in the order that he would prefer them. In other words, you could use a system like this to short-circuit in your first example by putting "string" first. Someone like me who would rather coerce boolean straightaway would put it first in the list.

I'll need to think about this one a bit more, but my initial instinct is deeply opposed to that kind of thing. IMO extending the schema to include a "typePreference" attribute would make more sense - schema extensions are legal (if not portable), and would clarify the situation somewhat.

erayd commented 7 years ago

To clarify - nothing wrong with stopping once a type validates; it's the "gaming things by tweaking the order" that I have an issue with.

erayd commented 7 years ago

A thought - do you know whether there has been anything regarding type-coercion proposed for the official spec? Because if you want to have this kind of thing widely adopted, and it hasn't already been discussed, proposing it there may be worthwhile.

shmax commented 7 years ago

That still requires an understanding of what a month is, which JSON schema does not have. How do you propose to distinguish between your month example and the age example? Because you cannot make a rational decision about what to do with the value unless you can clearly and consistently know what the user is expecting to receive.

I don't follow. Post-coercion, I'm expecting a numeric value to be an integer, which I will interpret as an ordinal value, and anything else to be some kind of natural language descriptor. With the proposed change an input of "5" would get me neither, which again, seems to contradict the whole point of coercion. I'm not suggesting that my use case is universal or always correct, but I am interested in seeing @gtuccini's real use case that prompted him to raise the issue.

shmax commented 7 years ago

I'll need to think about this one a bit more, but my initial instinct is deeply opposed to that kind of thing.

You're deeply opposed to lists of things, or lists of things in preferential order?

erayd commented 7 years ago

I'll try to rephrase.

If the user has supplied a schema that expects a type of either "integer" or "string", they are explicitly saying that either an integer or a string is an acceptable value. No more, no less.

If the value supplied is something other than an integer or a string, but can be coerced in an obvious way to one of those types to avoid failing validation, then this is a good thing, and as it's behind CHECK_MODE_COERCE_TYPES there's no risk of surprising the user with it, as it's what they asked for.

If the value supplied is an integer or a string (i.e. what the user asked for), and we mess with it anyway, then we're second-guessing the user's own business logic, which we should never, ever do. They have not told us what they intend to do with the values, there is no way for us to infer what they want to do with the values, and understanding their business logic is well outside the scope of what JSON schema is for. Randomly mutating valid data without being asked, especially when we don't have a reasonable basis for doing so, is not something a validator should ever do, under any circumstances.

erayd commented 7 years ago

You're deeply opposed to lists of things, or lists of things in preferential order?

I'm deeply opposed to "gaming the system" by trying to hack business logic into a schema-validator via the element order.

shmax commented 7 years ago

I'm deeply opposed to "gaming the system" by trying to hack business logic into a schema-validator via the element order.

Really? I guess I'm having trouble mustering any outrage over it, mainly because it's already happening in some form; I believe we're walking over the types in the order that they're encountered, and the first one that can be coerced will be coerced, which will affect the following type's chances for doing further coercion.

The nice thing about the idea is that I believe both of our use cases can be supported (if we stop looping over our types once something validates--really need to look at the code), at least in my month example. If I want my integer coercion to always trump the string, then I make sure it's first in the list, then things work the way I like. If @gtuccini wants it the other way, then he just puts string first, and then no coercion needs to happen and a string is what he gets.

erayd commented 7 years ago

I guess I'm having trouble mustering any outrage over it, mainly because it's already happening in some form...

"Because we're already doing it" isn't synonymous with "we should be doing it" - IMO if we find a problem (and I feel that this one is quite serious), we should fix it, just like any other bug.

I believe we're walking over the types in the order that they're encountered, and the first one that can be coerced will be coerced, which will affect the following type's chances for doing further coercion.

Then we shouldn't be doing that. It's violating what the schema says is acceptable - if the schema says something is OK, then we should not be second-guessing it.

The nice thing about the idea is that I believe both of our use cases can be supported (if we stop looping over our types once something validates--really need to look at the code), at least in my month example. If I want my integer coercion to always trump the string, then I make sure it's first in the list, then things work the way I like. If @gtuccini wants it the other way, then he just puts string first, and then no coercion needs to happen and a string is what he gets.

This is just taking advantage of a bug to obtain intended behavior. It works, but there be dragons down that road - once you start doing it, then you can't ever fix stuff, because you don't know what behavior might be relying on that bug. It's better to fix the bug and implement the desired behavior properly.

shmax commented 7 years ago

Well, I'm not totally sure I agree yet that there is a bug. I don't have a huge sample set to work with here, or anything, but the way we have things now aligns with the two other coercive validation libraries I'm familiar with. Can we dial back the intensity and outrage a little bit and just sort of take it easy until we have a few more use cases from @gtuccini? It is still my day off, after all :)

erayd commented 7 years ago

Certainly. I feel strongly about it, but quite happy to let this sit for a while :-).

gtuccini commented 7 years ago

I can't provide a detailed use case right now. I noticed the "issue" by comparing the implementation of type coercion in this library with the implementation in ajv, which I'm using client side. I advanced my suggestion because I have a practical interest in the alignment of the two implementations, to avoid bugs, and -- from a theoretical point of view -- I just think that type coercion should be a (useful) last-resort option and shouldn't touch the already valid values. I want to stress, anyway, that I can live with the current implementation and that I don't consider it wrong, given that there is nothing in the standard about type coercion :)

shmax commented 7 years ago

I noticed the "issue" by comparing the implementation of type coercion in this library with the implementation in ajv, which I'm using client side

Ah, now that's interesting. I had thought that ajv uses our model. Do they do what you describe?

gtuccini commented 7 years ago

The documentation at https://github.com/epoberezkin/ajv/blob/master/COERCION.md says "Type coercion only happens if there is type keyword and if without coercion the validation would have failed.... If there are multiple types allowed in type keyword the coercion will only happen if none of the types match the data and some of the scalar types are present". I performed some tests and it seems to work in the way I described.
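As a sketch of the quoted rule (my own reading of it in Python, not ajv's code or this library's), the function below leaves a value alone if it already matches any allowed type and coerces only as a last resort. The names are hypothetical, and only strings and four scalar types are handled.

```python
import re

# Checks for "does this value already have this type?"
CHECKS = {
    "boolean": lambda v: isinstance(v, bool),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
    "null": lambda v: v is None,
}


def coerce_as_last_resort(value, types):
    """Coerce only when the value matches none of the allowed types."""
    # Step 1: an already-valid value is returned untouched.
    if any(t in CHECKS and CHECKS[t](value) for t in types):
        return value
    # Step 2: validation would fail, so try the obvious string mappings.
    if isinstance(value, str):
        for t in types:
            if t == "boolean" and value in ("true", "false"):
                return value == "true"
            if t == "integer" and re.fullmatch(r"-?[0-9]+", value):
                return int(value)
            if t == "null" and value == "":
                return None
    raise ValueError(f"{value!r} cannot be coerced to any of {types}")
```

Here "true" against ["boolean", "string"] stays a string because "string" already matches, whereas "true" against ["boolean"] alone is coerced to the boolean true.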

shmax commented 7 years ago

Yeah, I just tried it myself. Okay. I'm willing to hook it up, but first I'll have to see if I can talk @erayd into it (ha, ha)

shmax commented 7 years ago

Actually, @erayd, any interest in tackling this one? I'm willing, of course, but you seem to be pretty passionate about it, and I'm in the middle of Nioh.

shmax commented 7 years ago

I'm thinking it should be a 6.0 type of thing, since it is a behavioral change.

erayd commented 7 years ago

@shmax I'm not too fussed who does it, as long as we are on the same page. Quite happy to do the work if you'd like :-).

Just to make sure we are intending the same thing here, you are proposing that I write a PR to change the behavior of coercion, such that coercion only happens when validation would otherwise fail?

I'm thinking it should be a 6.0 type of thing, since it is a behavioral change.

I agree.

shmax commented 7 years ago

Just to make sure we are intending the same thing here, you are proposing that I write a PR to change the behavior of coercion, such that coercion only happens when validation would otherwise fail?

Yep. And if you're busy with something else (like the errors revamp) or something, no sweat, I'll do it, but I just thought you might enjoy this one, as it sounds kind of interesting.

shmax commented 7 years ago

And if you're feeling ambitious, you can think about filling in some of the other coercion types from ajv's grid that we don't currently do.

erayd commented 7 years ago

Sweet, I'll do it then :-). Will give me a chance to consider the coercion code in more depth - familiarising myself with the codebase is part of the reason why I wrote all those unit tests, and this would fit neatly into the same category.

And if you're feeling ambitious, you can think about filling in some of the other coercion types from ajv's grid that we don't currently do.

Sure.

And if you're busy with something else (like the errors revamp) or something...

I am in the middle of a PR to make this library do strict validation that properly conforms to the version of the spec being used (this is the first bit of that). However, I'm about ready to do something else for a while - I'm getting sick of reading the spec documentation to find discrepancies between versions, so tackling the coercion thing will be a nice change :-).

shmax commented 7 years ago

Groovy! Off you go, then. You have {time:"1"} hours

shmax commented 7 years ago

Oh, and here's a fiddle to play with if you want to check alignment with ajv: https://jsfiddle.net/8egbgv3b/

erayd commented 7 years ago

Thanks :-)

erayd commented 7 years ago

{time:"1"}

Syntax error, thankfully, so I can have my lunchbreak first while the clock is broken ;-).

shmax commented 7 years ago

It's a valid javascript object literal. Now get back to work.

erayd commented 7 years ago

Sigh... I knew there was a reason for that ominous countdown beeping...

shmax commented 7 years ago

💀

gtuccini commented 7 years ago

@erayd While you're at it, would you give some thought to the "typePreference" attribute you mentioned earlier? The "age/date" use case described by shmax is certainly relevant and it will no longer be possible to specify a preference through the order of the types. I'm a bit annoying, I know, but I'll gladly help if you wish :P

erayd commented 7 years ago

@gtuccini Definitely worth thinking about, I agree. Not in the same PR, but I'll see if I can come up with a nice way of achieving it.

gtuccini commented 7 years ago

Thanks, you are both my heroes. It's 2.15 am here in Italy, so I'll say bye for now :)

shmax commented 7 years ago

buona notte.