voxpupuli / json-schema

Ruby JSON Schema Validator
MIT License
1.55k stars 243 forks source link

URIs are open()ed for no reason #148

Open pd opened 10 years ago

pd commented 10 years ago

I found this while working through more of the common test suite failures. The draft4/optional/format.json test currently pauses during a URI validation test because it's trying to make a request to load data from the URI. You can reproduce this with:

pry(main)> data = 'http://foo.bar/?baz=qux#quux'
=> "http://foo.bar/?baz=qux#quux"
pry(main)> schema = '{ "type": "string", "format": "uri" }'
=> "{ \"type\": \"string\", \"format\": \"uri\" }"
pry(main)> JSON::Validator.validate(schema, data)

The final line will eventually return true (which is correct), but if you wrap it in Timeout.timeout(15) { ... } or whatever, it reliably throws an exception.

I've tracked this down to one of these open() calls inside Validator#initialize_data, but haven't yet had a chance to dig in any further to figure out when those are necessary and when they aren't:

https://github.com/hoxworth/json-schema/blob/8daf3de35ab834bc9c2325a967be25322b909cdd/lib/json-schema/validator.rb#L584-L598

iainbeeston commented 10 years ago

I was looking at that code today (see #147).

When you run JSON::Validator.validate the data is passed to initialize_data. If your data is a string, it will assume that the string contains json (e.g. "{ \"type\": \"string\", \"format\": \"uri\" }") and uses a json parser to parse it into a hash. If that parsing throws an error (any kind of error at all...*) it stops and instead tries to parse the string as a uri, and loads data from that uri (using open()).

In your case it will try to load json from 'http://foo.bar/?bar=qux#quux'. When I run this in irb it does take a long time.

irb(main):001:0> require 'open-uri'
# => true
irb(main):002:0> open('http://foo.bar/?baz=qux#quux')
# looooong wait here, then...
# Errno::ETIMEDOUT: Operation timed out - connect(2) for "foo.bar" port 80
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/net/http.rb:879:in `initialize'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/net/http.rb:879:in `open'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/net/http.rb:879:in `block in connect'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/timeout.rb:76:in `timeout'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/net/http.rb:878:in `connect'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/net/http.rb:863:in `do_start'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/net/http.rb:852:in `start'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/open-uri.rb:313:in `open_http'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/open-uri.rb:724:in `buffer_open'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/open-uri.rb:210:in `block in open_loop'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/open-uri.rb:208:in `catch'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/open-uri.rb:208:in `open_loop'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/open-uri.rb:149:in `open_uri'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/open-uri.rb:704:in `open'
#   from /Users/iain/.rubies/ruby-2.1.3/lib/ruby/2.1.0/open-uri.rb:34:in `open'
#   from (irb):2
#   from /Users/iain/.rubies/ruby-2.1.3/bin/irb:11:in `<main>'

The uri could be changed to something else, but I'm not sure what to suggest without knowing what the test is doing. Or perhaps json-schema should set it's own timeout?

* Probably not ideal but I have plans for that

pd commented 10 years ago

It's not clear to me why initialize_data is trying to load remote data at all; shouldn't we only fetch a URI if there's a schema we're trying to load from it (eg, it was the value of a $ref property)?

json-schema definitely shouldn't make requests to any arbitrary URI that just-so-happens to be embedded in an object. For example, my common use-case with rack-schema is for validating API responses, which have a ton of URIs inside of them:

{
  "widget": { "manufacturer_id": 1, "size": 10, "price": "99.99" },
  "links": { "manufacturer": "https://some.api/manufacturers/1" }
}

I haven't had a chance to test locally, but I think that means we're actually hitting those endpoints during validation?

hoxworth commented 10 years ago

@pd We should most definitely only be hitting a URI if there's a schema trying to load from it in a $ref - if we are hitting URIs in the data, that's a definite bug and security concern. I'm inclined to think this is not occurring, because to do such would imply that we're checking essentially all values for URI presence; however, we can't write such a possibility off.

I'll take a look at this today as well.

hoxworth commented 10 years ago

By the way, autoloading of ref schemas is something I want to make optional in version 3 of this library.

iainbeeston commented 10 years ago

Oh I see your point.

I think you're misunderstanding here. Your schema definition says you expect a string, formatted as a uri. However, when you pass 'http://foo.bar/?baz=qux#quux' as data, json schema doesn't think you're passing it something to validate at all - it thinks your data is at that url, and tries to load it.

I suspect that if your data was {foo: 'http://foo.bar/?baz=qux#quux'} you'd get the behaviour you expect

iainbeeston commented 10 years ago

Basically, if you pass json schema a string, and it's not valid json (any arbitrary string is not necessarily valid json, to my knowledge - it must start with "{" and end with "}") json schema will assume it's a url, and try to load that. Your example isn't even hitting the validation code.

hoxworth commented 10 years ago

Ah, this makes more sense. Thanks for clarifying before I went on a huge witch hunt, @iainbeeston . Will monitor this before I dive into the code during the work day...

pd commented 10 years ago

The test that is triggering this is here: https://github.com/json-schema/JSON-Schema-Test-Suite/blob/develop/tests/draft4/optional/format.json#L23-L42

That entire file is based around the premise that the validator can be run on arbitrary JSON types; I think we have a lot of local tests that do the same -- eg, validate [] against {"type": "array", "minItems": 1} and such.

@iainbeeston's description makes sense and explains why my example above does not trigger a bunch of web requests. I think this is still a bug though; only schemas should cause the validator to (optionally) go fetch remote data, right?

Another example:

pry(main)> JSON::Validator.validate('{"format": "uri"}', 'https://api.github.com/users/pd')
=> true

If I understand this correctly, that triggers a GET to the Github API, then throw away the result and tell me that https://api.github.com/users/pd is a valid URI.

What's the use case for initialize_data ever calling open()?

iainbeeston commented 10 years ago

I imagine it's for convenience (@hoxworth?)

Right now, whether JSON schema can validate an arbitrary string depends on the json parser used. But I think this is a wider problem than just JSON schema itself. For example, JSON.parse('"foo"') raises an error for me (JSON.parse('["foo"]') and JSON.parse('{"foo": "bar"}') work). So if the JSON gem doesn't recognise a bare string as valid JSON then json schema is probably going to struggle.

hoxworth commented 10 years ago

Okay, there's a couple of things happening here...

As @iainbeeston points out, when the data is simply a bare string, it is not valid JSON as per the JSON spec. This library attempts to open the URI and use the data at the specified URI as the data to validate. If the data that is fetched is valid JSON and is parsable, the data is NOT discarded; otherwise, it is discarded and the original string is used to represent the JSON data.

In the case of the GitHub API fetch, the JSON returned actually does validate against the simple {"format":"uri"} schema, because the root element is an object, not a string. As per JSON Schema format validation rules, if an element is not of the type supported for a format, the validator should validate to "true".

The fact of the matter, however, is that we are attempting to validate invalid JSON against a JSON schema. I personally don't really like this, and think the common test suite should change and not expect validators to accept invalid JSON.

iainbeeston commented 10 years ago

That's a really interesting point. I've double checked the ietf spec and it's explicitly for validating json objects, not just arbitrary data. Which means that many of the common test suite tests are invalid (e.g. the format spec linked by @pd)

Does anyone want to raise the issue on the JSON-Schema-Test-Suite repo?

https://github.com/json-schema/JSON-Schema-Test-Suite/issues

hoxworth commented 10 years ago

@iainbeeston Yeah, I'll go open a ticket.

hoxworth commented 10 years ago

Issue opened on the common test suite: https://github.com/json-schema/JSON-Schema-Test-Suite/issues/64

fge commented 10 years ago

@iainbeeston

I've double checked the ietf spec and it's explicitly for validating json objects, not just arbitrary data

Are you talking about the JSON Schema spec? It does not say anything like this...

hoxworth commented 10 years ago

As @fge pointed out on the above issue, JSON Schema explicitly validates JSON values, not JSON texts. This only affects URIs at the present due to the magic resolution attempt, but is definitely something we need to support and fix.

(I still am not enamored with validating non-valid JSON texts, but whatever :smile: )

iainbeeston commented 10 years ago

@fge My mistake - you're right. I confused JSON data with JSON objects. If I read through it again and look up every definition:

  1. 'JSON Schema is a JSON media type for defining the structure of JSON data.'
  2. 'An instance is any JSON value. An instance may be described by one or more schemas. An instance may also be referred to as "JSON instance", or "JSON data".'
  3. 'The terms... "JSON value"... in this document are to be interpreted as defined in RFC 4627 [RFC4627].'
  4. 'A JSON value MUST be an object, array, number, or string, or one of the following three literal names: "false null true"' (in RFC4627)
iainbeeston commented 10 years ago

Also, (just for completeness, in case this ever helps anyone find the definition of what valid JSON is) RFC4627 has been superseded by RFC7159, which replaces point 4 with:

'A JSON value MUST be an object, array, number, or string, or one of the following three literal names: false null true. The literal names MUST be lowercase. No other literal names are allowed.'

pd commented 10 years ago

Possible option:

require 'json'
JSON.parse('1')
#=> JSON::ParserError: A JSON text must at least contain two octets!

JSON.parse('"http://foo.com"')
#=> JSON::ParserError: 757: unexpected token at '"http://foo.com"'

JSON.load('1') #=> 1
JSON.load('"http://foo.com"') #=> "http://foo.com"

require 'oj' # the only other json library i have installed atm
Oj.load('1') #=> 1
Oj.load('"http://foo.com"') #=> "http://foo.com"

But note that JSON.load has an explicit "BEWARE" comment! (edit: it looks like we could pass an explicit option to load and avoid the security issue -- http://docs.fedoraproject.org/en-US/Fedora_Security_Team/1/html-single/Secure_Ruby_Development_Guide/index.html#idm225463672800)

All of this is complicated by the validate API accepting both a string potentially containing valid JSON or an object that is the ruby equivalent of a potentially valid JSON object. eg, it's hard to know whether to treat a String as something that would validate against { "type": "string" } or something that we should first parse and then validate...

iainbeeston commented 10 years ago

I think that (maybe for v3) validate should assume it's a string to be validated.

I think if the user wants to load a url they should have to be explicit by using the uri hash parameter (eg. "uri: 'http://example.com/some.json'" argument). That parameter already exists, as a way of skipping the json parsing step and downloading the uri right away. Otherwise the string should be validated against the schema.=

hoxworth commented 10 years ago

Yes, currently we'll either break the API or the spec with any attempt to fix this issue. It's an unfortunate issue that most definitely needs to be addressed in v3, probably in the manner that @iainbeeston proposes above.

I'm beginning to think v3 is going to have to be a sooner-rather-than-later migration.

iainbeeston commented 10 years ago

While we’re on the subject of loading strings that happen to be a uri, it might be worth noting that if the string is interpreted (by URI.parse()) as a relative url then it’s turned into a file url. So for strings like ‘/foo/bar’, json-schema will try to load them from the filesystem (aside: I wonder if there’s potential there for an xss like attack if a server is validating json in http requests...)

pd commented 10 years ago

I think that (maybe for v3) validate should assume it's a string to be validated.

I was thinking basically the opposite; that we should expect to receive ruby objects for both schema and data. I typically use json-schema in rack apps, where I've often already got the JSON parsed, handled parse errors, etc. I wouldn't want to have to pay the cost of JSON parsing all over again each time I hit json-schema to validate the payload.

I think the problem is mostly that validate is overloaded, and we should just pick a default type for our inputs, then add an alternate method or options value to explicitly opt in to treating the input as a string or object or w/e. validate shouldn't have to apply heuristics, parse URIs, etc in order to guess what it should do with its inputs.

fge commented 10 years ago

Sorry to chime in, but do I understand there that "Ruby JSON" expects JSON Texts as defined by RFC 4627?

If yes, why not use an alternate JSON parsing library which parses JSON values instead?

iainbeeston commented 10 years ago

@pd I've been thinking about this, and now I agree completely. Ideally, we shouldn't try to parse json at all - there are already good libraries to do that - and just focus on the validation. This would remove a large amount of code, and any ruby developer that can call JSON::Validator.validate() can easily call JSON.parse() as well

iainbeeston commented 10 years ago

@fge Unfortunately all of the ruby json parsers that I'm aware of default to parsing json text, not json values.

However, I've just discovered that Ruby JSON has a "quirks mode" that will parse both json values and json text, e.g. JSON.parse('"hello world"', quirks_mode: true) == "hello world". This seems like it would be a better way of handling json parsing, in our case (although it's specific to JSON Ruby - yajl and multi json don't seem to have a quirks mode)

It's also interesting to see that JSON.load uses quirks mode by default and will load from any IO object, in a similar way to how json-schema is already able to load from a string, a file or a uri (which we turn into an io object using open()).

fge commented 10 years ago

@iainbeeston OK, I'll chime in again but something seems a bit off here; decomposing the flow, you have:

At least, that is how I view it. It looks like Ruby does things quite differently here...

iainbeeston commented 10 years ago

No, I think Ruby does it all as you suggested... It's just that in ruby, the default that has evolved is to parse json text, not just json values. The only way I've found (so far) to parse json values from a character stream is to use the ruby json parser, in quirks mode. Without quirks mode (ie. the default mode) it will raise an error if it tries to parse json values (as opposed to json text). Does that make sense? (Even if you disagree with that implementation?)

fge commented 10 years ago

@iainbeeston it makes sense and yes, I have trouble with this particular part:

the default that has evolved is to parse json text, not just json values

It was OK when RFC 4627 was the reference, but it isn't anymore ;)

pd commented 10 years ago

@iainbeeston I think the correct way for us to parse values is JSON.load('1', nil, :create_additions => false). We need to disable create_additions because it's a security issue with untrusted data:

[1] pry(main)> class Foo
  def self.json_create(data)
    puts "uhoh!"
    :hax
  end
end

[2] pry(main)> JSON.load('{"json_class": "Foo", "data": "hi mom"}')
uhoh!
=> :hax

[3] pry(main)> JSON.load('{"json_class": "Foo", "data": "hi mom"}', nil, :create_additions => false)
=> {"json_class"=>"Foo", "data"=>"hi mom"}

But we'll still have to figure out how this interacts with all the different MultiJson adapters. =\

I think you're right about just dropping support for parsing JSON, and we should only accept pre-loaded objects, so that it's the caller's responsibility to deal with the craziness! Which would mean this is finally unambiguous:

JSON::Validator.validate({ type: 'string' , format: 'uri' }, 'http://foo.com')
iainbeeston commented 10 years ago

@fge It seems like it's an open issue in Ruby JSON (https://github.com/flori/json/issues/206) but nobody is actively working on it

iainbeeston commented 10 years ago

@pd Apparently quirks mode does more weird stuff than just allowing top-level json values (see flori/json#206) - it sounds like what we're doing right now is probably closer to being correct (bizarrely)

kwstannard commented 8 months ago

I got alerted to this via a pen test.

We are using this library to validate user input against a schema. If the user puts in a URI as a value, then this will make a request to the URI.