python-jsonschema / jsonschema

An implementation of the JSON Schema specification for Python
https://python-jsonschema.readthedocs.io
MIT License
4.63k stars 581 forks source link

`uniqueItems` check only validates the first two entries for types with a `__len__` implementation #866

Closed omaskery closed 3 years ago

omaskery commented 3 years ago

After updating to a newer version of jsonschema, we now find that it does not notice violations of the uniqueItems constraint on lists.

By stepping through the code, I think I've found the issue in this code (originally introduced in this commit in this PR, but now found here):

        sort = sorted(unbool(i) for i in container)
        sliced = itertools.islice(sort, 1, None)

        for i, j in zip(sort, sliced):
            return not _sequence_equal(i, j)

Notice that the for loop will always return on the first iteration, without examining subsequent values.

This code looks like it's supposed to be seeing if any two successive elements is the same, but with the return statement it would only detect unique constraint violations between the first two elements.

I think this bug has escaped notice so far for two reasons:

I wonder if this code is instead supposed to be:

         sort = sorted(unbool(i) for i in container)
         sliced = itertools.islice(sort, 1, None)

         for i, j in zip(sort, sliced):
-            return not _sequence_equal(i, j)
+            if _sequence_equal(i, j):
+                return False

For what it's worth, here's some noddy test code I put together to explore the issue:

def test(container):
    print(f"uniq({container}) = {uniq(container)}")

TEST_CASES = [
    [],
    [1],
    [1, 2],
    [1, 2, 3, 4, 5],
    [1, 2, 3, 3, 5],
    [1, 1, 3, 4, 5],

    ["a"],
    ["a", "b"],
    ["a", "b", "c", "d", "e"],
    ["a", "b", "c", "c", "e"],
    ["a", "a", "c", "d", "e"],
]

for test_case in TEST_CASES:
    test(test_case)

And here is the output I got:

uniq([]) = True
uniq([1]) = True
uniq([1, 2]) = True
uniq([1, 2, 3, 4, 5]) = True
uniq([1, 2, 3, 3, 5]) = False
uniq([1, 1, 3, 4, 5]) = False
uniq(['a']) = True
uniq(['a', 'b']) = True
uniq(['a', 'b', 'c', 'd', 'e']) = True
uniq(['a', 'b', 'c', 'c', 'e']) = True
uniq(['a', 'a', 'c', 'd', 'e']) = False
Julian commented 3 years ago

Thanks for the detailed report. Any chance you're interested in a PR? The first step would be adding some tests to the official test suite here: https://github.com/json-schema-org/JSON-Schema-Test-Suite/blob/master/tests/draft2020-12/uniqueItems.json (along with the 2019 draft).

After that your suggested change looks like what was intended as well.

DrGFreeman commented 3 years ago

@Julian, I'm interested in submitting a PR in the official test suite at https://github.com/json-schema-org/JSON-Schema-Test-Suite. I got changes to the tests that can trigger the current issue. My understanding is that I should make the changes in the tests of all drafts that support the uniqueItems attribute (all)? I'm asking because here you refer to the 2020-12 and 2019 drafts only.

Julian commented 3 years ago

Great! Yes please, I believe IIRC all drafts have uniqueItems

DrGFreeman commented 3 years ago

@Julian, I see that the changes from the test suite are already merged here. If it's OK with you, I'll work on a PR to fix the current issue.

Julian commented 3 years ago

Yep, sounds great thanks again!