Autograder holes - Githubissues

apsun commented 7 years ago

It is easy to trick the autograder into accepting an object that cheats the correctness checks. Here's two examples that bypass the scalar value check, tested on CS357 quiz 17:

# Type not being checked when is_number attribute is True
# https://github.com/inducer/relate/blob/master/course/page/code_feedback.py#L144
class Evil:
    def __init__(self):
        self.is_number = True
    def __abs__(self):
        return self
    def __rsub__(self, o):
        return self
    def __lt__(self, o):
        return True

answer = Evil()

# Same idea, since we subclassed one of the accepted types
# isinstance() will let us through. Might need to change the
# superclass based on what the accepted answer type is, since
# __rsub__ doesn't get called if they're the same type.
class Evil2(np.float64):
    def __abs__(self):
        return self
    def __rsub__(self, o):
        return self
    def __lt__(self, o):
        return True

answer = Evil2()

2017-03-13 01_41_16

Changing isinstance(x, (...)) to type(x) in [...] (and fixing the is_number check) would prevent this particular trick by disallowing subclasses, though I'm not sure if that would be secure either.

Sorry if you don't want this code floating around in public, let me know and I'll remove it ASAP.

fschr commented 7 years ago

@apsun After a lot of digging, it seems that it isn't possible to easily get around the built-in type function. I made a PR #296 which takes that approach to fixing this problem.

apsun commented 7 years ago

@fschr I believe all of the numpy types (np.int32, np.float32, etc) also need to be accounted for. AFAIK np.number is just an abstract base class.

inducer commented 7 years ago

cc @lukeolson @solomonik for entertainment (and full disclosure)

apsun commented 7 years ago

Another evil one I just found:

def hax(*args, **kwargs):
    return True
np.allclose = hax

Bypasses the check_numpy_array_allclose check. It seems that the code isn't being sandboxed at all. This one is particularly worrying:

import subprocess
print(subprocess.check_output(["ls", "/"]).decode("utf-8"))

inducer commented 7 years ago

Thanks for taking an interest, for responsibly disclosing this, and for even helping to fix the underlying issue! I'll comment more on the PR.

def hax(*args, **kwargs):
    return True
np.allclose = hax

Good one. Easy to fix though. We should just grab a reference to the functions we need to trust (like np.allclose) before any user code runs.

Speaking of potential autograder holes, the one thing I'm most concerned about is stack traversals up to the code that runs the autograder. That thing has the correct code floating around, which is a legitimate concern. If you'd like to hep fix that, i would totally even pay you hourly for that. :) The one solution I can think of is taking the grading into a separate process--but I wonder if we can afford the overhead.

This one:

import subprocess
print(subprocess.check_output(["ls", "/"]).decode("utf-8"))

looks concerning, but isn't. All the autograder code runs in short-lived Docker containers, so the real barrier are the Linux kernel's network/filesystem/process namespaces. That has occasional holes, too, but it's robust enough that people run businesses on it, so it's good enough for me. Plus if someone compromises the container, they get a pretty dinky VM somewhere in the VM farm with no access to anything privleged.

Fundamentally though, autograding a language like Python is a tradeoff between bullet-proof-ness and performance...

inducer commented 7 years ago

The one solution I can think of is taking the grading into a separate process--but I wonder if we can afford the overhead.

It's also problematic because it makes passing callables from user code to grading code basically impossible.

stack traversals up to the code that runs the autograder

Someone could also just hijack the socket by which the autograder runner talks to the Relate mothership. Or wholesale replace the bytecode of the grading code. Or...

The one reason I can sit here somewhat calmly and discuss all these holes is that they're all imperfect crimes. Every submission is logged, so even if we can't perfectly prevent it, we can at least reconstruct what happened after the fact.

apsun commented 7 years ago

Oh, cool, I wasn't aware that the VM wasn't persistent.

As for the function replacing thing, it should be possible to use importlib.reload to "undo" all user changes to the module, to avoid having to save each and every function (plus some numpy functions have internal dependencies, but I think anyone capable of bypassing that could probably do the homework themselves ;-p).

Sandboxed imports can be done with something like this:

def make_safe_environ(allowed_imports=["numpy", "sympy", "matplotlib"]):
    def safe_import(name, globals=None, locals=None, fromlist=(), level=0):
        if name not in allowed_imports:
            raise ImportError("Illegal import: %s" % name)
        return __import__(name, globals, locals, fromlist, level)

    new_builtins = dict(__builtins__.__dict__)
    new_builtins["__import__"] = safe_import
    environ = {"__builtins__": new_builtins}
    return environ

And using exec(code, make_safe_environ()) (surrounded with a try-except to prevent exception bubbling of course) to run the user submitted code. It's probably not perfect, but I imagine most people won't be able to get around it.

Edit: On second thought, numpy and the other dependencies probably rely on other modules too. This probably wouldn't work too well.

inducer / relate

Autograder holes #295