pwncollege / dojo

Infrastructure powering the pwn.college dojo
https://pwn.college
BSD 2-Clause "Simplified" License

Private Dojo Challenges #44

Closed ConnorNelson closed 1 year ago

ConnorNelson commented 2 years ago

Background

Currently pwn.college hosts a bunch of challenges, primarily for the purpose of running ASU's CSE 466. Because it is open to the world, non-ASU and non-466 students can work through the content as well, which is great.

The next evolution of this is Private Dojos (#43). This feature has a number of use cases, but is primarily designed to allow someone to run their own course using the CSE 466 material. In creating a private dojo, users can choose to use only a subset of the challenges, and can also have isolated scoreboards. This is great because it means that users running their own course can very clearly communicate what challenges should be completed. Furthermore a private scoreboard easily allows for internal competition, rather than being lost away in the global scoreboard. As another use case, security clubs may find benefit in using private dojos for the same reasons.

At its core, pwn.college solves two separate problems.

The first is the course material. Designing hundreds of challenges that progressively build on each other in a way that teaches complex technical ideas is a hard problem. Already it has taken multiple years of iterative refinement: coming up with clever ways to automatically demonstrate what is going on (e.g. dumping memory addresses, the stack, rop chains, tcache, etc), and very slowly building out a concept one piece at a time (e.g. overflow a win variable -> overflow saved rip to a win function -> introduce PIE and do a partial saved rip overwrite -> introduce a canary that the challenge automatically leaks -> introduce a basic primitive for leaking the canary yourself -> etc). If the goal is education, collecting a bunch of random CTF challenges and throwing them at a novice student doesn't seem to be the best path forward (at least, for most people). Designing educational challenges is very useful because it allows students to slowly build up concepts without any big leaps that require tons of external educational resources, all set up in a nicely packaged way. It also makes teaching much more scalable (fewer big leaps = fewer questions).

All that being said, CTF challenges still seem to be incredibly useful for learning: there is massive value in tackling problems that don't hold your hand and require big leaps, in a way that more closely resembles "real" challenges. The hope is that our educational challenges create a strong foundation from which harder challenges become more approachable and educationally valuable.

The second is infrastructure. In the simplest case, a course could be run by simply bundling up a bunch of binaries, letting students download and work on them, and then having them submit writeups on what they did. Not only would this be lame, but it would introduce unnecessary friction into the learning process. Students would need to set up a Linux environment and deal with the numerous problems a novice faces there (set up some laggy VM, figure out how to enable Hyper-V, hope they don't destroy the VM running some bad command that breaks apt, waste time making their VM look pretty, panic when they go on a trip without the VM installed on their laptop, watch their PC prepare for liftoff running something computationally expensive, etc), all so that they can begin learning. Students should get comfortable with their environment, but that is an entirely separate issue that has no place getting in the way of what we are trying to teach.

To that end, pwn.college has a fully tooled-out environment, with persistent data and the challenge ready to run, that students can just start, SSH into (or even access via VS Code in their browser), solve, and submit the flag. You literally just need a browser. The goal is to minimize friction between the student and the concept. If the challenge requires an isolated kernel (because it is a kernel challenge and we don't want to destroy the host machine), it should be (and is) as simple as running `vm connect` to be working in that isolated kernel environment.

As it stands right now, users can:

Idea

Why not let users use their course material on our hosted infrastructure?

The CSE 466 course material is just the beginning. In an ideal world there would be intro CS challenges, OS challenges, other security challenges (web/crypto/etc), etc. As a start to developing all of that material, it would be incredibly useful for users to not need to manage their own server and infrastructure upgrades, but instead create some challenges and load them straight into https://dojo.pwn.college/, immediately available for users to start working on. In much the same way that we don't want infrastructure to get in the way of users working on challenges, we also don't want infrastructure to get in the way of users creating challenges.

Brainstorming

There are two problems:

From the CTFd perspective, we have these data-components:

Almost certainly the correct solution to ingesting/managing challenges is to have users supply a GitHub URL, and then automatically pull everything from there (challenge binaries, any metadata). We could also have some sort of GitHub Actions integration that lets commits/releases/whatever automatically sync updates. Ideally, from a challenge creator's perspective, it should be as simple as registering a GitHub URL with the dojo; committing to that repo does the rest.

Statically, we could ingest the GitHub URL, clone it, parse some metadata, create a bunch of corresponding Challenge/Flag objects, and then let private dojo configs reference those challenges. They would have unique challenge/flag ids, but we would still need a way for users to reference them beyond just category/name (because there could be multiple challenges with the same category/name). The simplest solution is probably hijacking the currently unused description field and storing the GitHub URL there. Then users can uniquely reference these challenges in their private dojo config without knowing the challenge id.
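As a sketch of that referencing scheme (helper names are hypothetical, not part of the dojo codebase), the GitHub URL plus category/name could be combined into a single round-trippable reference string, so a private dojo config can name a challenge without knowing its database id:

```python
# Hypothetical helpers for a globally unique challenge reference of the
# form "<github-url>#<category>/<name>" -- one way of realizing the idea
# of storing the GitHub URL in the otherwise-unused description field.

def make_challenge_ref(repo_url: str, category: str, name: str) -> str:
    """Combine a repo URL with category/name into one unique reference."""
    return f"{repo_url.rstrip('/')}#{category}/{name}"

def parse_challenge_ref(ref: str) -> tuple[str, str, str]:
    """Split a reference back into (repo_url, category, name)."""
    repo_url, _, path = ref.partition("#")
    category, _, name = path.partition("/")
    return repo_url, category, name
```

Two repos can then both ship a `pwn/level1` without colliding, since the repo URL disambiguates them.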

The issue is what happens when we've already ingested challenges, but we want to sync because some challenges have been modified/created/deleted.

In the case of creating a new challenge, it's probably simple enough to just create the new corresponding Challenge/Flag objects to make the challenge exist. That being said, we need a way of distinguishing a new challenge from an already existing, unmodified one. Metadata in the repo could explicitly track versioning info, or we could check the git history for updates to the associated files.
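One way to sketch that detection (an assumption, not how the dojo actually does it): hash each challenge's files at sync time and diff against the hashes recorded at the previous sync, which cleanly separates created, modified, deleted, and unchanged challenges:

```python
import hashlib

def challenge_digest(files: dict[str, bytes]) -> str:
    """Stable digest over a challenge's files (path -> contents)."""
    h = hashlib.sha256()
    for path in sorted(files):  # sort so file ordering can't change the digest
        h.update(path.encode())
        h.update(b"\0")
        h.update(files[path])
        h.update(b"\0")
    return h.hexdigest()

def classify(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Diff {challenge id -> digest} maps from the last sync vs. this one."""
    return {
        "created":   [c for c in new if c not in old],
        "deleted":   [c for c in old if c not in new],
        "modified":  [c for c in new if c in old and old[c] != new[c]],
        "unchanged": [c for c in new if c in old and old[c] == new[c]],
    }
```

Checking the git history would work too; content hashing just has the advantage of being independent of how the commits were structured.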

For deletion, in some cases it might make sense to delete the challenge entirely from the database, wiping all the solves associated with it. In reality, we should probably stay append-only so we don't lose any data when a user accidentally deletes a challenge. It would probably make more sense to just mark the challenge as hidden.

Challenge modifications are especially tricky. We should probably support both a modification that keeps the associated solves and one that removes them. Again, we probably never want to actually remove associated solves (for accidental data-loss reasons), but we could, for instance, set the old challenge to hidden and create a new challenge.
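The append-only hide-old/create-new approach could look roughly like this (a minimal sketch; the record fields and the `@vN` id scheme are made up for illustration):

```python
def apply_sync(challenges: list[dict], changes: dict[str, list[str]]) -> list[dict]:
    """Append-only sync: deletions and breaking modifications hide the old
    record; a modified challenge gets a fresh record, so old solves survive."""
    by_id = {c["id"]: c for c in challenges}
    for cid in changes.get("deleted", []):
        by_id[cid]["hidden"] = True          # never actually drop rows
    for cid in changes.get("modified", []):
        old = by_id[cid]
        old["hidden"] = True                 # retire the old version...
        new_id = f"{cid}@v{old.get('version', 0) + 1}"
        by_id[new_id] = {                    # ...and add the replacement
            "id": new_id,
            "hidden": False,
            "version": old.get("version", 0) + 1,
        }
    for cid in changes.get("created", []):
        by_id[cid] = {"id": cid, "hidden": False, "version": 0}
    return list(by_id.values())
```

Nothing is ever deleted here, so an accidental repo change is always recoverable by un-hiding the old record.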

One way of accomplishing this might be the following git file system layout:

- Category1/
  - Level1/
    - non_instanced_challenge_file1
  - Level2/
    - non_instanced_challenge_file1
- Category2/
  - Level1/
    - non_instanced_challenge_file1
    - .version
- Category3/
  - Level1/
    - non_instanced_challenge_file1
    - .instanced/
      - Instance1/
      - Instance2/
    - .version
  - Level2/
    - non_instanced_challenge_file1
    - .version
  - Level3/
    - non_instanced_challenge_file1

Making changes to the .version file will release a new version of the challenge that resets solves. We could also allow people to revert back to an older .version to restore those solves. Implicitly, the version would be 0.0.0. Currently this would associate a challenge to some Repo-Category-Name-Version. You might want to be able to migrate repos/categories/names while maintaining solves. This is a pretty niche use case, but probably something that would be nice to support eventually. For example, what if you want to insert a new level between Level1 and Level2? This will require some thought.
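A sketch of discovering challenges from that layout (treating a missing .version as the implicit 0.0.0 described above; the function and field names are hypothetical):

```python
from pathlib import Path

def discover_challenges(repo_root: Path) -> list[dict]:
    """Walk Category*/Level*/ directories, reading the optional .version
    file and any .instanced/ subdirectories."""
    challenges = []
    for category in sorted(p for p in repo_root.iterdir() if p.is_dir()):
        for level in sorted(p for p in category.iterdir() if p.is_dir()):
            version_file = level / ".version"
            instanced = level / ".instanced"
            challenges.append({
                "category": category.name,
                "name": level.name,
                # no .version file -> implicit version 0.0.0
                "version": (version_file.read_text().strip()
                            if version_file.exists() else "0.0.0"),
                "instances": (sorted(p.name for p in instanced.iterdir())
                              if instanced.is_dir() else []),
                "files": sorted(p.name for p in level.iterdir()
                                if p.is_file() and p.name != ".version"),
            })
    return challenges
```

Each discovered challenge then maps onto a Repo-Category-Name-Version tuple for the sync logic to compare against what is already in the database.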

ConnorNelson commented 2 years ago

We might want to not implicitly version to 0.0.0, but instead make it so no version file = disabled, as a way of being able to store challenges which aren't yet ready / need updating.

Probably when performing a sync action, we should have some way of specifying a branch (maybe also a commit hash?). It's possible we might also want to be able to change the specified branch/hash while retaining solves.

We should also make sure "Challenge Sources" are a totally distinct concept from a "Private Dojo". We can then have a private dojo's configuration specify challenges which may come from different challenge sources.

ConnorNelson commented 2 years ago

Another random thought on the idea of reliability: if I was going to use some platform to help run a course, it would be nice to know that an entire semester's worth of critical data for grading doesn't just vanish. This would probably be my biggest concern.

To address that, we should probably do daily database backups. It's not the absolute end of the world if home directories get lost for some reason, or challenge data (which is in git), or random logged data. It would be a pretty huge disaster if solve data was lost. We don't have any record of this sort of thing ever happening, but it's probably a good idea to get in front of it.

However, we shouldn't just expect people teaching a course to optimistically rely on us never to let this happen. We should probably add endpoints that allow for pulling down (public) solve data. In theory, everything needed to grade a course is already public data (minus figuring out how to match users -> students, but that's a separate issue), but it would help if we had endpoints and GitHub Actions which could pull down all the relevant data daily to some course repo (a list of <user id, challenge identifier, solve time>). This of course doesn't address reliability from a downtime perspective (though our uptime record does!), but it definitely offers peace of mind that critical data doesn't vanish forever out of nowhere.
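The daily pull could then be little more than rendering those tuples to CSV for the course repo (the endpoint shown in the comment is hypothetical, not an existing API):

```python
import csv, io

def solves_to_csv(solves: list[dict]) -> str:
    """Render solve records as CSV: <user id, challenge identifier, solve time>."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["user_id", "challenge", "solved_at"])
    for solve in solves:
        writer.writerow([solve["user_id"], solve["challenge"], solve["solved_at"]])
    return out.getvalue()

# A daily GitHub Action might feed this from a (hypothetical) endpoint:
#   solves = requests.get("https://dojo.pwn.college/api/v1/solves?...").json()
# and commit the resulting CSV to the course repo for safekeeping.
```

That way an instructor's grading data survives even a total loss on our side.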

ConnorNelson commented 1 year ago

We're doing this now, it's cool.