hfurubotten / autograder

Automatic management and build tool for lab assignments. Moved to organization autograde: https://github.com/autograde

database inconsistency issues #71

Open meling opened 8 years ago

meling commented 8 years ago

It seems very easy to end up with a database that is inconsistent with the groups on github.

Can we avoid storing this stuff in the database, and only rely on git/github for storage? We can of course keep stuff cached in memory (or even on disk), but perhaps let github (or some other git storage) be the ground truth for data storage.

Idea: Instead of using a database (for some stuff), perhaps we could mirror the github repos locally to avoid the latency of going to github's data center. We should figure out how other organizations are doing this, e.g. golang.
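A minimal sketch of what such local mirroring could look like, assuming the `git` binary is installed and that plain `git clone --mirror` followed by periodic `git remote update --prune` is enough. The function names `mirrorArgs` and `syncMirror` are made up for illustration and are not part of autograder.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// mirrorArgs returns the git arguments that bring the local mirror in dir
// up to date with remoteURL. The first sync clones a bare mirror; later
// syncs fetch all refs and prune the ones deleted on the remote.
// (Hypothetical helper, not taken from autograder.)
func mirrorArgs(remoteURL, dir string, exists bool) []string {
	if !exists {
		return []string{"clone", "--mirror", remoteURL, dir}
	}
	return []string{"-C", dir, "remote", "update", "--prune"}
}

// syncMirror creates the mirror if dir does not exist yet and refreshes it
// otherwise. It shells out to git, so git must be on the PATH.
func syncMirror(remoteURL, dir string) error {
	_, err := os.Stat(dir)
	exists := err == nil
	cmd := exec.Command("git", mirrorArgs(remoteURL, dir, exists)...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: mirror <remote-url> <local-dir>")
		return
	}
	if err := syncMirror(os.Args[1], os.Args[2]); err != nil {
		fmt.Fprintln(os.Stderr, "sync failed:", err)
		os.Exit(1)
	}
}
```

Reads would then go to the local mirror, and GitHub is only contacted when the mirror is refreshed. How often to refresh, and how to handle private repos and credentials, is left open here.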

hfurubotten commented 8 years ago

I'm not sure this can be blamed on the database itself. The database just does what we tell it to do, so if we replace it with another storage solution, the problem will simply continue with the new one. It's better to concentrate on the code that makes it inconsistent than to make the database a scapegoat.

Where does this inconsistency present itself?

If we use GitHub itself to keep control over what is stored under a group object, there will be a lot of traffic to GitHub, which will use up the request quota we have with them.

However, it would indeed be good to research how other organizations with similar use cases have implemented this.

meling commented 8 years ago

The issue was discovered today when Eric was trying to pull code from the different groups to run his plagiarism checker, and it turned out that the autograder database had one set of groups known locally, while a different set of groups existed on github.

Like I mentioned above, the golang project uses github as a mirror for another git-based service running on some other server. Not sure exactly how it works, but it seems plausible to me that we could simply use github as the main data storage, and then mirror from it whenever there are changes. That way we only have one type of storage, namely git. It will require more local storage on autograder, but it shouldn't be too bad for most cases, I think. The issue is, how do we keep the two consistent, and how do we interact with the local git repos. I know there are some git APIs that we could use. Will need some more investigation though...

@tormoder do you know how the golang folks are doing their mirroring thing?

hfurubotten commented 8 years ago

Looks more like a problem with the migration tool, where the groups weren't properly mapped to the repos on GitHub.

Even if we mirror the git repos locally from GitHub, I believe there will be just as much splitting of information. For instance, how do we collect member information which is only stored at GitHub?

meling commented 8 years ago

Looks like others have used git as a database too:

https://joeyh.name/blog/entry/databranches/

hfurubotten commented 8 years ago

Looks like this solution will store all the data in a branch on the students' repos, which creates a risk that students alter this data themselves.

To get away from this, it needs to be stored only locally, and then it's just normal file storage, which serves the same function as the old storage solution we moved away from when we introduced the database.

meling commented 8 years ago

I'm thinking that we may need a database (be it a local git repo or some other (distributed) database) for storing private information that students shouldn't have access to. But the stuff that should be consistent with github should be stored on github and locally in a git repo. That way, we can also easily run the plagiarism checker on those repos.

hfurubotten commented 8 years ago

Sounds like we will then have different storage solutions for different information, which means we will quite easily lose track of what information is supposed to be stored where.

I believe it's better to look at the solution we already have and improve its weak points, rather than splitting the information. To keep this information up to date, we need to make sure it is actually updated when GitHub notifies us of changes on their side. We already receive every update anyone makes on GitHub, including the information that changed, and this just needs to be put to good use.
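A rough sketch of applying such a notification to local state. It assumes that groups correspond to GitHub teams and that the notification is GitHub's `membership` webhook event, reading only its `action`, `member.login` and `team.name` fields. `groupStore` is a hypothetical in-memory stand-in for the database, not autograder's actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// membershipEvent holds the few fields of a GitHub "membership" webhook
// payload that this sketch uses.
type membershipEvent struct {
	Action string `json:"action"` // "added" or "removed"
	Member struct {
		Login string `json:"login"`
	} `json:"member"`
	Team struct {
		Name string `json:"name"`
	} `json:"team"`
}

// groupStore maps a group (team) name to the set of member usernames.
// Hypothetical stand-in for the real database.
type groupStore map[string]map[string]bool

// apply updates the store from one raw webhook payload and rejects
// actions it does not know how to handle.
func (s groupStore) apply(payload []byte) error {
	var ev membershipEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		return err
	}
	switch ev.Action {
	case "added":
		if s[ev.Team.Name] == nil {
			s[ev.Team.Name] = map[string]bool{}
		}
		s[ev.Team.Name][ev.Member.Login] = true
	case "removed":
		delete(s[ev.Team.Name], ev.Member.Login)
	default:
		return fmt.Errorf("unhandled membership action %q", ev.Action)
	}
	return nil
}

func main() {
	store := groupStore{}
	payload := []byte(`{"action":"added","member":{"login":"alice"},"team":{"name":"group5"}}`)
	if err := store.apply(payload); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(store)
}
```

This only covers events that actually arrive. Anything that changes on GitHub while the receiver is down would still have to be caught some other way.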

meling commented 8 years ago

The problem could still happen if we forget to always check github before updating the database, because changes can be made on github during autograder downtime. In this case autograder won't get any notifications. It seems a bit fragile to me.

Of course, the same can happen to a local git repo, but such changes are simple mirroring operations from github to a local repo. These can be fixed manually without having to write a migration tool. When we make changes we must always make sure to do updates on github only.

Of course, we can have problems with reading old data that hasn't yet been mirrored.

At any rate this will need more thinking.

ericnorway commented 8 years ago

Here's a bit more information. The groups in DAT320 from this past semester are ok. The groups in DAT520 from the spring were ok in the spring, but now they do not match what is in GitHub. For example, I was in Group 5, but Autograder now shows me in Group 15.

Autograder shows the group numbers as 1-16 for DAT520. In GitHub the group numbers are 1-7, 10-13, 15-18, 20-21.

I noticed this yesterday when the anti-plagiarism checker was trying to pull repositories from GitHub which don't exist.
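The mismatch in group numbers reported above can be made concrete with a small set difference. This is only an illustrative sketch: the helper names `diffGroups` and `span` are made up, and the two lists are typed in by hand from the numbers in this comment rather than fetched from Autograder or GitHub.

```go
package main

import (
	"fmt"
	"sort"
)

// diffGroups returns the group numbers present only in local and only in
// remote, each sorted ascending. Inputs are assumed to have no duplicates.
func diffGroups(local, remote []int) (onlyLocal, onlyRemote []int) {
	inRemote := make(map[int]bool, len(remote))
	for _, g := range remote {
		inRemote[g] = true
	}
	inLocal := make(map[int]bool, len(local))
	for _, g := range local {
		inLocal[g] = true
		if !inRemote[g] {
			onlyLocal = append(onlyLocal, g)
		}
	}
	for _, g := range remote {
		if !inLocal[g] {
			onlyRemote = append(onlyRemote, g)
		}
	}
	sort.Ints(onlyLocal)
	sort.Ints(onlyRemote)
	return onlyLocal, onlyRemote
}

// span returns the integers lo..hi inclusive.
func span(lo, hi int) []int {
	out := make([]int, 0, hi-lo+1)
	for i := lo; i <= hi; i++ {
		out = append(out, i)
	}
	return out
}

func main() {
	local := span(1, 16) // DAT520 groups as shown by Autograder
	var remote []int     // DAT520 groups as they exist on GitHub
	for _, r := range [][2]int{{1, 7}, {10, 13}, {15, 18}, {20, 21}} {
		remote = append(remote, span(r[0], r[1])...)
	}
	onlyLocal, onlyRemote := diffGroups(local, remote)
	fmt.Println("only in Autograder:", onlyLocal) // [8 9 14]
	fmt.Println("only on GitHub:", onlyRemote)    // [17 18 20 21]
}
```

Groups 8, 9 and 14 exist only in Autograder, which would explain pulls against repositories that don't exist. A matching number does not guarantee matching members, though: Group 5 versus Group 15 above is a mismatch that a number-only comparison would not catch.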