snap-cloud / snapCloud

Official cloud backend and community site for the Snap! programming language
https://snap.berkeley.edu/
GNU General Public License v3.0

Event Logging API / Support #225

Open cycomachead opened 5 years ago

cycomachead commented 5 years ago

For research projects we should support logging events that happen in Snap!.

This is a mix of Snap! support and a back-end API that accepts log messages and places them somewhere.

Cloud Responsibilities

Some things I think we need on the back end, but this all needs to be scoped.

Architecture

I think it's important that this be separate from the main DB, and maybe the particular endpoint could bypass Lapis for perf reasons, if necessary.

I think one endpoint that we host, acting as a forwarding endpoint, is probably best, since it minimizes the total number of sites that data could be logged to. It would offer the most protection if the backend actually validated whether a message should be logged before forwarding it, though that's probably not necessary.

I suspect we have low enough volume that we have dozens of options for log storage... A separate pg instance is probably a decent option.
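
A minimal sketch of the "validate, then forward" idea, assuming events arrive as JSON objects. The field names (`project`, `type`), the size cap, and the `forward` callable are all hypothetical; nothing here is an existing snapCloud API.

```python
import json
from typing import Any, Callable, Dict, Set

# Hypothetical size cap for a single event; not a figure from this thread.
MAX_EVENT_BYTES = 16_384


def should_forward(event: Any, enabled_projects: Set[str]) -> bool:
    """Return True only for well-formed events from projects with logging enabled."""
    if not isinstance(event, dict):
        return False
    project = event.get("project")
    if not isinstance(project, str) or project not in enabled_projects:
        return False
    if not isinstance(event.get("type"), str):
        return False
    # Reject oversized payloads before they reach the log store.
    return len(json.dumps(event).encode("utf-8")) <= MAX_EVENT_BYTES


def handle_event(
    event: Any,
    enabled_projects: Set[str],
    forward: Callable[[Dict[str, Any]], None],
) -> bool:
    """Forward the event to the single configured sink if it passes validation."""
    if should_forward(event, enabled_projects):
        forward(event)
        return True
    return False
```

Here `forward` stands in for whatever sink is chosen (a separate pg instance, for example), so the validation logic stays independent of the storage decision.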

cycomachead commented 5 years ago

@thomaswp @brollb

Since you two have done some logging before -- I'm wondering if you could give some background on how much logging you have typically done/seen. How many events/sec would a typical student generate and how large would those events be?

Can be total ballpark figures... I'm just trying to think about if we can actually support half a dozen research projects all at once. 😄

thomaswp commented 5 years ago

In iSnap, a class of ~50 students working for ~5 weeks (6 assignments + a project) will generate ~500K logs, which amounts to 1GB uncompressed. As for frequency, that's configurable. I have it set to log at most once per 3s. It's a tradeoff between traffic and avoiding data loss if the browser is abruptly closed. Regardless, the number of actual rows is the same, and that's determined by how many edits students make per second. I think ~1/s on average is pretty normal.
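
A rough sketch of the "log at most once per 3s" batching described above, assuming events are buffered on the client and sent as one batch. The class and parameter names are hypothetical and this is not iSnap's actual implementation.

```python
import time
from typing import Any, Callable, Dict, List, Optional

Event = Dict[str, Any]


class ThrottledLogger:
    """Buffer events and send them in batches at most once per `min_interval` seconds."""

    def __init__(
        self,
        send: Callable[[List[Event]], None],
        min_interval: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._min_interval = min_interval
        self._clock = clock
        self._buffer: List[Event] = []
        self._last_flush: Optional[float] = None

    def log(self, event: Event) -> None:
        """Record one event; flush only if the interval has elapsed."""
        self._buffer.append(event)
        now = self._clock()
        if self._last_flush is None or now - self._last_flush >= self._min_interval:
            self.flush()

    def flush(self) -> None:
        """Send everything buffered so far as a single batch."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._send(batch)
        self._last_flush = self._clock()
```

Every event still becomes one row, so the interval changes only the number of requests. A longer interval means less traffic, but more buffered events are lost if the browser closes before the next flush. A real client would also flush on a timer and on page unload, which this sketch leaves out.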

A few things to keep in mind there:

thomaswp commented 5 years ago

P.S. I believe the Blackbox dataset, which has been logging way more data than Snap likely will for the past 5 years, is only at ~2TB. So if you're just logging source (no media), that's pretty cheap, relatively speaking.
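
For a back-of-the-envelope feel, the iSnap figures quoted earlier in the thread (~50 students, ~500K logs, ~1GB uncompressed) can be scaled linearly. This sketch assumes 1GB = 10^9 bytes and that usage scales with student count; the function name and the 1,000-student example are hypothetical.

```python
from typing import Dict


def storage_estimate(
    students: int,
    class_size: int = 50,                # iSnap class size quoted in the thread
    class_logs: int = 500_000,           # logs generated by that class
    class_bytes: int = 1_000_000_000,    # ~1GB uncompressed, taken as 10^9 bytes
) -> Dict[str, float]:
    """Scale the per-class iSnap figures linearly to `students` students."""
    bytes_per_log = class_bytes / class_logs
    logs_per_student = class_logs / class_size
    bytes_per_student = class_bytes / class_size
    return {
        "bytes_per_log": bytes_per_log,
        "logs_per_student": logs_per_student,
        "total_gb": students * bytes_per_student / 1e9,
    }


estimate = storage_estimate(students=1000)
print(estimate)
```

Under these assumptions each log is about 2KB and each student produces about 20MB over a ~5-week course, so 1,000 students would come to roughly 20GB uncompressed.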

thomaswp commented 5 years ago

One more note: If you're curious about how to log your data (what to include, how to export it, etc.), I've been part of a sizeable group of researchers working to develop a standard called ProgSnap2. See the:

cycomachead commented 5 years ago

Thank you! For a single class that’s manageable. Even scaled up, 1/s is a manageable load for us, though data storage is what gets tricky (especially since there’s essentially 0 funding right now).

But, I think we can do something pretty easy with a dead simple API and S3. I’m definitely looking into ProgSnap2. Certainly it would be nice to have a common format.

Do you ever need to check whether to enable tracking on a per-user basis? Or if people opt out, do you just not use their data?

thomaswp commented 5 years ago

We usually just remove user data after the fact if they opt out (in part because users can withdraw consent later if they want). I imagine it would be pretty easy though - just have a client- (and probably server-) side check to make sure the user has consented before logging.
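
Both approaches mentioned above (checking consent before logging, and removing a user's data after the fact) could look roughly like this on the server side. The in-memory `store` list and the `user` field are hypothetical stand-ins for whatever storage and event schema end up being used.

```python
from typing import Any, Dict, List, Set

Event = Dict[str, Any]


def record_event(store: List[Event], event: Event, consented: Set[str]) -> bool:
    """Append the event only if its user has consented; return whether it was stored."""
    if event.get("user") not in consented:
        return False
    store.append(event)
    return True


def purge_user(store: List[Event], user: str) -> int:
    """Remove all stored events for a user who withdrew consent; return the count removed."""
    kept = [e for e in store if e.get("user") != user]
    removed = len(store) - len(kept)
    store[:] = kept  # mutate in place so callers holding `store` see the change
    return removed
```

Keeping `purge_user` around even when the up-front check exists covers the case where a user withdraws consent later.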

cycomachead commented 5 years ago

@emansishah

cycomachead commented 5 years ago

Random idea: If logging is a flag on a project, remixing the project should propagate the flag. (Could also be in the XML...)
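
As a sketch of that idea, assuming the flag lives on a project record (the `Project` fields and the `remix` helper are hypothetical, not the actual snapCloud schema):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Project:
    name: str
    owner: str
    logging_enabled: bool = False  # hypothetical per-project logging flag


def remix(project: Project, new_owner: str) -> Project:
    """Copy a project for a new owner; fields not overridden, including the flag, carry over."""
    return replace(project, owner=new_owner, name=f"{project.name} (remix)")
```

Because `replace` copies every field that is not explicitly overridden, the remix inherits `logging_enabled` without any special-case code.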

brollb commented 5 years ago

Yeah, as @thomaswp said, we log edits in NetsBlox so they are less redundant than full snapshots but would need to be reconstructed if analyzing arbitrary "snapshots". Another perk is that edits are saved when they occur (rather than on a standard interval). Replay data is actually saved in the project xml as well as on the server (required for collaborative editing). Saving the replay data in the project can actually be disabled in the project settings. It is nice to have the creation data stored within the project as it doesn't add any complexity for project submission, etc. I could imagine making a teacher dashboard which takes the submitted student project and enables them to easily inspect previous versions or whatever aspect of the project creation they care about.

That said, it is also worth thinking about what actions you would like to log. About a year ago, we added a number of other user actions to our logging including green flag clicks, executing individual blocks, etc (https://github.com/NetsBlox/Snap--Build-Your-Own-Blocks/issues/429).

I still have to look at ProgSnap2 in more detail but I am generally a fan of developing a common spec :)

brollb commented 5 years ago

Another perk of saving the edits in the project is that a flaky internet connection won't result in missing edits on the server: even if some edits occur while the connection is down, they will be saved as long as the project is successfully saved.
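
To illustrate what reconstructing a "snapshot" from logged edits might involve, here is a minimal replay sketch. The edit format (`time`, `op`, `id`, `value`) is invented for illustration and is not the NetsBlox replay format.

```python
from typing import Any, Dict, Iterable, Optional

Edit = Dict[str, Any]


def replay(edits: Iterable[Edit], until: Optional[float] = None) -> Dict[str, Any]:
    """Rebuild project state by applying edits in time order, optionally stopping at `until`."""
    state: Dict[str, Any] = {}
    for edit in sorted(edits, key=lambda e: e["time"]):
        if until is not None and edit["time"] > until:
            break
        if edit["op"] == "set":
            state[edit["id"]] = edit["value"]
        elif edit["op"] == "remove":
            state.pop(edit["id"], None)
        else:
            raise ValueError(f"unknown op: {edit['op']!r}")
    return state


edits = [
    {"time": 1.0, "op": "set", "id": "block1", "value": "move 10 steps"},
    {"time": 2.0, "op": "set", "id": "block2", "value": "turn 15 degrees"},
    {"time": 3.0, "op": "remove", "id": "block1"},
]
snapshot_at_2 = replay(edits, until=2.0)
final_state = replay(edits)
```

Here `snapshot_at_2` contains both blocks, while `final_state` contains only `block2`. Any intermediate version can be recovered this way, at the cost of replaying the log.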