Archive StudentSubmissions before refresh - Githubissues

kippnorcal / google_classroom

Google Classroom Data Pipeline

GNU General Public License v3.0

23 stars 9 forks source link

Archive StudentSubmissions before refresh #91

Open dchess opened 4 years ago

dchess commented 4 years ago

If a student is removed from a course all of the submission data for that course is lost. We should create a snapshot of prior day data before truncating the table.

Possible logic:

Before truncating StudentSubmission table, append data to archive table
Refresh StudentSubmission table
Remove any matching assignment (by submission id) from archive table that exists in current table

This logic would also help maintain year over year data as well. This is potentially something we should consider for other data sources: teachers, students, courses, coursework, etc.

dchess commented 4 years ago

@zkagin Can we prioritize this piece of logic before moving forward more on the syncing logic?

zkagin commented 4 years ago

Yep. Just to confirm, the end result we want is an ever-growing list of deleted StudentSubmissions, correct? And built so that we can use this logic for any of the endpoints as needed?

dchess commented 4 years ago

@zkagin Yeah. I think we'd want to maintain a list of data that has previously come through so the truncation of the main table doesn't result in data loss. @denglender also let me know today that the Meets endpoint only stores the last 6 months of data. So we will want it for that as well. I think putting this logic in place for all endpoints makes sense as well since even a year-to-year rollover when the env var for start date is changed, would impact the historic data.

What are your thoughts on a separate archive table vs. removing truncation and using a process that appends and then deletes duplicate records (keeping the most recent)? That would eliminate the need to maintain more tables.

zkagin commented 4 years ago

Yep, the cleanest way is probably a soft delete pattern, where there's an additional "deleted" column. The downside is then we start needing to deal with schema migrations, since we are no longer dropping the table after each run. The first migration can be a simple if statement, but future ones probably should be handled more robustly with something like Alembic.

The diffing logic I sent in #98 would actually be pretty useful here. The steps would be to pull the latest data, diff it to the DB, change the relevant rows to deleted=True, and then append the new rows.

Let me know which approach you'd prefer.

zkagin commented 4 years ago

@dchess Just checking in on this again — let me know if you like the described approach or if you had something else in mind.

dchess commented 4 years ago

@zkagin Thinking about this more, I'm wondering if the deleted column is necessary? Shouldn't IDs be unique? Couldn't we append then remove duplicate IDs preserving the matching ID with the most recent updateTime?

zkagin commented 4 years ago

Yep, if you don't need to know which student submissions are from deleted / removed students, then we shouldn't need a delete column. That might be a short-term answer to it.

I think checking for differences and then doing adds/updates/deletes to the DB is better than appending and then deduping both from a stability and performance perspective (minimizing DB writes in both cases). I can start building off of #98 and use that to build this for StudentSubmissions.

dchess commented 4 years ago

@zkagin That makes sense to me (minimizing db writes) but I'm also concerned about memory limitations by keeping the diffing in local memory. I've been hitting some memory thresholds on the coursework and submissions endpoints lately.

If the diffing could be done at the batch level and minimized the dataframes held in memory before writing the inserts then I could see that approach working.

zkagin commented 4 years ago

Thinking more, you could probably infer that something is deleted by seeing how recent the updateTime is, but that feels pretty non-explicit if that information is relevant (for either StudentSubmission or for any other Endpoints we eventually want this functionality for). Either way, as a first-pass solution this should work without changing any DB schema.

zkagin commented 4 years ago

@dchess Are you seeing any performance issues when you run it on your own machine, or is the production environment more limited? I can try and simulate it on my own machine to make sure memory isn't an issue.

dchess commented 4 years ago

@zkagin I'm not sure its a delete in all cases. In some it's more about preserving previously pulled data (ie when the start date changes).

Memory has been an issue in production, but we are running slim VMs (2-4GB). Could be this job requires more resources.

zkagin commented 4 years ago

@dchess Makes sense then, let's do it without a delete, and can add additional information later if need be.

If memory is in an issue, we will need to handle it for writing back to GC anyways, so we can do batching by ID if need be in a pretty straightforward manner.

zkagin commented 4 years ago

@dchess I did some playing around with this, and it looks like pandas only really likes to write/append to SQL, not delete or update. So we have a few options with some tradeoffs:

Loop through individual items to do updates or deletes from pandas data (this might have some performance implications if there are a ton of items to update, but may not be bad at all if it is run more frequently).
Follow what you mentioned earlier by appending all the new data, and then execute a SQL command to do the deduping and deleting. I usually try to avoid writing SQL in code, but this might be a case to warrant it.
Hold all of the incoming data in memory, merge it with the existing table, drop the table, and then rewrite all of it together. This might have memory issues if it has to hold all submissions data in memory.
Some combination of 2 and 3, where you append all of the new data, then read it out and merge it, dedupe it, drop the table and rewrite it all together. This would similarly have memory issues if the 3rd option has it.

What do you think?

zkagin commented 4 years ago

@dchess One other idea to build on the 1st/2nd would be to write the updated/new data to a temp table, then write a shorter SQL query to upsert the data into the existing table.

dchess commented 4 years ago

@zkagin check the docs for sqlsorcery I have some examples of how to do updates/deletes by dipping into sqlalchemy functions.

zkagin commented 4 years ago

@dchess I looked around and saw options for dropping/truncating a table and for returning a SQLAlchemy table object to run commands, but couldn't find anything for doing batch updates/deletes from a dataframe. Could you point me to what you were thinking of?

dchess commented 4 years ago

@zkagin Updates: https://sqlsorcery.readthedocs.io/en/latest/cookbook/etl.html#update-table-values

Deletes: https://sqlsorcery.readthedocs.io/en/latest/cookbook/etl.html#delete-specific-records

zkagin commented 4 years ago

Ahh yes. Unfortunately I think we'd still have the problem that there's no way to summarize which rows need to be updated? If it is an arbitrary set of rows, it seems we still need to iterate through each item to be updated and call it separately.

It now occurs to me though that if we can batch the execution of it, this may be totally ok. I would just want to avoid having to write and commit each one of these separately.

Do you know if there's any other way to batch update arbitrary rows from dataframes?

dchess commented 4 years ago

@zkagin One approach would be to:

append new records to the table
query all records with with duplicate IDs and their updateTime
delete records with the MIN updateTime if ID in that dataset

Another approach:

API pulls GC data into df
Query table for any matching IDs
Delete those IDs from table
Insert all new records