kippnorcal / google_classroom

Google Classroom Data Pipeline
GNU General Public License v3.0

Possible to get only new or updated data? #78

Closed: bkimup closed this issue 4 years ago

bkimup commented 4 years ago

Does anyone know if it's possible to get only new or updated data from the API? I have a similar script running to pull data, but StudentSubmissions has become a bottleneck: it holds a huge amount of data now, after about 2 months of remote learning. Curious to hear whether it might be possible to get new and/or recently updated data incrementally.

dchess commented 4 years ago

@bkimup Unfortunately, the Google Classroom API doesn't have a built-in query parameter for filtering by date. However, one idea we've considered is querying the coursework data for recently created assignments and then looping over the submissions endpoint by courseWorkId instead of courseId. See open issue #49. We've also considered further improving performance using threading, but we haven't hit a performance bottleneck that would justify it yet.
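The idea described above could be sketched roughly as follows. This is not code from this repo: `parse_timestamp`, `recent_coursework`, `list_all`, and `fetch_recent_submissions` are hypothetical helpers, and `service` is assumed to be a Classroom client built with google-api-python-client. The date filter runs client-side on each coursework item's `creationTime`.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an RFC 3339 UTC timestamp, e.g. '2020-05-01T12:00:00.123Z'."""
    # Only the first 19 characters (seconds resolution) are used, because the
    # number of fractional-second digits can vary.
    parsed = datetime.strptime(value[:19], "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def recent_coursework(coursework_items, since):
    """Keep only coursework created at or after `since` (client-side filter)."""
    return [cw for cw in coursework_items
            if parse_timestamp(cw["creationTime"]) >= since]


def list_all(request_factory, items_key):
    """Follow nextPageToken until every page has been read."""
    items, page_token = [], None
    while True:
        response = request_factory(page_token).execute()
        items.extend(response.get(items_key, []))
        page_token = response.get("nextPageToken")
        if not page_token:
            return items


def fetch_recent_submissions(service, course_id, since):
    """Pull submissions only for coursework created since `since`."""
    coursework = list_all(
        lambda token: service.courses().courseWork().list(
            courseId=course_id, pageToken=token),
        "courseWork",
    )
    submissions = []
    for cw in recent_coursework(coursework, since):
        submissions.extend(list_all(
            # cw_id is bound as a default so each lambda keeps its own id.
            lambda token, cw_id=cw["id"]: service.courses().courseWork()
            .studentSubmissions().list(
                courseId=course_id, courseWorkId=cw_id, pageToken=token),
            "studentSubmissions",
        ))
    return submissions
```

One caveat with this approach: it only picks up submissions for newly created assignments, so changes to submissions on older coursework would be missed unless those are refreshed some other way.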

One thing you can do to troubleshoot bottlenecks is to play around with the batch size. For our data volume, we've found a batch size of 120 to be optimal, but we have fewer than 7,000 students.
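To experiment with batch size in a standalone script, one option is to split the pending work into fixed-size chunks and time each run. A minimal sketch; `chunked` is a hypothetical helper, and 120 is simply the value reported above for one district's data volume, not a general recommendation.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Example: group course ids so that each group can go into one batch request.
course_ids = [f"course-{n}" for n in range(300)]
groups = list(chunked(course_ids, 120))
```

Each group would then be sent as its own batch, which makes it easy to try different sizes and compare total run time.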

bkimup commented 4 years ago

Got it. Batching isn't supported for StudentSubmissions, right? It takes our script about an hour just to pull StudentSubmissions for about 230 courses, so I'm just trying to think of what we can do to improve the speed as the dataset grows.

dchess commented 4 years ago

@bkimup That's surprising to me. StudentSubmission is batched. We have ~1300 courses and it finishes in <20 mins for us.

zkagin commented 4 years ago

@bkimup I have a few hypotheses for what might be happening. If you run it in debug mode (using the --debug flag), can you see if any of the problems below might be happening?

Also, I'd love to chat more about how you're using this library and if I can help contribute to other tech challenges you are solving. Drop me a note at the email in my profile (https://github.com/zkagin) if you'd like to chat!

bkimup commented 4 years ago

Ah I should have followed up after DC's comment - I'm not doing any batching at the moment. I didn't know about batching when I first wrote my script and didn't think it was applicable to StudentSubmissions after I learned about it. So your first assumption is correct - I'm iterating through all of our courses' courseworks, and that's why it's taking so long to get results. I haven't had time this week to refactor my script to use batching, and hope to do it soon. Does the response from a batch request look the same as a normal request?

zkagin commented 4 years ago

@bkimup To clarify, are you using the code in this repo or do you have your own script you are writing? This repo should automatically handle the batching for you now. If you're implementing it yourself, there are a few modifications you'll need to make. There's a good guide at the link below, and you can use the code in api.py for inspiration if you'd like.

https://developers.google.com/classroom/guides/batch

bkimup commented 4 years ago

Sorry for the confusion. I'm writing my own script and was trying to add batching to it, since I wouldn't be able to use this repo. I'm not sure if this is the right place to ask how batching works, but I'm having a little trouble figuring out the response from a batch request. Is it alright if I follow up with one or both of you via email?

zkagin commented 4 years ago

@bkimup I'd be happy to help. What is causing this repo to not be useful for you? Perhaps we can adapt or expand it to make things easier as well.

The batch documentation is pretty useful here, especially the code sample at the bottom: https://developers.google.com/classroom/guides/batch

Once batch.execute() is called, it blocks until all of the requests have returned. Each request calls the callback function as its response comes in. In the callback, request_id is whatever id was passed in with the request (if any), response is the same response you would get from a normal request, and exception holds any error. Once all of the requests have returned and each has called the callback function, the code continues from the line after batch.execute().
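The callback flow described above could look roughly like this with google-api-python-client. This is a sketch under assumptions, not the code from api.py: `fetch_submissions_batched` is a hypothetical helper, `service` is assumed to be a Classroom client built with `googleapiclient.discovery.build`, and pagination (`nextPageToken`) is left out to keep it short.

```python
def fetch_submissions_batched(service, course_ids):
    """Request submissions for several courses in one batch.

    Returns (results, errors), both keyed by the request_id given to the
    batch, which here is the course id.
    """
    results = {}
    errors = {}

    def callback(request_id, response, exception):
        # Called once per request; exactly one of response/exception is set.
        if exception is not None:
            errors[request_id] = exception
        else:
            results[request_id] = response.get("studentSubmissions", [])

    batch = service.new_batch_http_request(callback=callback)
    for course_id in course_ids:
        request = service.courses().courseWork().studentSubmissions().list(
            courseId=course_id,
            courseWorkId="-",  # "-" requests submissions for all coursework
        )
        batch.add(request, request_id=course_id)

    # Blocks until every request has returned and its callback has run.
    batch.execute()
    return results, errors
```

After the call, `results` holds the same `studentSubmissions` lists a normal request would return, and `errors` holds one exception per failed request, so failed course ids can be retried separately.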

The code in this repo shows some of the corner cases you may need to handle: https://github.com/kipp-bayarea/google_classroom/blob/master/api.py#L162

Feel free to contact me via the email in my profile if you'd like to follow up separately, or I am happy to continue answering questions via GitHub.