ucdavisdatalab / workshop_intro_to_sql

Reader for the Intro to SQL workshop series.
https://ucdavisdatalab.github.io/workshop_intro_to_sql/

Rewrite for Library Checkouts Data #3

Open MicheleTobias opened 7 months ago

MicheleTobias commented 7 months ago

The library checkouts data that Nick formatted is available here. We need to:

MicheleTobias commented 7 months ago

For adding the context back in, here's the link to my original reader for this workshop.

nick-ulle commented 7 months ago

I did some exploration to try to figure out the best way to subset the data set:

This makes me think we should focus on subsetting the items table. The subjects column in the items table is almost 50% of the size of the data set. The value is generally a long list of subject categories for the item, which might be interesting for demonstrating text search with LIKE, but probably not relevant for other commands. The title and author columns also have searchable text, so maybe we should drop subjects.
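
For illustration, here is a minimal sketch of the kind of `LIKE` text search that the author column could still support once subjects is dropped. The table layout, the `item_id` column, and the three sample rows are assumptions made up for this example, not the actual schema of the checkouts data; it uses Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3

# Hypothetical miniature of the items table; the real column set may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (item_id INTEGER PRIMARY KEY, title TEXT, author TEXT)"
)
conn.executemany(
    "INSERT INTO items (title, author) VALUES (?, ?)",
    [
        ("Parable of the Sower", "Butler, Octavia E."),
        ("Kindred", "Butler, Octavia E."),
        ("The Dispossessed", "Le Guin, Ursula K."),
    ],
)

# In a LIKE pattern, % matches any run of characters, so 'Butler%'
# matches every author value that starts with "Butler".
rows = conn.execute(
    "SELECT title FROM items WHERE author LIKE ? ORDER BY title",
    ("Butler%",),
).fetchall()
butler_titles = [title for (title,) in rows]
print(butler_titles)  # ['Kindred', 'Parable of the Sower']
conn.close()
```

The same pattern would work against title, which is why subjects may not be needed just to teach `LIKE`.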

If we drop subjects as well as all inactive items, we can get down to ~25 MB. I think keeping the checkouts table intact is worth it because it covers Jan 2019 - Dec 2023 and you can definitely see the effect of the pandemic.

If the data set can be a little bit larger, say ~35 MB, we could:

@MicheleTobias let me know what you think would work best, whether one of these ideas or something else, and I'll update the R script to do it.

MicheleTobias commented 7 months ago

@nick-ulle Thanks for laying out some good options! I think ~35 MB is really reasonable. The subjects column is probably the least useful since it's a list. The author column can provide similar learning opportunities, so subjects isn't really needed. So that's my vote.

nick-ulle commented 7 months ago

I dropped subjects and kept a sample of 15,000 inactive items and it came out to 25 MB. The code is in R/ in this repo, and the file is on Google Drive. Let me know if there's anything else I can help with!
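
As a rough sketch of that subsetting idea (not the actual R script in this repo), the same steps can be expressed in SQL: copy every column except subjects, keep all active items, and add a random sample of inactive ones. The `active` flag, the other column names, the toy rows, and the sample size of 10 are all assumptions for illustration; the real subset used 15,000 inactive items.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical full items table: 5 active and 20 inactive toy rows.
conn.execute(
    "CREATE TABLE items (item_id INTEGER PRIMARY KEY, title TEXT, "
    "author TEXT, subjects TEXT, active INTEGER)"
)
conn.executemany(
    "INSERT INTO items (title, author, subjects, active) VALUES (?, ?, ?, ?)",
    [
        (f"Title {i}", f"Author {i}", "subject a; subject b", 1 if i < 5 else 0)
        for i in range(25)
    ],
)

N_INACTIVE = 10  # stand-in for the 15,000 mentioned in the thread

# The subset table simply omits the subjects column.
conn.execute(
    "CREATE TABLE items_small (item_id INTEGER PRIMARY KEY, title TEXT, "
    "author TEXT, active INTEGER)"
)
# Keep every active item.
conn.execute(
    "INSERT INTO items_small "
    "SELECT item_id, title, author, active FROM items WHERE active = 1"
)
# Add a random sample of inactive items.
conn.execute(
    "INSERT INTO items_small "
    "SELECT item_id, title, author, active FROM items "
    "WHERE active = 0 ORDER BY RANDOM() LIMIT ?",
    (N_INACTIVE,),
)

# Summarize the result: rows per active flag, and the remaining columns.
counts = dict(
    conn.execute("SELECT active, COUNT(*) FROM items_small GROUP BY active")
)
columns = [row[1] for row in conn.execute("PRAGMA table_info(items_small)")]
print(counts, columns)
conn.close()
```

`ORDER BY RANDOM() LIMIT n` is a simple way to sample in SQLite, but it is not reproducible across runs; a script meant to regenerate the same file would need a seeded sampler instead.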

MicheleTobias commented 7 months ago

Thanks, @nick-ulle! This is great! I don't think there will be any more tasks, but I'll let you know if that changes. The only thing left is for me to finish going through the hands-on section so that it works with the new data and to add some explanations.