MicheleTobias opened 7 months ago
For adding the context back in, here's the link to my original reader for this workshop.
I did some exploration to try to figure out the best way to subset the data set:

`items` (103.4 MB). The `items` data frame by checkout year (possibly with duplicates):

```
2019      2020      2021      2022      2023
"43.9 Mb" "16.8 Mb" "16.5 Mb" "27.8 Mb" "30 Mb"
```

The `items` data frame columns (above 2 MB):

```
title     author    subjects  publisher publication_place
"17.5 Mb" "8.5 Mb"  "54.4 Mb" "4 Mb"    "2.4 Mb"
```

`users` (0.1 MB), all with checkouts. `checkouts` (13.1 MB) spanning 2019-01-02 to 2023-12-28.
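The per-column sizes above were presumably measured in R (e.g. with `object.size`); as a rough sketch of the same kind of check, here is a pandas equivalent on a toy frame (the titles and subjects are made up):

```python
import pandas as pd

# Toy stand-in for the items table; values are invented for illustration.
items = pd.DataFrame({
    "title": ["A Brief History of Time", "Cosmos"],
    "subjects": ["Cosmology; Physics", "Astronomy; Cosmology"],
})

# deep=True counts the actual string contents, not just pointer sizes,
# which is what makes a long-text column like subjects show up as large.
sizes = items.memory_usage(deep=True)
print(sizes)  # bytes per column
```

This is how you would confirm that a single free-text column dominates the data set's footprint.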
This makes me think we should focus on subsetting the `items` table. The `subjects` column in the `items` table is almost 50% of the size of the data set. The value is generally a long list of subject categories for the item, which might be interesting for demonstrating text search with `LIKE`, but probably not relevant for other commands. The `title` and `author` columns also have searchable text, so maybe we should drop `subjects`.

If we drop `subjects` as well as all inactive items, we can get down to ~25 MB. I think keeping the `checkouts` table intact is worth it because it covers Jan 2019 - Dec 2023, and you can definitely see the effect of the pandemic.
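For reference, the kind of `LIKE` search the `subjects` column would support looks like this; a minimal sketch using an in-memory SQLite database, with made-up titles and subject lists:

```python
import sqlite3

# Hypothetical items table; the rows are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (title TEXT, subjects TEXT)")
con.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [
        ("A Brief History of Time", "Cosmology; Physics; Popular science"),
        ("The Selfish Gene", "Evolution; Biology"),
        ("Cosmos", "Astronomy; Cosmology"),
    ],
)

# Substring match against the semicolon-separated subject list.
rows = con.execute(
    "SELECT title FROM items WHERE subjects LIKE '%Cosmology%'"
).fetchall()
print([title for (title,) in rows])
# → ['A Brief History of Time', 'Cosmos']
```

The same pattern works against `title` or `author`, which is why dropping `subjects` doesn't lose the teaching opportunity.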
If the data set can be a little bit larger, say ~35 MB, we could:

* keep the `subjects` column

@MicheleTobias let me know what you think would work best, whether one of these ideas or something else, and I'll update the R script to do it.
@nick-ulle Thanks for laying out some good options! I think ~35 MB is really reasonable. The `subjects` column is probably the least useful since it's a list. The `author` column can provide similar learning opportunities, so `subjects` isn't really needed. So that's my vote.
I dropped `subjects` and kept a sample of 15,000 inactive items, and it came out to 25 MB. The code is in `R/` in this repo, and the file is on Google Drive. Let me know if there's anything else I can help with!
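The actual subsetting code is the R script in this repo; as a sketch of the same idea in pandas (the `active` flag and column names here are assumptions, and the sample size is shrunk for the toy data):

```python
import pandas as pd

# Toy stand-in for the items table; 100 rows, half marked inactive.
items = pd.DataFrame({
    "title": [f"Book {i}" for i in range(100)],
    "subjects": ["Fiction; Example"] * 100,
    "active": [i % 2 == 0 for i in range(100)],
})

# Drop the large subjects column entirely.
items = items.drop(columns=["subjects"])

# Keep every active item plus a fixed-size random sample of inactive
# items (15,000 in the real data; 10 here for the toy frame).
active = items[items["active"]]
inactive_sample = items[~items["active"]].sample(n=10, random_state=42)
subset = pd.concat([active, inactive_sample])
print(len(subset))  # 50 active + 10 sampled inactive = 60
```

Keeping a small inactive sample preserves the ability to demonstrate filtering on that flag without carrying the full table.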
Thanks, @nick-ulle! This is great! I don't think there will be any more tasks, but I'll let you know if that changes. The only thing left is for me to finish going through the hands-on section to work with the new data and add some explanations.
The library checkouts data that Nick formatted is available here. We need to: