queens-qmulus / qu-scrapers

Qmulus Data Scrapers
https://qmulus.io
MIT License

Consider Using Git LFS for Large Data Dumps #13

Open alexadusei opened 4 years ago

alexadusei commented 4 years ago

I'm running into issues pushing/pulling large files (over 10MB) to and from a regular GitHub repository (our news and courses scrapers are the culprits here). Rather than dealing with the Git Database API, we should maybe consider Git Large File Storage (LFS) instead. The main caveat is the additional overhead.
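
For reference, a quick pre-push check along these lines would flag the problem dumps. A rough sketch, where the `dumps/` path is just a placeholder for wherever our scraper output lands:

```python
import os

# Hypothetical location of scraper output; adjust to wherever the dumps live.
DUMPS_DIR = "dumps"

# Files above this size are the ones causing push/pull trouble for us;
# GitHub hard-rejects anything over 100MB regardless.
THRESHOLD_MB = 10

for name in sorted(os.listdir(DUMPS_DIR)):
    path = os.path.join(DUMPS_DIR, name)
    if not os.path.isfile(path):
        continue
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > THRESHOLD_MB:
        print(f"{name}: {size_mb:.1f} MB -> candidate for LFS / bucket storage")
```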

etenoch commented 4 years ago

At some point we should use actual storage buckets (S3). Git isn't really meant for this.

alexadusei commented 4 years ago

We should maybe have that conversation now if we want to include courses in the next release.

etenoch commented 4 years ago

Is it possible to just use regular Git? Or does it not work at all/not work reliably? I don't want to increase the scope but if we really need to address this, then we can discuss.

alexadusei commented 4 years ago

I believe I looked into that at one point, using a Python library called GitPython.

However, the maintainers acknowledge that its main limitation is leaking resources in long-running processes. But considering we're no longer running a long-lived tasks.py service, maybe we could try it again?
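
If we do revisit it, committing and pushing a dump with GitPython is only a few calls. A minimal sketch, assuming the datasets repo is already cloned locally with an `origin` remote (paths are placeholders):

```python
from git import Repo  # GitPython

# Hypothetical paths; assumes the datasets repo is already cloned locally.
REPO_DIR = "/path/to/datasets"
DUMP_FILE = "news/news.json"  # path relative to the repo root

repo = Repo(REPO_DIR)

# Stage the new/updated dump, commit it, and push to the default remote.
repo.index.add([DUMP_FILE])
repo.index.commit("Update news dataset")
repo.remote(name="origin").push()
```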

etenoch commented 4 years ago

Would LFS be any simpler to use from an upload perspective? Also, are you able to access the raw file with LFS like you can with a regular Git file?

alexadusei commented 4 years ago

Not simpler; it could be a bit involved. We're using a Python wrapper for the GitHub API called PyGithub, and its Git LFS support looks much thinner than what seems to be available out there.

I haven't looked too deeply into Git LFS, but I believe you're able to access the raw file (ref). I'll be looking into this later today. If we want to swap this out for S3 for convenience (and find a way to make it easily accessible via our datasets repo), then we can look into that instead.
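
On the upload side, one low-effort option (instead of going through PyGithub) would be to shell out to the `git lfs` CLI, since LFS tracking is just a `.gitattributes` entry. A rough sketch, assuming `git-lfs` is installed and the datasets repo is cloned locally; the paths, pattern, and branch name are placeholders:

```python
import subprocess

REPO_DIR = "/path/to/datasets"  # hypothetical local clone

def git(*args):
    # Run a git command inside the datasets repo, raising on failure.
    subprocess.run(["git", *args], cwd=REPO_DIR, check=True)

# One-time setup: tell LFS to handle the large JSON dumps.
git("lfs", "install")
git("lfs", "track", "news/*.json")
git("add", ".gitattributes")

# After that, large files commit and push as usual; LFS uploads the content
# separately and keeps a small pointer file in the Git history.
git("add", "news/news.json")
git("commit", "-m", "Track news dumps with Git LFS")
git("push", "origin", "master")
```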

etenoch commented 4 years ago

Alright, if it's not too much more work to upload in Python and retrieve in JS from LFS, then maybe it's a good solution. Using Git (over bucket storage) lets us stick with Git's built-in versioning (which we want to leverage on the ingest side).

alexadusei commented 4 years ago

Sounds good. I'm wondering if GitHub changed their limits. I remember it being 50MB, but apparently it's 100MB per file. We have no datasets that go over that amount (our largest is textbooks, sitting at 72.2MB). So I'll try out LFS if necessary once our pushing module is ready.
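
For a sanity check against whatever is currently on disk, something like this would report the largest dataset relative to that limit (a sketch; the `datasets/` path is a placeholder):

```python
from pathlib import Path

DATASETS_DIR = Path("datasets")  # hypothetical local checkout of our datasets
HARD_LIMIT_MB = 100              # GitHub rejects individual files over 100MB

files = [p for p in DATASETS_DIR.rglob("*") if p.is_file()]
largest = max(files, key=lambda p: p.stat().st_size)
largest_mb = largest.stat().st_size / (1024 * 1024)

print(f"Largest dataset: {largest} ({largest_mb:.1f} MB)")
print("Over the limit!" if largest_mb > HARD_LIMIT_MB else "Under the 100MB limit.")
```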

etenoch commented 4 years ago

There's a 1GB total repo size limit. I wonder how efficient it is at only storing the diffs of new files (if we upload a 72MB file 14 times, would that reach the repo limit?).

It looks like all of Cobalt's datasets are much, much smaller than ours. Only one(?) of theirs is over 1MB (and it's still pretty small). That leads me to think that GitHub might not be the best fit after all (maybe we should consider LFS or bucket storage now instead of later).

alexadusei commented 4 years ago

Yeah, the 1GB size limit is interesting. We could always pay for GH Enterprise to get a bigger allowance, but that would defeat the purpose of using GH for storage.

Also remember: those 72MB of news are from "all time", i.e. 2001-2018. If that's the average rate over ~18 years, and if it were evenly distributed, we'd only be producing ~4MB of news data per year, so it'd take us ~256 annual pushes to reach the repo limit. Pretty crude estimates, but you get the idea.
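
Spelling that estimate out (worst case assumes Git stores each push in full, with no delta compression):

```python
# Rough, back-of-envelope growth estimate for the news dataset.
total_news_mb = 72      # all news from 2001-2018
years_covered = 18
repo_limit_mb = 1024    # ~1GB total repo size limit

mb_per_year = total_news_mb / years_covered       # ~4MB of news per year
years_to_limit = repo_limit_mb / mb_per_year      # ~256 annual pushes

# For comparison: re-uploading the full 72MB dump every time instead.
full_dumps_to_limit = repo_limit_mb / total_news_mb  # ~14 pushes

print(f"~{mb_per_year:.0f} MB/year -> ~{years_to_limit:.0f} annual pushes to hit the limit")
print(f"Re-uploading everything: ~{full_dumps_to_limit:.0f} pushes to hit the limit")
```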

Cobalt is interesting because they're pretty much like us, minus the news. What's more interesting is how Waterloo's API collects news, and they're not having this problem (as far as we know) 🤔

LFS still might be more attractive so I can look into this. On the other hand, if we're saving news for later (whenever 'later' is in this case), we can stick with GH for now and investigate LFS once we have an initial release going.