Closed — handav closed this 6 years ago
We could also make any included datasets super small toy datasets, which would probably be anything <1000 images.
I've used Git LFS a couple of times. It works, but the downside is that it has quotas, and you need to pay to get more. Often running git lfs pull will throw:
batch response: This repository is over its data quota. Purchase more data packs to restore access.
Another option, if the files are only 313 MB, is to split them into four chunks and include a script that reassembles them into one file.
You can use something like:
split -b 20m --numeric-suffixes input_filename output_prefix
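A runnable sketch of the split-and-reassemble approach, using a small generated stand-in file (all filenames here are placeholders; `-b` splits on raw bytes, which suits binary data like image archives):

```shell
# Demo of split/reassemble with a small stand-in for the real dataset.
head -c 100000 /dev/urandom > dataset.bin                  # stand-in for the 313 MB file
split -b 20k --numeric-suffixes dataset.bin dataset.part.  # ~20 KB numbered chunks (use 20m for the real data)
cat dataset.part.* > dataset.reassembled                   # numeric suffixes sort lexically, so order is preserved
cmp dataset.bin dataset.reassembled && echo "reassembly OK"
```

Because `--numeric-suffixes` names the chunks `dataset.part.00`, `dataset.part.01`, …, a plain shell glob concatenates them back in the right order.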
What about the following?
I think managing the giant files in GitHub may end up being more of a hassle than it's worth, and not really what git/GitHub is intended for?
Would an S3 bucket do? We can use ITP's account!
I think that makes sense! I agree that we don't want giant files in github. The whole point is ease and simplicity, so I'll cut the dataset to a more manageable size for now, and we can talk about the logistics of hosting the larger datasets next week!
I'm running into difficulty uploading the dataset to github, which is probably to be expected. Specifically, it's too large (313 MB). Some ways around this:
1) The error message suggests using Git Large File Storage: https://git-lfs.github.com. Has anyone used this? Does it require the downloader to also install it? I don't want to create added steps for anyone using ML5.
2) We could assume that a third party will generally host datasets, and then write scripts or functions so that users can download and process the data from those sites. The downside of this is keeping the scripts updated, and not necessarily having data that ML5 can always refer to/control, especially for use with any examples that use it.
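For option 2, a download script could also verify a published checksum so we notice if the hosted data changes underneath us. A minimal sketch — the file:// URL below is a local stand-in for a real third-party host (e.g. an S3 bucket), and all names are illustrative:

```shell
# Sketch of a fetch-and-verify script for third-party-hosted datasets.
printf 'stand-in dataset bytes' > hosted-dataset.bin               # pretend this lives on the host
DATASET_URL="file://$PWD/hosted-dataset.bin"                       # would be an https:// URL in practice
EXPECTED_SHA256="$(sha256sum hosted-dataset.bin | cut -d' ' -f1)"  # checksum published alongside the data

curl -fsSL -o dataset.bin "$DATASET_URL"                 # users fetch instead of cloning giant files
echo "$EXPECTED_SHA256  dataset.bin" | sha256sum -c -    # fails loudly on a corrupted or changed download
```

The checksum step is what makes this maintainable: if the host silently updates or corrupts the file, the script errors out instead of handing users bad data.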
Thoughts?