ml5js / ml5-data-and-models

Data sets and pre-trained models for ml5.js
https://ml5js.org/docs/data-overview
MIT License

dataset is too large #4

Closed · handav closed this issue 6 years ago

handav commented 6 years ago

I'm running into difficulty uploading the dataset to GitHub, which is probably to be expected: it's too large (313 MB). Some ways around this:

1) The error message suggests using Git Large File Storage: https://git-lfs.github.com. Has anyone used this? Does it require the downloader to also install it? I don't want to create added steps for anyone using ML5.

2) We could assume that a third party will generally host datasets, and then write scripts or functions so that users can download and process the data from those sites. The downside of this is keeping the scripts updated, and not necessarily having data that ML5 can always refer to or control, especially for any examples that use it.

Thoughts?

handav commented 6 years ago

We could also make any included datasets super-small toy datasets, which would probably mean anything under 1,000 images.

cvalenzuela commented 6 years ago

I've used git lfs a couple of times. It works, but the downside is that it has storage and bandwidth quotas, and you need to pay to raise them. Doing git lfs pull will often throw:

batch response: This repository is over its data quota. Purchase more data packs to restore access. 
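
(And to answer the question above: yes, anyone cloning the repo also needs git-lfs installed; otherwise they only get small pointer files instead of the actual data.) The basic workflow is roughly this (the file pattern and path are just examples):

 git lfs install                 # one-time setup per machine
 git lfs track "*.tar.gz"        # tell LFS which files to manage; pattern is an example
 git add .gitattributes          # the tracking rules live in this file
 git add datasets/               # hypothetical path to the large files
 git commit -m "add dataset via LFS"
 git push origin master          # LFS uploads the tracked blobs separately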

Another option, since the files are just 313 MB, is to partition them into 4 files and then have a script that reassembles them into one file.

You can use something like:

 split -C 20m --numeric-suffixes input_filename output_prefix

This creates files like output_prefix00, output_prefix01, output_prefix02, ..., each at most 20 megabytes in size.
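
The reassembly script can then just concatenate the chunks in suffix order, ideally verifying a checksum afterwards. A minimal sketch (the .sha256 companion file is hypothetical, something we would publish alongside the chunks):

 cat output_prefix* > input_filename        # the glob expands in suffix order
 shasum -a 256 -c input_filename.sha256     # verify against a published checksum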

shiffman commented 6 years ago

What about the following?

  1. We include smaller "toy" versions of the datasets on GitHub.
  2. We host the larger datasets via some cloud-based service and include shell scripts in GitHub that download the datasets automatically, with instructions for how to run them? (Sketch at the end of this comment.)

I think managing the giant files in GitHub may end up being more of a hassle than it's worth, and not really what git/GitHub is intended for?

Would an S3 bucket do? We can use ITP's account!
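
Something like this is probably all the download script would need to be (a rough sketch; the bucket URL is a placeholder, not a real endpoint):

 #!/usr/bin/env bash
 # download_dataset.sh <name> -- fetch and unpack a dataset from cloud storage
 set -e
 BUCKET="https://ml5-datasets.s3.amazonaws.com"    # placeholder; real bucket TBD
 NAME="$1"
 curl -L -o "${NAME}.tar.gz" "${BUCKET}/${NAME}.tar.gz"
 tar -xzf "${NAME}.tar.gz"                         # unpack into the current directory
 rm "${NAME}.tar.gz"                               # clean up the archive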

handav commented 6 years ago

I think that makes sense! I agree that we don't want giant files in GitHub. The whole point is ease and simplicity, so I'll cut the dataset down to a more manageable size for now, and we can talk about the logistics of hosting the larger datasets next week!