nltk / nltk_data


Use unzipped files to facilitate contributions #128

Closed · ArthurClemens closed this issue 5 years ago

ArthurClemens commented 5 years ago

Having zipped folders in the repo makes it difficult to assess contributions.

PRs that update a dataset include a new zip file, which makes it hard to see what actually changed; there is no default tooling for diffing zip files.

I propose to use unzipped files for the source repo, and to zip up everything for releases/distributions.
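
For illustration, a minimal sketch (a hypothetical helper, not existing tooling) of what reviewers currently have to reconstruct by hand: which members of a corpus zip a PR added, removed, or changed.

```python
# Hypothetical review helper: summarise how two versions of a corpus zip differ.
import sys
import zipfile

def zip_diff(old_path, new_path):
    with zipfile.ZipFile(old_path) as old_zip, zipfile.ZipFile(new_path) as new_zip:
        old = {info.filename: info.CRC for info in old_zip.infolist()}
        new = {info.filename: info.CRC for info in new_zip.infolist()}
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(name for name in set(old) & set(new) if old[name] != new[name])
    return added, removed, changed

if __name__ == "__main__":
    # e.g. python zip_diff.py old/brown.zip new/brown.zip
    added, removed, changed = zip_diff(sys.argv[1], sys.argv[2])
    for label, names in (("added", added), ("removed", removed), ("changed", changed)):
        for name in names:
            print(label, name)
```

Even with a helper like this, reviewers only see which files changed, not line-level diffs, which is the point of the proposal above.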

stevenbird commented 5 years ago

The use of github for data distributions is already problematic, and we have been considering alternatives.

alvations commented 5 years ago

Although having non-zipped files is better for book-keeping, it comes at the risk of making git clone and the transfer of multiple files via nltk.download() really slow, especially when some corpora have >1000 files. I would vote against using multiple files instead of a single zip per dataset.

I've been experimenting with https://github.com/alvations/data; it's easier to manage versions and we can diff by rows when necessary. It requires a lot more work to get through all the datasets, but I think we can have a POC if we start with the popular collection from nltk_data first, make it pip-installable, and then add some extra code in nltk to handle reading data from the .tsv format.
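
A rough sketch of what that extra reading code could look like, assuming a hypothetical one-record-per-row .tsv layout with a header line (the actual layout in alvations/data may differ, and the column names here are made up):

```python
# Sketch only: stream records from a corpus stored as a .tsv file with a header
# row, instead of unpacking a zip archive. Column names are hypothetical.
import csv

def read_tsv_corpus(path):
    with open(path, encoding="utf-8", newline="") as fin:
        for row in csv.DictReader(fin, delimiter="\t"):
            yield row

# for record in read_tsv_corpus("brown.tsv"):
#     print(record["fileid"], record["text"])
```

Row-oriented text like this is what makes line-level diffs and git blame possible, which zip archives rule out.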


P/S: Fastai also maintains a list of datasets, but they don't handle packaging of the datasets: https://course.fast.ai/datasets.html

We have also started a thread discussing how to move forward and make data downloading and loading more modern, but there was no response from the developers: https://groups.google.com/forum/#!topic/nltk-dev/LjThWkAthwc Maybe discussing it here would get more attention.


Cf. the nltk-dev group post:

Dear NLTK contributors and devs,

Following up on the issue https://github.com/nltk/nltk/issues/2079, my proposal is to create a new NLTK data distribution such that users don't (only) rely on github.com to pull nltk_data. To quote the issue:

I've explored Kaggle Datasets, Dropbox, Zenodo, and even data distribution as PyPI packages. But there are always limitations around:

  • how available can the data be? I.e. does it require users to sign up for an account? How many hops/steps does a user need to take before they can get hold of the data to be read by nltk.corpus? Up till now, nothing beats the simplicity of pulling zip files from github.

  • how to track data provenance? I.e. when the data is updated, is there a version? How do we go back to track changes and possibly have some sort of git blame mechanism to debug what went wrong if it happens?

  • how much support is the CDN going to give? There's always a bandwidth limit for uploading/downloading files and also a storage size limit. I think the latter is cheap but the former is hard.

It is possible that we could get NLTK datasets onto Amazon S3 with free hosting, but it would take some time to port the data there, I'm not sure how the data would be accessed, and I'm not personally used to tracking data changes on S3. Additionally, if a contributor wants to add a new dataset or edit existing ones, how would this be done easily with S3 without the bottleneck of someone with admin access manually uploading it?
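
For illustration only, a rough sketch of what the S3 access path could look like with boto3, assuming a hypothetical bucket name; note that it does nothing to remove the publishing bottleneck, since someone with write access to the bucket would still have to push contributor changes:

```python
# Sketch only: mirror a dataset zip to/from a hypothetical S3 bucket with boto3.
import boto3

BUCKET = "nltk-data-example"  # hypothetical bucket name

def publish(zip_path, key):
    s3 = boto3.client("s3")
    s3.upload_file(Filename=zip_path, Bucket=BUCKET, Key=key)

def fetch(key, zip_path):
    s3 = boto3.client("s3")
    s3.download_file(Bucket=BUCKET, Key=key, Filename=zip_path)

# publish("packages/corpora/brown.zip", "corpora/brown.zip")
# fetch("corpora/brown.zip", "brown.zip")
```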

Here's a summary of some of the things we've explored https://docs.google.com/spreadsheets/d/10qPsmTAa707Ct_Fej6_BQg77l38-qEsT9eJPyaue8HY/edit?usp=sharing

ArthurClemens commented 5 years ago

You could maintain 2 branches: master with all zipped folders, and development with all text files.

master is updated every now and then (e.g. once a month) by copying the latest state of development with a zipping script.

Users are encouraged to clone the master branch only, using the command git clone https://github.com/nltk/nltk_data.git -b master
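
A minimal sketch of what that zipping script could look like, assuming the development branch keeps each package as an unzipped folder under a packages/<category>/<name>/ layout (the layout is an assumption for illustration):

```python
# Sketch only: zip every unzipped package folder under packages/<category>/
# into <name>.zip alongside it, ready to commit on the master branch.
import shutil
from pathlib import Path

def zip_packages(root="packages"):
    for category in Path(root).iterdir():        # e.g. corpora, taggers, ...
        if not category.is_dir():
            continue
        for package in category.iterdir():       # e.g. packages/corpora/brown/
            if package.is_dir():
                # produces packages/corpora/brown.zip containing brown/
                shutil.make_archive(str(category / package.name), "zip",
                                    root_dir=str(category), base_dir=package.name)

if __name__ == "__main__":
    zip_packages()
```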

stevenbird commented 5 years ago

I'm sorry we don't have capacity for taking on more, even though this is an excellent suggestion.