nltk / nltk_data

NLTK Data

License #102

Open djsutherland opened 6 years ago

djsutherland commented 6 years ago

Can you clarify what license the nltk_data files are under? Is it the same license as nltk? Do the various data files have different licenses? conda-forge would like to begin packaging nltk_data, because a few users have requested it (to make installation more uniform, track versioning, etc.; https://github.com/conda-forge/staged-recipes/pull/4463), but we'd need to know the license first.

alvations commented 6 years ago

The different resources in nltk_data come under different licenses. The licenses of the individual resources in nltk_data should be safe for redistribution.

It'd be great to package nltk_data. Would it be a pip-able data library?

djsutherland commented 6 years ago

It wouldn't be in pip, but you could get it with conda install nltk_data (assuming you've set up conda-forge: https://conda-forge.org).

I see now that the xml files specify the licenses of the data files. I guess the question is what license the xml files themselves have...they're so small that I doubt it really matters, but still not technically specified. Anyway, I guess we'll just say "License: Various" or whatever, still need to figure that out amongst ourselves though.

saswata64900 commented 5 years ago

One of our NLP projects is completely dependent on the NLTK tokenizer and POS tagger. But recently we figured out that the tokenizer and POS tagger models do not have a license, and hence we are not able to use them in our project. Is it possible to add a license for those two models? Are there any other open-source tokenizer and POS tagger models available on the net?

thesamesam commented 1 year ago

This remains a problem for distributions packaging nltk. Looking at https://www.nltk.org/nltk_data/, many of the entries have a blank licence/copyright field.

Would it be possible for nltk to construct a free/libre dataset which can be safely redistributed? Thanks.
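For anyone wanting to audit this mechanically, here is a minimal sketch of listing the entries whose licence field is blank. It assumes the index is XML made of `<package>` elements carrying `id` and `license` attributes; that layout, the `packages_missing_license` helper, and the sample entries are all illustrative assumptions, not a confirmed description of the real index.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics an nltk_data-style index.
# The element and attribute names are assumptions for illustration.
SAMPLE_INDEX = """\
<nltk_data>
  <packages>
    <package id="example_corpus_a" name="Example Corpus A" license="Public Domain"/>
    <package id="example_corpus_b" name="Example Corpus B" license=""/>
    <package id="example_model_c" name="Example Model C"/>
  </packages>
</nltk_data>
"""


def packages_missing_license(index_xml: str) -> list[str]:
    """Return ids of <package> elements whose license attribute is absent or blank."""
    root = ET.fromstring(index_xml)
    missing = []
    for pkg in root.iter("package"):
        # Treat a missing attribute and a whitespace-only value the same way.
        license_text = (pkg.get("license") or "").strip()
        if not license_text:
            missing.append(pkg.get("id", "<no id>"))
    return missing


if __name__ == "__main__":
    for package_id in packages_missing_license(SAMPLE_INDEX):
        print(package_id)
```

Running the same function over a downloaded copy of the real index would give a starting list of entries to review by hand; a non-blank field still says nothing about whether the stated terms permit redistribution.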

tomaarsen commented 1 year ago

Many of the NLTK data resources themselves contain licensing, copyright or README files that contain additional information on to what extent the data may be distributed. Perhaps that will help somewhat.

thesamesam commented 1 year ago

I did end up untarring the whole lot and taking a look, but many of them either had no README (etc.) or, if they did have one, it indicated they were proprietary.

mgorny commented 1 year ago

For the record, I'm removing NLTK from Gentoo because of this. IANAL but it looks like many of the corpora shouldn't be redistributed as part of nltk_data in the first place, and letting NLTK download them puts users at risk of copyright violation.