Add new "politicians_excuses" corpus with excuses texts of USA, Canada, UK politicians

HalynaOlesiuk commented 7 years ago

Could you please respond ASAP? This work is part of my master's work at the university, and I need to include the link to the added corpus in my results if you approve this corpus.

The URL can be: github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/excuses.zip
The data is stored in an XML file, so XMLCorpusReader is enough to read it.
To begin with, we can add a new folder, excuses, and put politicians_excuses.xml there. Later someone may add another kind of excuses.
The corpus will be redistributable under the Attribution-NonCommercial-ShareAlike license.
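
To illustrate the proposed layout and how the file could be read, here is a minimal sketch. The element and attribute names (`excuses`, `excuse`, `speaker`, `text`, `id`, `country`) and the sample sentences are invented placeholders, not the actual schema or content of politicians_excuses.xml. The sketch uses the standard library's ElementTree; NLTK's XMLCorpusReader returns the same kind of element from its `xml()` method, so a similar traversal should apply there. `load_excuses` is a hypothetical helper name.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical layout: the real schema of politicians_excuses.xml is not
# shown in this thread, so every element and attribute name is an assumption,
# and the two sample entries are invented placeholder text.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<excuses>
  <excuse id="1" country="USA">
    <speaker>Example Speaker A</speaker>
    <text>I regret that my words were taken out of context.</text>
  </excuse>
  <excuse id="2" country="UK">
    <speaker>Example Speaker B</speaker>
    <text>Mistakes were made, and I take responsibility.</text>
  </excuse>
</excuses>
"""


def load_excuses(xml_path):
    """Return a list of dicts, one per <excuse> element in the file."""
    root = ET.parse(str(xml_path)).getroot()
    records = []
    for node in root.iter("excuse"):
        records.append({
            "id": node.get("id"),
            "country": node.get("country"),
            "speaker": node.findtext("speaker", default=""),
            "text": node.findtext("text", default=""),
        })
    return records


# Mirror the proposed layout: an excuses/ folder holding politicians_excuses.xml.
corpus_root = Path(tempfile.mkdtemp()) / "excuses"
corpus_root.mkdir()
corpus_file = corpus_root / "politicians_excuses.xml"
corpus_file.write_text(SAMPLE_XML, encoding="utf-8")

excuses = load_excuses(corpus_file)
for record in excuses:
    print(record["id"], record["country"], record["speaker"])
```

Keeping one record per `<excuse>` element would let other kinds of excuses be added later as separate files in the same folder without changing the reading code.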

Thanks.

P.S.: The attachment contains the excuses file with 232 excuse speeches: excuses.zip

alvations commented 7 years ago

Thank you for proposing the contribution. I think it will be really useful.

Regarding the licensing, I think someone more familiar with licensing and data distribution should take a look at this first. I'm not sure whether you're able to release this corpus under CC-BY-NC-SA.

There are 2 components of the corpus:

source text + metadata
annotations

As for the annotations, they come from your project and they should be okay to release under any license you would choose.

But the source text comes from various sources that come with their own T&C, e.g.

The CBS ones are not okay for CC-BY-NC-SA: http://policies.cbslocal.com/terms-of-use/

The content, information, data, designs, code, and materials associated with the Services ("Content") are protected by intellectual property and other laws. You must comply with all such laws and applicable copyright, trademark, or other legal notices or restrictions.

Subject to these Terms, you may access and use the Services only for your own personal, non-commercial use. We reserve all other rights to the Services and Content, and you may not otherwise copy, reproduce, distribute, publish, display, perform, or create derivative works of the Services or Content without our permission. You also may not transfer or sublicense this limited right to use the Services or resell the Services.

ESPN (Disney) is a little more problematic:

The Disney Services are our copyrighted property or the copyrighted property of our licensors or licensees and all trademarks, service marks, trade names, trade dress and other intellectual property rights in the Disney Services are owned by us or our licensors or licensees. Except as we specifically agree in writing, no element of the Disney Services may be used or exploited in any way other than as part of the Disney Services offered to you. You may own the physical media on which elements of the Disney Services are delivered to you, but we retain full and complete ownership of the Disney Services. We do not transfer title to any portion of the Disney Services to you.

They should be fine under the fair use clause, since the excerpts are not a huge chunk of each article. But because they are not a huge chunk, the lack of context might make the annotations and data seem biased, and that could get you into trouble, esp. if the original content provider goes after the data distribution.

BTW, is this under submission to LREC or any other conference or workshop? It'll be easier to redistribute peer-reviewed datasets, esp. when it comes to politics-related datasets. And IMHO, if the models that people want to build depend on your data, data ethics is especially sensitive for this corpus.

stevenbird commented 7 years ago

@HalynaOlesiuk: thanks again for your interest in contributing a corpus. We have decided to document our assumptions about corpus contributions on our wiki here:

https://github.com/nltk/nltk/wiki/Adding-a-Corpus

This means we have decided not to accept this contribution at present.

nltk / nltk_data

Add new "politicians_excuses" corpus with excuses texts of USA, Canada, UK politicians #97