piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

provenance, copyright holders and licensing of `gensim/test/test_data/`? #3324

Open pabs3 opened 2 years ago

pabs3 commented 2 years ago

On behalf of my employer, I have packaged gensim for Debian:

https://tracker.debian.org/pkg/gensim

In the process of auditing the gensim git repository for inclusion in Debian, I noticed via web search engines that some of the files in the `gensim/test/test_data/` directory seem to have been copied from user comments on various websites such as IMDB. Presumably these comments were not owned by RaRe Technologies (or other gensim contributors) and were not licensed under the LGPL like the rest of gensim.

Other files appeared to have been copied from Wikipedia, which definitely isn't LGPL. Others seemed to be statistics computed from some data, and still others seemed to be generated files.

That led me to wonder about all of the files in the test data directory: where they came from, who owns them, and what license they are under. Since many of them are binary files, I also wondered how they were generated: from what data, with what tools, and under what copyright/licensing those tools fall.

Without answers to these questions I wasn't confident that I could get gensim into Debian quickly, so I removed this directory from the Debian source package and added some patches.

I don't know if it will be feasible to reconcile this difference between the gensim git repository and the Debian source package, but I wanted to bring this to your attention and start a discussion about it.

It was mentioned in another issue that gensim tests in some cases generate files at test time instead of relying on pre-generated binary files. Perhaps some of the other tests could be changed to do that too.
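A minimal sketch of what that could look like (a hypothetical test with made-up sample texts, not existing gensim test code): the corpus is built and serialized into a temporary directory at run time, so no binary fixture needs to live in git.

```python
import tempfile
import unittest
from pathlib import Path

from gensim import corpora


class TestGeneratedFixture(unittest.TestCase):
    def test_mmcorpus_roundtrip(self):
        # Build a tiny corpus in memory instead of shipping a binary file.
        texts = [
            ["human", "interface", "computer"],
            ["survey", "user", "computer", "system"],
        ]
        dictionary = corpora.Dictionary(texts)
        bow = [dictionary.doc2bow(text) for text in texts]

        with tempfile.TemporaryDirectory() as tmpdir:
            path = str(Path(tmpdir) / "generated.mm")
            # Serialize at test time, then read it back.
            corpora.MmCorpus.serialize(path, bow)
            loaded = corpora.MmCorpus(path)
            self.assertEqual(len(loaded), len(bow))


if __name__ == "__main__":
    unittest.main()
```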

For the cases where data is needed at test time, perhaps each data set could be in a separate directory and have a README alongside it detailing the provenance, copyright holders and licensing of each data set.
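For example, a per-dataset README might record something like the following (the fields and values here are only a suggested sketch, using the IMDB excerpt discussed further down as a stand-in):

```
Dataset: alldata-id-10.txt
Provenance: 10-line excerpt of the Large Movie Review Dataset,
            https://ai.stanford.edu/~amaas/data/sentiment/
Copyright holders: the original review authors / dataset compilers
License: unspecified; the dataset was offered for academic/research use
Generation: manual excerpt; no tools involved
```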

Some of the test data might no longer be needed and thus could be removed.

piskvorky commented 2 years ago

Yes, the test data could use a clean-up. There are open tickets around that, such as #2967. But honestly it's low priority, so I have no idea when we'll get to it.

I have no capacity to hunt for licenses of the IMDB dataset (and others), unfortunately. IIRC they come from academic papers. If that's an issue for your task / employer, I'd suggest omitting them from your distribution. I don't think any of those files are necessary for Gensim to work. If I'm not mistaken, they are only there for CI testing + some of the tutorials (CC @mpenkov @gojomo).

gojomo commented 2 years ago

I definitely think the directory deserves a clean-up, given the cruft that's accumulated, & think some largely-automated approach would be best.

Generally, my default assumption is that whoever added data to this directory believed, at the time, that there were no copyright barriers to its inclusion & its use in this way. But I can't vouch for that for any files I didn't personally add, as there's been no rigorous review.

As such data isn't quite 'source code', nor does it include any in-file, or near-file, claim of authorship or copyright, I don't believe there is any presumption or implied assertion that such files are themselves licensed under the LGPL. They're just riding along in an unspecified licensing state that's unlikely to rise to any level of liability/concern.

The data that appears to come via IMDB – 10 lines in the alldata-id-10.txt file – seems to be a tiny excerpt from a 50,000-review dataset that canonically originates from https://ai.stanford.edu/~amaas/data/sentiment/ but is widely mirrored elsewhere (Kaggle, Google TensorFlow, HuggingFace, etc). Both the manner in which it was freely offered for academic/research purposes (without formal copyright or licensing declarations), & the community practice of widespread mirroring/use, make me believe any relevant rightsholders approve. But even if they objected, 'fair use' standards, which are strong in the US and have some analogues elsewhere, would suggest a use of this scale/purpose sidesteps copyright concerns.

A similar analysis applies to the simlex999.txt file.

The only data I notice that appears to have possibly originated at Wikipedia are some brief article excerpts in the 11-year-old files para2para_text1.txt & para2para_text2.txt. I'm not sure these are in use anymore – a GitHub search for [para2para_text1] shows no references in current code. If we did want to include Wikipedia excerpts and be fastidiously compliant, it might be enough to add a small para2para_texts.readme note alongside them, to the effect of: "These texts are excerpts from a contemporaneous Wikipedia(link) dump, and thus remain derivative works under Wikipedia's license(link)."

piskvorky commented 2 years ago

Thanks for the investigation @gojomo. That matches what I remember – a non-issue except for highly theoretical what-if scenarios, which, while valid, are zero priority for me right now.

But if anyone wants to take this up, I'm willing to offer a review :)

@pabs3 how badly does your employer need this resolved?

pabs3 commented 2 years ago

I agree that for now there isn't really anything to be done with this issue, but thanks for the follow-ups. Some further thoughts below.

Agreed that none of these files are likely to pose any liability concern, but my main worry is that their licenses aren't compatible with the Debian Free Software Guidelines, or worse, that they aren't redistributable at all, unless one redistributes without a license and then relies on fair use to avoid liability.

For Debian, the default assumption for files with no clear licensing attached is that they were either created and owned by the project, and so fall under the same license as the rest of the project, or, if there are indications that they originated elsewhere, that they were All Rights Reserved when they were gathered. This is especially likely for machine learning datasets, which are usually pulled from websites without consulting, or obtaining a license from, the end users of those websites who added the data and are thus presumably the copyright holders. Often the ToS of the website (which most users do not read or really consent to) will include a clause about the website retaining a license to redistribute, but that does not necessarily apply to researchers, nor to redistributors downstream from the researchers.

Unfortunately, Debian and probably other redistributors cannot rely on the "fair use" concept, as it is not universal world-wide; for example, here in Australia we instead have "fair dealing", which is much more restrictive and does not allow the sort of use that is being suggested to be fair use. The fair use concept also probably does not deliver all of the freedoms required under the various definitions of libre software: the Free Software Definition, the Open Source Definition, and the Debian Free Software Guidelines (on which the OSD was based). For example, IIRC the "commercialness" of a particular use factors into the tests for determining whether fair use applies. It is also not a license, just a defence against infringement to be used in court.

These other files also definitely look like Wikipedia extracts; they are all compressed and UTF-16, which is probably why GitHub search can't find them:

- `bgwiki-latest-pages-articles-shortened.xml.bz2`
- `enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2`
- `enwiki-table-markup.xml.bz2`
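A minimal sketch of how to verify that (a hypothetical check using only the Python standard library, not something from gensim): decompress the first bytes of each file and look for a UTF-16 byte-order mark.

```python
import bz2

# UTF-16 files typically begin with a byte-order mark:
# b'\xff\xfe' (little-endian) or b'\xfe\xff' (big-endian).
for name in [
    "bgwiki-latest-pages-articles-shortened.xml.bz2",
    "enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2",
    "enwiki-table-markup.xml.bz2",
]:
    with bz2.open(name, "rb") as f:
        head = f.read(2)
    encoding = {b"\xff\xfe": "UTF-16 LE", b"\xfe\xff": "UTF-16 BE"}.get(head)
    print(name, "->", encoding or "no UTF-16 BOM (likely UTF-8)")
```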

-- bye, pabs

https://bonedaddy.net/pabs3/