unicode-org / unilex

Lexical data at Unicode

Publish under dual Unicode AND Open License #10

Open hugolpz opened 3 years ago

hugolpz commented 3 years ago

Your data is impressive work that could help many minority and rare languages gain stronger online representation.

The Wikimedia Foundation, Wikipedia, Wikidata, and @Lingua-Libre movements would love to use your data.

As far as I can see, your data is fully copyrighted but also releases core rights:

The Unicode Consortium releases this data under the same license as all its other data files. Copyright © 1991-2020 Unicode, Inc. All rights

And the LICENSE.md file states:

Copyright © 1991-2017 Unicode, Inc. All rights reserved. Distributed under the Terms of Use in http://www.unicode.org/copyright.html. Permission is hereby granted, free of charge, to any person obtaining a copy of the Unicode data files and any associated documentation (the “Data Files”) or Unicode software and any associated documentation (the “Software”) to deal in the Data Files or Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Data Files or Software, and to permit persons to whom the Data Files or Software are furnished to do so, provided that either (a) this copyright and permission notice appear with all copies of the Data Files or Software, or (b) this copyright and permission notice appear in associated Documentation.

At first glance, a visitor would assume this is a proprietary license with full copyright restrictions (I believed so for the past 2 years). Actually, your LICENSE.md document closely mirrors the GNU license. The GNU license was designed for software bundles. It was the first license used by Wikipedia, but the Wikimedia community later dropped it precisely because it requires attaching a page-long GNU document to every creation under the GNU license. After long discussions, the Wikimedia Foundation decided to switch to Creative Commons licenses, which only require attaching a short acronym such as CCBY-4.0 plus the source to the created product. That is a much more convenient way to spread open content in various contexts.

Request #11

Could you add a dual (second) CCBY-4.0 license to your LICENSE.md? The CCBY-4.0 license mirrors the terms of your current document, but in a more shareable way that fits more diverse contexts. See pull request #11.

cc @macchiati @brawer

brawer commented 3 years ago

For use by Wikimedia, CC0-1.0 would actually be more useful than CC-BY-4.0. In particular, Wikidata Lexemes are licensed under CC0-1.0, not CC-BY-4.0. But that’s a minor detail.

@macchiati, what do you think? Would the Unicode consortium be open to dual-licensing Unilex under the Unicode Data License + Creative Commons Zero? (Even simpler: Change the current license to CC0-1.0). Personally I think this would help to make the Unilex data more useful. But I wouldn’t know whom to ask to make it happen.

macchiati commented 3 years ago

Thanks for the heads-up. Let me talk to our lawyer about that.

hugolpz commented 3 years ago

I tried to replicate your Unicode license. But yes, Wikidata will love it if you publish under CC0, which is like public domain. If your objective is impact and to support language diversity, go for it. That's what we do ourselves (Wikimedia).

FYI: Wikidata lexeme progress can be followed at https://ordia.toolforge.org/language/ . As far as I know, Wikidata Lexemes need the triad language + form + POS, and then a lexeme can be created. We still have to catch up with state-of-the-art lists such as French's Lexique 3.83 and others, but we will get there.

macchiati commented 3 years ago

Unfortunately, our lawyer is swamped now (and will continue to be in the near future), so I can't say when we would be able to get to this.

hugolpz commented 3 years ago

Thank you for this push. License switching within institutions is a marathon. This is one more positive wave. 👍🏼

@macchiati, @brawer: I think we can rename this issue to "Request open license", consider the job done, and close it. We will return to it when the lawyer gives some feedback. I don't see any point in keeping a zombie issue here for months or a year or more.

brawer commented 3 years ago

We’re working on it. Let’s keep the bug open until either the license has been changed, or the request has officially been declined.

srl295 commented 8 months ago

@hugolpz As I understand it, Unicode artifacts (including ICU and CLDR) have already been used by Wikimedia for quite some time under the ICU and now Unicode license. Unilex is now updated with the current (v3) Unicode license, which is OSI-approved as an open license.

@nemobis (if that's the right handle) do you have a comment from the Unicode/WMF side?

nemobis commented 8 months ago

@srl295 Thanks for the ping. I wasn't aware of this issue but I'll give a quick reply. I've only read the discussion above and the README. I can't speak for WMF, let alone Unicode (I don't remember whether WMF is even a member now), but I can tell about the usage of Unicode components in MediaWiki software and Wikimedia wikis.

The issue description highlights some confusion about the licensing of this project. Meanwhile the LICENSE has been updated to the Unicode license v3, which was recently approved by OSI on 2023-11-17: https://opensource.org/license/unicode-license-v3/ . So there's no doubt this repository is open source. Maybe this can be explicitly mentioned in the README, as not everyone is able to recognize the license text as its own OSI-approved Unicode v3 license.

MediaWiki can and does use software under the Unicode license all the time, for example in the CLDR extension, which is primarily GPLv2, on the understanding that the CLDR data inside was under a BSD-like license. (Apertium linguistic data is also usually under GPL.) As long as Unilex can be used in GPL software, there are probably ways it can benefit all Wikimedia wikis through MediaWiki.

However, @hugolpz seems most concerned about usage in the content of Wikidata and other Wikimedia wikis. From the README it sounds like this repository mostly wants to collect uncopyrightable factual information. In the EU, there might still be problems with database rights. A general opinion from the WMF on how to handle these is at https://meta.wikimedia.org/wiki/Wikilegal/Database_Rights . In short, it's complicated, and it's easier to incorporate a dataset into Wikidata when it's already under CC-0. If there's some doubt on whether the data/ directory here as a whole is a dataset

If you want to cooperate with Wikidata lexemes in the future, it's worth considering how to make it easier. As for LinguaLibre, as far as I understand, it helps produce recordings which might be considered copyrightable, and it wants its outputs to be available under CC BY-SA, so it benefits from its sources being as permissive as possible.

Finally, I see that many files carry an SPDX-License-Identifier: Unicode-DFS-2016 header, which makes it easier to follow the REUSE guidelines. Note Richard Fontana's suggestion for trivial files at https://github.com/fsfe/reuse-docs/issues/62#issuecomment-1200305896 (and my personal opinion below it).
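For anyone who wants to see which files carry such a header, here is a minimal sketch. It is not an official REUSE tool and not part of this repository: the function names `spdx_identifier` and `scan`, the 10-line search window, and the `data` directory name are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches a header such as "SPDX-License-Identifier: Unicode-DFS-2016".
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(\S+)")


def spdx_identifier(text, max_lines=10):
    """Return the SPDX identifier found in the first max_lines lines, or None.

    The 10-line window is an arbitrary assumption, not a REUSE rule.
    """
    for line in text.splitlines()[:max_lines]:
        match = SPDX_RE.search(line)
        if match:
            return match.group(1)
    return None


def scan(root):
    """Map each file under root to its SPDX identifier (None if absent)."""
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # errors="replace" keeps the scan going on non-UTF-8 files.
            text = path.read_text(encoding="utf-8", errors="replace")
            results[str(path)] = spdx_identifier(text)
    return results


if __name__ == "__main__":
    # "data" is an assumed directory name; adjust to the checkout at hand.
    for name, ident in scan("data").items():
        print(f"{name}\t{ident or 'MISSING'}")
```

Files reported as MISSING would be the ones to look at when deciding whether a header (or a different treatment for trivial files) is appropriate.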

So in conclusion my personal suggestions are:

(This answer also archived/notified on https://lists.wikimedia.org/hyperkitty/list/mediawiki-i18n@lists.wikimedia.org/thread/7RYV4JPKL4XIDSDV5KBKFVWY6ZSQ7TPB/ .)

annebright commented 8 months ago

To add to Steven's comments, and to clarify, the Unicode License v3 (as well as the 2016 version) is based not on GNU but on the MIT license, which is about as open as it gets. The only difference between the MIT License and Unicode License is that the Unicode License expressly covers data files, which we see as an improvement over MIT, and presumably helpful to Wikimedia. We will not consider CC-0 or some other public domain declaration because we do not necessarily have the inbound license rights to do that.