github / choosealicense.com

A site to provide non-judgmental guidance on choosing a license for your open source project
https://choosealicense.com
MIT License

Test that license texts match SPDX plain license texts #636

Open mlinksva opened 5 years ago

mlinksva commented 5 years ago

We should have a test that each license text in _licenses is the same as the plain text license in the SPDX collection to automate the requirement described at https://github.com/github/choosealicense.com/blob/gh-pages/CONTRIBUTING.md#adding-a-license

The text of the license should match the corresponding text found in spdx/license-list-data. If there are errors there, please fix them in spdx/license-list-XML (from which the plain text version is generated) so as to minimize license text variation and make it easier for choosealicense.com to eventually consume license texts directly from SPDX.

The test could clone spdx/license-list-data and compare it against each license we have cataloged in this project. Many existing licenses would probably have to be marked as expected failures due to bugs in SPDX output and discrepancies in how this project has cataloged some licenses. But we should address this upfront for any new license cataloged here, and continue to chip away at the existing inconsistencies.
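A minimal sketch of what such a test might look like, written in Python purely for illustration (the project's own test suite may well use another language). It assumes a local clone of spdx/license-list-data with plain texts at `text/<SPDX-ID>.txt`, and that each file in `_licenses` starts with a YAML front matter block delimited by `---` lines containing an `spdx-id` field. The function names and the `EXPECTED_FAILURES` set are hypothetical.

```python
import re
from pathlib import Path

# Hypothetical set of SPDX IDs known to differ today; meant to shrink over time.
EXPECTED_FAILURES = {"MIT"}

# Assumed layout: "---", metadata lines, "---", then the license body.
FRONT_MATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)


def split_front_matter(raw):
    """Return (metadata text, license body) for a _licenses-style file."""
    match = FRONT_MATTER.match(raw)
    if match is None:
        return "", raw
    return match.group(1), raw[match.end():]


def spdx_id(metadata):
    """Extract the value of the assumed `spdx-id` front matter field."""
    match = re.search(r"^spdx-id:\s*(\S+)\s*$", metadata, re.MULTILINE)
    return match.group(1) if match else None


def normalize(text):
    """Collapse whitespace runs so line wrapping alone is not a mismatch."""
    return " ".join(text.split())


def compare_licenses(licenses_dir, spdx_text_dir):
    """Return {identifier: status} for every .txt file in licenses_dir."""
    results = {}
    for path in sorted(Path(licenses_dir).glob("*.txt")):
        metadata, body = split_front_matter(path.read_text(encoding="utf-8"))
        identifier = spdx_id(metadata)
        if identifier is None:
            results[path.name] = "no spdx-id"
            continue
        spdx_file = Path(spdx_text_dir) / f"{identifier}.txt"
        if not spdx_file.exists():
            results[identifier] = "missing in SPDX"
            continue
        same = normalize(body) == normalize(spdx_file.read_text(encoding="utf-8"))
        if same:
            results[identifier] = "match"
        elif identifier in EXPECTED_FAILURES:
            results[identifier] = "expected failure"
        else:
            results[identifier] = "mismatch"
    return results
```

A test runner could then fail the build on any `"mismatch"` entry while tolerating `"expected failure"`, so that new licenses are held to the rule and existing discrepancies can be fixed gradually.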

travi commented 5 years ago

The latest SPDX version changes the text of the MIT license slightly, compared to the version currently on choosealicense.com. Do you have plans for how you want to handle old and new versions of licenses that change over time?

I use spdx-license-list when scaffolding new projects and the latest version updates its list to v3.4, which includes the change. Since updating to this version my new projects show unrecognized licenses, such as this one.

mlinksva commented 5 years ago

@travi thanks for pointing that out. The change is the optional text added at https://github.com/spdx/license-list-XML/commit/ca17b9160aab8acce72fe62720e63f68a782406b#diff-a3960b442eb635386ec51a5d6d15af2d

For better or worse, SPDX doesn't AFAIK distinguish between "optional but not usual" and "optional but preferred" text, and outputs all optional text in https://github.com/spdx/license-list-data/blob/2d27e4c31441af8f343eba0293d03d27707d9c02/text/MIT.txt

I don't think we can or should move to an MIT text that includes the optional text here. We can't, because that would cause license detection problems for most existing MIT licenses, given the way licensee (which GitHub uses) is tied to the texts curated here (choosealicense.com). We shouldn't, because I'd rather encourage adoption of the most widely used text, which doesn't include the added optional text.

There are tons of variations on MIT text, I've linked a paper about that a few times.

I don't have a plan to implement, but here's what I'd like to see:

Since updating to this version my new projects show unrecognized licenses

I would recommend not including the optional text now published by SPDX.

For anyone who insists on doing that: yes, GitHub will identify that there is a license, but not which one it is, and will show "View license" rather than "MIT".

Presently the only way for licensee to deal with optional text is to normalize it away before matching, but I think it would need to be a super common variation to justify doing that. Feel free to open an issue in licensee/licensee if you want to pursue there.
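To illustrate what "normalizing away" optional text before matching could mean, here is a rough Python sketch. It is not licensee's implementation (licensee is a Ruby project with its own normalization rules). The phrase in `OPTIONAL_PASSAGES` is an assumed example of optional wording, and both function names are hypothetical.

```python
# Assumed optional passages, for illustration only; a real list would have to
# be curated from the SPDX templates.
OPTIONAL_PASSAGES = ["(including the next paragraph)"]


def normalize(text, optional_passages=OPTIONAL_PASSAGES):
    """Lowercase, drop known optional passages, and collapse whitespace."""
    normalized = " ".join(text.lower().split())
    for passage in optional_passages:
        needle = " ".join(passage.lower().split())
        normalized = normalized.replace(needle, " ")
    # Removing a passage can leave doubled spaces, so collapse again.
    return " ".join(normalized.split())


def same_license(candidate, reference):
    """True when the two texts are equal after normalization."""
    return normalize(candidate) == normalize(reference)
```

The trade-off is that every passage added to such a list makes matching more permissive, which is why this only seems worthwhile for very common variations.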

reversi-fun commented 5 years ago

My tool may help with this issue. I tried using it to compare the license texts here with the SPDX plain license texts.

The two enhancements will make it easier to test continuously.

My tool can output the degree of similarity between documents and their word counts, using a library called gensim. For example, we could automatically find the similarity between the license texts below: {spdx/LGPL-3.0-only, spdx/LGPL-3.0-or-later, research/choosealicense.com-gh-pages/_licenses/lgpl-3.0}

Currently there is no spdx/LGPL-3.0. You can still find SPDX IDs with similar license texts, even when a file name in _licenses (such as lgpl-3.0.txt) does not correspond to an SPDX ID.

For example, the similarity between the following two license files was 0.796, and the difference in word count was +130. [Image: lic-lgpl3]
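As a rough illustration of the two measurements mentioned above (a similarity score and a word-count difference), the following Python sketch uses only the standard library. It substitutes a plain bag-of-words cosine similarity for whatever gensim model the tool actually uses, so its scores will not reproduce the 0.796 figure. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def words(text):
    """Split text into lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity of the two texts' word-count vectors (0.0 to 1.0)."""
    counts_a, counts_b = Counter(words(text_a)), Counter(words(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_count_difference(text_a, text_b):
    """Number of words in text_a minus number of words in text_b."""
    return len(words(text_a)) - len(words(text_b))
```

A positive word-count difference with a similarity below 1.0 is what one would expect when one file carries an extra header section on top of otherwise identical license text.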

The difference in word count comes from the words in the header section. My tool marks in red any license whose text contains the word "PATENT". The file _licenses/lgpl-3.0 contains "Contributors provide an express grant of patent rights" in its header description section. The box above (_licenses/lgpl-3.0) would not have been marked red if the plain text without the leading header had been used as input.

The comparison results for all other license texts are as follows. https://github.com/reversi-fun/license_doc_similality1/blob/master/data/lic_graph.fdp.svg

You can confirm that choosealicense.com-gh-pages/_licenses and "spdx license plain texts" were all similar by the following search.

mlinksva commented 4 years ago

@darkmorpher licensee can recognize both the GNU hosted text and SPDX version. You're pointing to a non-master branch in the MT repo. There is no license or copying file in the master branch in the root, that's why no license is detected. If you find a bug that you can reproduce in licensee, please open an issue in the licensee repo.

sschuberth commented 4 years ago

We should have a test that each license text in _licenses is the same as the plain text license in the SPDX collection

IMO this is not a desirable goal as long as SPDX tampers with the original plain text version (if any) of a license, also see https://github.com/spdx/license-list-data/issues/44. This is because SPDX does not take a plain-text license as-is, but regenerates it from its own XML representation (as also described in the original post of this issue).

mlinksva commented 4 years ago

@sschuberth yes, I'm well aware of that. As I've written before (but am too lazy to search for now), I'd love to see the SPDX plain text renderings be as close as possible to the canonical plain text versions of licenses, and over the years I have contributed a few small fixes toward that. As I wrote in the issue comment above:

Many existing licenses would probably have to be marked as expected failures due to bugs in SPDX output and discrepancies in how this project has cataloged some licenses. But we should address this upfront for any new license cataloged here, and continue to chip away at the existing inconsistencies.

darkmorpher commented 1 year ago

RE: @mlinksva

https://github.com/github/choosealicense.com/issues/636#issue-398608094

(If still an issue) As a test case, can you add one of these GitHub Actions to compare the plain text licenses with the SPDX data files in a new branch? Granted, all required files would have to be copied there too, and the repo would end up with duplicate files.


(Previous discussion, may no longer apply) I should mention that the slightest change in the text (formatting or punctuation) trips up licensee/licensee and makes recognition fail.

Edit: per reply, added/updated example:
GNU hosted text: https://www.gnu.org/licenses/gpl-3.0.txt
SPDX version: https://raw.githubusercontent.com/spdx/license-list-data/master/text/GPL-3.0-only.txt

The GNU hosted text is not recognized by licensee, as seen here: repository master

Essentially, anyone grabbing a copy of the GPL from the GNU site will have this issue (especially GitHub repositories imported or mirrored from GNU Savannah's git repository server).

This is fundamentally different from adding an attribution header, where a more complex detection method is needed.
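To see exactly which formatting or punctuation differences separate two copies of a license (for example the GNU hosted text and the SPDX rendering, once downloaded), a small diff helper along these lines could be used. This is a sketch built on Python's standard `difflib` module; the function name `license_diff` is hypothetical and it has nothing to do with how licensee itself matches texts.

```python
import difflib


def license_diff(text_a, text_b, name_a="a", name_b="b"):
    """Return unified-diff lines between two texts, ignoring trailing spaces."""
    lines_a = [line.rstrip() for line in text_a.splitlines()]
    lines_b = [line.rstrip() for line in text_b.splitlines()]
    return list(
        difflib.unified_diff(lines_a, lines_b, name_a, name_b, lineterm="")
    )
```

An empty list means the two texts differ at most in trailing whitespace; any other result lists the lines that would need to be reconciled.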

Related: licensee/licensee/issues/387 | licensee/licensee/issues/416