spdx / LicenseListPublisher

Tool that generates license data found in the license-list-data repository from the license-list-XML source
Apache License 2.0

Add license text from the test files to resolve text formatting issues #83

Closed by goneall 3 years ago

goneall commented 3 years ago

Signed-off-by: Gary O'Neall <gary@sourceauditor.com>

goneall commented 3 years ago

Resolves the following issues:

sschuberth commented 3 years ago

BTW, I finally gave this a look, and IMO it does not fix the issue properly. Just compare e.g. https://github.com/spdx/license-list-data/blob/master/text/Apache-2.0.txt to https://www.apache.org/licenses/LICENSE-2.0.txt. For example all the leading indentation is stripped, so the formatting is still broken compared to upstream.
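
The indentation loss described above can be checked mechanically. The following is a minimal Python sketch, assuming both texts are already loaded as strings and line up one-to-one; the function name `indentation_mismatches` and the sample strings are hypothetical illustrations, not part of LicenseListPublisher or the actual license files.

```python
def leading_spaces(line: str) -> int:
    """Count the leading space characters on a line."""
    return len(line) - len(line.lstrip(" "))


def indentation_mismatches(upstream: str, generated: str) -> list[int]:
    """Return 1-based line numbers whose stripped text matches
    but whose leading indentation differs.

    Assumption: both texts have the same line breaks, so lines can
    be paired positionally. Extra lines in either text are ignored.
    """
    mismatches = []
    pairs = zip(upstream.splitlines(), generated.splitlines())
    for number, (up, gen) in enumerate(pairs, start=1):
        same_words = up.strip() == gen.strip()
        if same_words and leading_spaces(up) != leading_spaces(gen):
            mismatches.append(number)
    return mismatches


# Hypothetical samples: the "generated" text has lost its indentation.
upstream_sample = '   1. Definitions.\n\n      "License" shall mean the terms.\n'
generated_sample = '1. Definitions.\n\n"License" shall mean the terms.\n'

print(indentation_mismatches(upstream_sample, generated_sample))  # [1, 3]
```

If the generated text has also been re-wrapped, the positional pairing no longer holds and a real diff tool would be a better fit.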

goneall commented 3 years ago

I went back through the code and found 2 issues:

1) There was still some formatting being done to word-wrap the text files.
2) The Apache-2.0 test file is not the canonical license text. See https://raw.githubusercontent.com/spdx/license-list-XML/master/test/simpleTestForGenerator/Apache-2.0.txt

I can easily fix 1 above - I'll create a separate PR.

For 2, the license-list-XML repo will need to be updated with the correct text. I looked at other licenses, and most of them have had the line breaks for word-wrapping removed from the original text. Fixing these would require someone (or someones) to go through and replace the test text with the canonical text - a rather large effort.
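
One rough way to triage which test files have had their word-wrap line breaks removed is to look for unusually long lines. This is only a heuristic sketch in Python; the function name `looks_unwrapped` and the 100-character threshold are assumptions chosen for illustration, and a flagged file would still need to be compared against the canonical text by hand.

```python
def looks_unwrapped(text: str, max_width: int = 100) -> bool:
    """Heuristic: report True if any line is longer than max_width.

    Canonical license texts are usually hard-wrapped at a modest
    width, so a very long line suggests the breaks were removed.
    The default of 100 is an arbitrary illustrative threshold.
    """
    return any(len(line) > max_width for line in text.splitlines())


# Hypothetical samples: one hard-wrapped paragraph, one flattened.
wrapped = "Licensed under the Example License;\nyou may not use this file\nexcept in compliance.\n"
flattened = " ".join(["Licensed under the Example License;"] * 5) + "\n"

print(looks_unwrapped(wrapped))
print(looks_unwrapped(flattened))
```

Run over every file in the test directory, a check like this would give a candidate list to review rather than a definitive answer.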

sschuberth commented 3 years ago

> Fixing these would require someone (or someones) to go through and replace the test text with the canonical text - a rather large effort.

I could probably help with this. But speaking about this, I've always wondered why the files in https://github.com/spdx/license-list-data/blob/master/text/ aren't simply copies of the canonical upstream texts, and why the test in https://raw.githubusercontent.com/spdx/license-list-XML/master/test/simpleTestForGenerator/ doesn't simply use those files (e.g. included as a Git submodule). It seems odd to me that, currently, the only place where plain copies of the canonical upstream texts are used is a repository called "license-list-XML".

goneall commented 3 years ago

> I could probably help with this.

That would be great 👍

> I've always wondered why the files in https://github.com/spdx/license-list-data/blob/master/text/ aren't simply copies of the canonical upstream texts, and why the test in https://raw.githubusercontent.com/spdx/license-list-XML/master/test/simpleTestForGenerator/ doesn't simply use those files (e.g. included as a Git submodule).

The reason is that there isn't a repository of upstream texts to reference. It would take quite a bit of effort to create such a repository for the hundreds of files.

In many cases, the files stored in the license-list-XML test directory are copies of the upstream text.

The proposal is that we just use the license-list-XML test directory files as the upstream representation.

The tools that generate the license-list-data have already been updated to just copy the text from https://raw.githubusercontent.com/spdx/license-list-XML/master/test/simpleTestForGenerator/ to the license text.

This PR just removes the word-wrapping being done against the copies. Once we merge this PR, each generated text file "should" just be an exact copy of the corresponding file in the license-list-XML test directory.
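
To illustrate why removing the word-wrapping step matters, here is a small Python sketch contrasting a byte-for-byte copy with a re-wrapped copy. It is not the publisher's code: the file names, the sample text, and the use of `textwrap.fill` as a stand-in for the tool's wrapping are all assumptions made for the example.

```python
import tempfile
import textwrap
from pathlib import Path

# Hypothetical sample with indentation and a blank line.
original = '   1. Definitions.\n\n      "License" shall mean the terms and conditions.\n'

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.txt"
    target = Path(tmp) / "target.txt"
    source.write_bytes(original.encode("utf-8"))

    # Verbatim copy: the same bytes go in and come out.
    target.write_bytes(source.read_bytes())
    copied = target.read_bytes().decode("utf-8")

# Re-wrapping replaces the original line breaks with spaces and
# breaks the text again at the requested width.
rewrapped = textwrap.fill(original, width=80)

print(copied == original)     # True
print(rewrapped == original)  # False
```

Copying bytes rather than decoded text also avoids accidental changes to line endings or encoding along the way.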

goneall commented 3 years ago

I created a PR in the license-list-XML repo to recommend that plain text test files should match the text and formatting of the original license: https://github.com/spdx/license-list-XML/pull/1160

Feel free to comment on any process-related suggestions in the PR.