spdx / license-list-XML

This is the repository for the master files that comprise the SPDX License List

Formatting of plain license text in JSON data is broken #1924

Open goneall opened 6 years ago

goneall commented 6 years ago

Moving issue from SPDX tools. Originally submitted by @sschuberth

Taking Apache-2.0 as an example: when extracting the licenseText string to a file, I'd expect that file to be formatted exactly like the original plain text license, including leading spaces and blank lines. However, the JSON string is formatted like this:

Apache License

Version 2.0, January 2004

http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

(note the missing leading spaces but added trailing spaces) which not only does not match the original text but also is quite ugly.
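A minimal sketch of the check described above, assuming the license JSON exposes the text under a top-level `licenseText` key as mentioned in this report. The helper name `differs_only_in_whitespace` and the two short sample strings are illustrative stand-ins, not the real files.

```python
import json


def differs_only_in_whitespace(a: str, b: str) -> bool:
    """Return True if a and b are unequal but match once whitespace runs are collapsed."""
    if a == b:
        return False
    return a.split() == b.split()


# Illustrative stand-ins: a tiny JSON document and a matching "upstream" text.
license_json = json.dumps(
    {
        "licenseId": "Apache-2.0",
        "licenseText": "Apache License\n\nVersion 2.0, January 2004 \n",
    }
)
upstream_text = (
    "                                 Apache License\n"
    "                           Version 2.0, January 2004\n"
)

extracted = json.loads(license_json)["licenseText"]

# Exact comparison: fails because of leading/trailing spaces and blank lines.
print(extracted == upstream_text)
# Whitespace-insensitive comparison: the words themselves are identical.
print(differs_only_in_whitespace(extracted, upstream_text))
```

A check like this only separates "same words, different layout" from real textual differences; it does not say which layout is the correct one.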

goneall commented 6 years ago

The way we are maintaining the license information in the license-list-XML github repository it is not feasible to retain the formatting of the original text since XML removes the white space and we do not have enough tags to retain all of the formatting.

That being said, we could do a better job of formatting the text and making it look prettier.

The code for this has actually moved to a different project: LicenseListPublisher

sschuberth commented 6 years ago

The way we are maintaining the license information in the license-list-XML github repository it is not feasible to retain the formatting of the original text

I believe that's exactly the problem then, and XML shouldn't be used as the primary format to store the original text. It could still be used as a format to apply SPDX-specific formatting, however.

jeffmcaffer commented 5 years ago

+1 to maintaining the original format of the licenses. Currently, for example, these two license texts differ by newlines. https://github.com/spdx/tools/blob/master/resources/stdlicenses/MIT.jsonld https://github.com/OpenSourceOrg/licenses/blob/master/texts/plain/MIT

While it is not a huge deal for consumers to find some wordwrapping implementation and run the text through before, say, generating a NOTICE file, it is extra hassle and will lead to apparent differences. Would be great to generate clarity and simplicity around licenses by using the same canonical form everywhere.
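For illustration, the kind of word-wrapping pass a consumer might run could look like the sketch below, using Python's standard `textwrap` module. The function name `rewrap`, the width of 60, and the sample text are assumptions made for this example. Note that such a pass only limits line length; it cannot restore the original line breaks, which is why the outputs of different consumers may still differ.

```python
import textwrap


def rewrap(text: str, width: int = 80) -> str:
    """Wrap each blank-line-separated paragraph to `width`, keeping its leading indent."""
    wrapped = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip("\n")
        if not paragraph.strip():
            continue  # skip empty paragraphs caused by extra blank lines
        body = paragraph.lstrip(" ")
        indent = paragraph[: len(paragraph) - len(body)]
        wrapped.append(
            textwrap.fill(
                " ".join(body.split()),  # collapse inner whitespace runs
                width=width,
                initial_indent=indent,
                subsequent_indent=indent,
            )
        )
    return "\n\n".join(wrapped)


sample = (
    "Apache License\n\n"
    '      "License" shall mean the terms and conditions for use, reproduction, '
    "and distribution as defined by Sections 1 through 9 of this document."
)
result = rewrap(sample, width=60)
print(result)
```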

goneall commented 3 years ago

Resolved in PR spdx/LicenseListPublisher#83

sschuberth commented 2 years ago

I'm reopening this to remind myself that the issue hasn't really been fixed yet. While PR spdx/LicenseListPublisher#83 laid the foundation for getting it fixed, https://raw.githubusercontent.com/spdx/license-list-data/b8d6af45ad2fcfed61bb85a8ad068aa4a77eadf9/text/Apache-2.0.txt still does not match https://www.apache.org/licenses/LICENSE-2.0.txt formatting-wise.

If I understand @goneall correctly, the remaining thing to do is to commit the original / upstream plain text licenses to https://github.com/spdx/license-list-XML/tree/master/test/simpleTestForGenerator and then rerun this publisher to make the correct licenses show up at https://github.com/spdx/license-list-data/tree/master/text. I'll try to write a script for that to finally resolve this long-standing issue.
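Before replacing any files, the formatting mismatch between a published text and its upstream original can be made visible with a line diff. The following is only a sketch using Python's standard `difflib`; the function name `formatting_diff` and the two short sample strings are hypothetical, and in practice the inputs would be the contents of the two files linked above.

```python
import difflib


def formatting_diff(published: str, upstream: str) -> list[str]:
    """Return unified-diff lines between the published and upstream license text."""
    return list(
        difflib.unified_diff(
            published.splitlines(),
            upstream.splitlines(),
            fromfile="published",
            tofile="upstream",
            lineterm="",
        )
    )


# Hypothetical sample inputs standing in for the real file contents.
published = "Apache License\nVersion 2.0, January 2004\n"
upstream = "        Apache License\n   Version 2.0, January 2004\n"

for line in formatting_diff(published, upstream):
    print(line)
```

An empty result means the two texts are identical line by line, so such a check could be rerun after the publisher has regenerated the text files.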

goneall commented 1 year ago

@sschuberth - Just going through the older issues. Any thoughts or progress on updating the text in the license-list-XML repo?

sschuberth commented 1 year ago

Sorry @goneall, this issue has slipped my mind. But would you agree that the mentioned approach is the way to go:

the remaining thing to do is to commit the original / upstream plain text licenses to https://github.com/spdx/license-list-XML/tree/master/test/simpleTestForGenerator and then rerun this publisher to make the correct licenses show up at https://github.com/spdx/license-list-data/tree/master/text.

goneall commented 1 year ago

@sschuberth I agree with the above approach.

I'll move this issue over to the license-list-XML repo since this is where the work will be done.

@swinslow @jlovejoy FYI - if you disagree with updating the test text to fix the formatting in JSON, please add to this issue and cc @sschuberth

jlovejoy commented 1 year ago

@sschuberth @goneall - I'm not sure I'm following the implementation details here, but I think the goal is to get to a point where the text files at https://github.com/spdx/license-list-XML/tree/main/test/simpleTestForGenerator are "formatted" to look like or reflect any original text file for a given license (e.g., https://www.apache.org/licenses/LICENSE-2.0.txt ) or at least have some form of line length limit to avoid horizontal scrolling?

if we do that, then the formatting will show up better at https://github.com/spdx/license-list-data/tree/master/text.

is that right-ish?

I'm all in favor of better formatting such that people can "reuse" text files. I think we need to document which text file directory is the best to use as well.

Also, keep in mind that the text files created in https://github.com/spdx/license-list-XML/tree/main/test/simpleTestForGenerator are created as part of the PR when the license is accepted to the SPDX License List. We have a GSoC project that would add functionality to create this text file automatically via the online submission tool, instead of people having to create it manually. So, any formatting parameters should be included for that project.

goneall commented 1 year ago

@jlovejoy

if we do that, then the formatting will show up better at https://github.com/spdx/license-list-data/tree/master/text.

Close - the specific issue is related to the JSON files, but the formatting for the JSON and the text files comes from the same source.

Sounds like you're in general agreement