spdx / license-list-XML

This is the repository for the master files that comprise the SPDX License List.

JSON and website inexactly match for AGPL-1.0 which forbids non-verbatim copies #2358

Open workingjubilee opened 8 months ago

workingjubilee commented 8 months ago

THE VERY SHORT VERSION: Translating XML to JSON seems to result in significant differences between the JSON and rendered website text.

I printed the JSON text data from https://github.com/spdx/license-list-data/blob/main/json/details/AGPL-1.0.json using a Rust program, after converting the \u2007 escape sequence into the \u{2007} form that Rust recognizes. Later experiments with JS REPLs seem to yield exactly matching text output. The result was this: LICENSE.txt. Yet it differs from what the website renders, because the website's rendered version looks like:

AFFERO GENERAL PUBLIC LICENSE
Version 1, March 2002

Copyright © 2002 Affero Inc.
510 Third Street - Suite 225, San Francisco, CA 94107, USA

However, the JSON-tripped version is:

AFFERO GENERAL PUBLIC LICENSE
Version 1, March 2002 Copyright © 2002 Affero Inc. 510 Third Street - Suite 225, San Francisco, CA 94107, USA

Note that both get the first line right and both begin the second line the same way, but the website continues across three more lines where the JSON puts everything on the second. The JSON data for `"licenseText"` up to that point is the following:

"licenseText": "AFFERO GENERAL PUBLIC LICENSE\nVersion 1, March 2002 Copyright © 2002 Affero Inc. 510 Third Street - Suite 225, San Francisco, CA 94107, USA\n\n

The XML data looks like:

      <titleText>
         <p>AFFERO GENERAL PUBLIC LICENSE
          <br/>Version 1, March 2002 </p>
      </titleText>
      <p>Copyright © 2002 Affero Inc.
      <br/> 510 Third Street - Suite 225, San Francisco, CA 94107, USA</p>

That is, it includes a pair of <br/>s here, one in each <p></p> pair, which I believe accounts for the rendered spacing on the website. As a result, copying the text from the website produces a LICENSE-RIGHTCLICK.txt that tools like askalono report as an inexact match, despite it being, as far as I know, an exact copy!
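To illustrate how much the <br/> handling matters, here is a minimal Python sketch that flattens the XML excerpt above in two ways: once treating each <br/> as a line break, and once treating it as ordinary whitespace. The `<text>` wrapper element and the helper names (`squeeze`, `render_paragraph`, `render`) are my own assumptions for the illustration. This is not how LicenseListPublisher or the website generator actually works, only a demonstration that the line-break information is recoverable from the XML.

```python
import re
import xml.etree.ElementTree as ET

# The XML excerpt quoted above, wrapped in a hypothetical <text> root
# element so that it parses as a single document.
XML = """<text>
      <titleText>
         <p>AFFERO GENERAL PUBLIC LICENSE
          <br/>Version 1, March 2002 </p>
      </titleText>
      <p>Copyright © 2002 Affero Inc.
      <br/> 510 Third Street - Suite 225, San Francisco, CA 94107, USA</p>
</text>"""


def squeeze(s):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", s).strip()


def render_paragraph(p, keep_breaks):
    """Flatten one <p>; each <br/> becomes a newline or a plain space."""
    segments = [p.text or ""]
    for child in p:
        if child.tag == "br":
            # Text following a <br/> starts a new segment.
            segments.append(child.tail or "")
        else:
            # Any other inline element is folded into the current segment.
            segments[-1] += "".join(child.itertext()) + (child.tail or "")
    separator = "\n" if keep_breaks else " "
    return separator.join(squeeze(s) for s in segments)


def render(root, keep_breaks):
    """Render every <p> in document order, separated by a blank line."""
    return "\n\n".join(render_paragraph(p, keep_breaks) for p in root.iter("p"))


root = ET.fromstring(XML)
with_breaks = render(root, keep_breaks=True)
without_breaks = render(root, keep_breaks=False)

print(with_breaks)
print("-----")
print(without_breaks)
```

With `keep_breaks=True` the output has the same five-line layout as the website rendering quoted earlier. With `keep_breaks=False` each paragraph collapses onto one line, which is the kind of loss I suspect is happening somewhere, although this sketch does not reproduce the JSON text exactly.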

Note that the AGPL 1.0 has the clause:

"Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed."

I have excerpted this quote in a standard citational form but I have not added emphasis because, as the license says... changing it is not allowed. This suggests one of the two forms, the XML-encoded text, or the JSON string, is meaningfully incorrect, as they render to substantively different displayed text by typical renderers for their encoding.
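For anyone who wants to see the mismatch without installing anything, here is a small Python sketch that decodes the JSON excerpt quoted in this issue with a standard JSON parser and diffs it line by line against the website rendering quoted in this issue. The `RAW` string is only the excerpt, closed by hand so that it parses. The real `licenseText` is much longer, and the variable names are mine.

```python
import difflib
import json

# Excerpt of the JSON quoted in this issue, closed by hand so that it parses.
# The real licenseText continues well past this point.
RAW = r'{"licenseText": "AFFERO GENERAL PUBLIC LICENSE\nVersion 1, March 2002 Copyright © 2002 Affero Inc. 510 Third Street - Suite 225, San Francisco, CA 94107, USA\n\n"}'

# json.loads turns the \n escapes into real newlines.
json_lines = json.loads(RAW)["licenseText"].splitlines()

# The same passage as the website renders it, as quoted in this issue.
website_lines = [
    "AFFERO GENERAL PUBLIC LICENSE",
    "Version 1, March 2002",
    "",
    "Copyright © 2002 Affero Inc.",
    "510 Third Street - Suite 225, San Francisco, CA 94107, USA",
]

# ndiff prefixes lines found only in the JSON with "- " and lines found only
# on the website with "+ "; lines starting with "? " are intraline hints.
for line in difflib.ndiff(json_lines, website_lines):
    print(line)
```

The title line comes out as common to both, and everything after it is flagged as different. Python's `json` module also decodes `\uXXXX` escapes such as `\u2007` directly, so no manual rewriting of escapes is needed on this route.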

I have no idea if this actually matters, of course. I am not a lawyer, this is not legal advice, etc. etc. etc. However, it seems that the generation of the JSON data from the XML masters may be dropping important formatting details, and it would not seem strange to me if a legal case, however frivolous-seeming, hinged on this difference, given how many cases have been decided on the presence or absence of commas.

This seems to have fellow issues in, but does not seem to be an exact duplicate of,

The reason why it does not seem to be an exact copy of #1924 is that it seems like all the data necessary to achieve a replication of the website's formatting is there in the XML, but not in the JSON, and that the checked-in test data seems to be derived from a JSONified-first form?

This could also be, say, an HTML vs. XML difference.

goneall commented 8 months ago

The text in the JSON file actually comes from a text file and not the XML.

For context, please refer to this pull request for the tool that generates the JSON and website from the XML and test data: https://github.com/spdx/LicenseListPublisher/pull/83

If the JSON data is incorrect, then the test data is incorrect.

BTW - there is a flag in the LicenseListPublisher tool to generate the JSON file from the XML instead of the test data. If we change the switch, it will reopen many issues raised in the above-mentioned pull request.

workingjubilee commented 8 months ago

Referencing the Wayback Machine archive for http://affero.org/oagpl.html on 2006-01-05 gives me this:

AFFERO GENERAL PUBLIC LICENSE

Version 1, March 2002

Copyright © 2002 Affero Inc.
510 Third Street - Suite 225, San Francisco, CA 94107, USA

From this HTML:

     <td width="99%" valign="Top" align="Center">
      <div align="Left">
      <p><b><big><big>AFFERO GENERAL PUBLIC LICENSE</big></big></b><br>
      </p>
      <p><big>Version 1, March 2002</big><br>
      <br>
                    Copyright © 2002 Affero Inc.<br>
                    510 Third Street - Suite 225, San Francisco, CA 94107,
 USA</p>

So yes, it seems that in this case:

Obviously, no one is really using the AGPL 1.0 for new work right now, indeed as far as I am aware it was never very popular, and then the AGPL 3.0 happened only a few years later. But that was why I chose it as an initial test case: it's fairly easy to reference its canonical version, and I had, at the time, figured its lack of popularity meant there wouldn't be as much dispute over its exact contents, which is an issue that plagues e.g. MIT, the various BSD-N-clauses, etc.