spdx / license-list-XML

This is the repository for the master files that comprise the SPDX License List

Consider differentiating customizable text from simple variation-handling? #2606

Open ferdnyc opened 1 week ago

ferdnyc commented 1 week ago

Uses of <alt>

Currently, the <alt> tag is used in two fundamentally different capacities.

As a field to mark customizable text

Many of the license texts include customizable details related to the project being licensed, like project name, copyright statements, maintainer or rightsholder addresses, etc.

These are frequently wrapped in an alt tag that will match anything (match=".*"), although a few have more specific matching patterns. A variable name is always specified (name="something") to capture the matched string.

For example, in BSD-2-Clause.xml, the text specifying "THE COPYRIGHT HOLDER(S) (AND|OR) CONTRIBUTORS" is made customizable in two places, captured as copyrightHolderAsIs and copyrightHolderLiability:

https://github.com/spdx/license-list-XML/blob/9269d7211fba83092697d5211c5f81988222ec84/src/BSD-2-Clause.xml#L30

https://github.com/spdx/license-list-XML/blob/9269d7211fba83092697d5211c5f81988222ec84/src/BSD-2-Clause.xml#L33

In Python-2.0.1.xml, the specific Python version for which the license applies is similarly captured in a number of places, using a more specific regular expression:

https://github.com/spdx/license-list-XML/blob/9269d7211fba83092697d5211c5f81988222ec84/src/Python-2.0.1.xml#L35

https://github.com/spdx/license-list-XML/blob/9269d7211fba83092697d5211c5f81988222ec84/src/Python-2.0.1.xml#L84

To support minor variations in license texts

Other uses of alt tags aren't free-form/customizable at all, but merely prevent slight variations in license text from causing a failure to match the license. Going back to BSD-2-Clause.xml, the word "EXPRESS" is surrounded by an alt tag not because it's customizable, but simply because some versions of the text contain "EXPRESSED" instead of "EXPRESS":

https://github.com/spdx/license-list-XML/blob/9269d7211fba83092697d5211c5f81988222ec84/src/BSD-2-Clause.xml#L31

The same is true in xpp.xml, where "University" may be misspelled as "Univeristy", and where a certain conjunction may be either "and" or "or" (but nothing else):

https://github.com/spdx/license-list-XML/blob/9269d7211fba83092697d5211c5f81988222ec84/src/xpp.xml#L40
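
To make the contrast between the two uses concrete, here is a minimal sketch. It assumes simplified <alt> fragments modeled on the cases above; the fragments, the "express" variable name, and the exact patterns are illustrative and not copied from the repository files. The accepts helper is hypothetical and only shows how differently the two kinds of pattern behave when matched.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragments modeled on the two kinds of <alt> usage
# described above; the patterns are illustrative, not copied from the repo.
FREE_FORM = (
    '<alt match=".*" name="copyrightHolderAsIs">'
    "THE COPYRIGHT HOLDERS AND CONTRIBUTORS</alt>"
)
VARIATION = '<alt match="EXPRESS|EXPRESSED" name="express">EXPRESS</alt>'


def accepts(alt_xml: str, candidate: str) -> bool:
    """Return True if `candidate` fully matches the element's match pattern."""
    alt = ET.fromstring(alt_xml)
    return re.fullmatch(alt.get("match"), candidate) is not None


print(accepts(FREE_FORM, "ACME CORP AND FRIENDS"))  # free-form field
print(accepts(VARIATION, "EXPRESSED"))              # known spelling variant
print(accepts(VARIATION, "IMPLIED"))                # not a listed variant
```

Both fragments are the same element type, so nothing in the markup itself says which one is meant as a fill-in field and which one is only tolerating a spelling difference.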

Other tags handled as replaceable

The same ambiguity applies to other tags, e.g. <copyrightText> and <bullet>, which are presented identically as red replaceable text despite having different purposes.

It makes sense for a project to customize the copyright text of its license as needed, so <copyrightText> can fairly be treated like the first category of <alt> tags above.

But the bullets used in the license are more akin to the second type of <alt> tag above, in that there are a fairly limited set of possibilities for what might be found in their place. The 1. before the first clause in BSD-2-Clause.xml might be replaced with 1), or 1 —, or even nothing if the list is numbered automatically, but it probably shouldn't be replaced with 45) or apple).

Presentation of <alt>

Software doesn't care about the purpose of a given <alt> tag; for the purposes of matching, the information that it's replaceable is sufficient. But to humans, the implications of the two types of "replaceable" text are unlikely to be the same. And because these two very different situations are handled the same way in the code/data, they're also presented the same way on the website. All replaceable text is presented in red, so the handling of variations appears to indicate that unexpected bits of text are customizable or free-form, when in fact they're not.

In the display of BSD-2-Clause.xml, for example, it seems potentially confusing for "EXPRESS" to be shown in the same red text as "THE COPYRIGHT HOLDERS AND CONTRIBUTORS" — at least, without also providing some explanation for why and how someone would "customize" the word EXPRESS:

[Screenshot: rendered BSD-2-Clause license text, with "EXPRESS" shown in red]

Since optional and replaceable texts are indicated to humans by coloring the text blue or red, respectively, it's presumably of some value to highlight those locations. But if freely customizable areas of the text ("there could be anything here") and areas where only a very limited set of options will pass are all presented the same way, that value seems somewhat reduced.
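
One way the two categories might be told apart without changing the XML is a heuristic on the match pattern itself. The sketch below is only an assumption about how that could work: the probe strings, the function names, and the "customizable"/"variation" labels are all hypothetical, and a pattern that happens to accept both probes would be misclassified.

```python
import re

# Arbitrary probe strings (an assumption of this sketch): text that no
# narrowly scoped variation pattern would plausibly accept.
PROBES = ("zq7 Xk-42", "Example Org, Inc. <legal@example.org>")


def is_free_form(pattern: str) -> bool:
    """Heuristically treat a pattern as customizable if it accepts every probe."""
    return all(re.fullmatch(pattern, probe) for probe in PROBES)


def display_kind(pattern: str) -> str:
    """Pick a hypothetical display category for a match pattern."""
    return "customizable" if is_free_form(pattern) else "variation"


for pattern in (".*", "EXPRESS(ED)?", "and|or"):
    print(pattern, "->", display_kind(pattern))
```

An explicit attribute or a separate tag in the XML would be more reliable than guessing from the regex, but a heuristic like this could give a first approximation of which fields fall into which group.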

swinslow commented 1 week ago

Hi @ferdnyc, thanks for your detailed thoughts here!

I agree, and this is something that has been sitting in the back of my mind for some time now. The specific variations encoded in the regular expressions for <alt> tags are important, and I understand that some downstream projects (such as Fedora) are handling these.

But I suspect that most people aren't seeing the regexes from this repo, or from license-list-data, and are instead just viewing the website versions at https://spdx.org/licenses. And as you noted, nothing in that HTML view clearly indicates whether the red text for a given <alt> tag (or <bullet>, etc.) is "replace with anything" or "replace with these specific characters."

I haven't had a chance to dig into this, but I'm certainly open to us coming up with a cleaner solution. Here are a few ideas; feel free to share others:

  1. When a red text field is hovered over, it could pop up a tooltip showing the different regex values
  2. When a red text field is clicked on, it could open a view (maybe something other than a tooltip?) showing the regex
  3. And/or, each HTML page could include a link directly to the corresponding XML file in license-list-data

(Option 3 is probably a good idea, regardless of whether we also do 1 or 2)
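
As a rough illustration of option 1, the regex could be carried in a title attribute so that browsers show it on hover. This is only a sketch in Python of the idea: the render_alt function, the "replaceable-text" class name, and the "Matches:" wording are hypothetical and do not reflect the existing templates or publisher code.

```python
from html import escape


def render_alt(default_text: str, pattern: str,
               css_class: str = "replaceable-text") -> str:
    """Render replaceable text as a span whose tooltip shows the match regex."""
    tooltip = f"Matches: {pattern}"
    return (
        f'<span class="{escape(css_class, quote=True)}" '
        f'title="{escape(tooltip, quote=True)}">'
        f"{escape(default_text)}</span>"
    )


print(render_alt("EXPRESS", "EXPRESS(ED)?"))
```

Escaping matters here because match patterns can contain quotes, angle brackets, and ampersands, any of which would otherwise break the attribute or the surrounding markup.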

The code that generates the License List website is available at https://github.com/spdx/licenseListPublisher. Specifically, https://github.com/spdx/LicenseListPublisher/tree/master/resources/htmlTemplate contains the corresponding HTML templates, if there are suggested edits you'd like to propose.