spdx / LicenseListPublisher

Tool that generates license data found in the license-list-data repository from the license-list-XML source
Apache License 2.0

More useful information on expected-warnings #117

Open m1kit opened 3 years ago

m1kit commented 3 years ago

This file is useful not only during testing, but in the context of license matching.

Though the file owner is spdx/license-list-XML, I think the main user of the file is this repo. I'd like to leave some discussion here.

See the comments here for the details.

goneall commented 3 years ago

I like the idea of adding a more structured file for expected duplicates.

Since the LicenseListPublisher is used by only a small number of organizations, we don't need to worry too much about compatibility.

One question - should we: A) add an "expected duplicates" JSON file that this utility would check, generating no warnings for the duplicate licenses listed in it, or B) take a more general approach and change the expected warnings file to a JSON file that contains the expected duplicates along with other sections of expected warnings?

Approach A) may be more usable by other utilities, whereas B) makes more sense for the LicenseListPublisher.

I'm leaning toward A) to make the file format more usable to other utilities.

@m1kit - what do you think?
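
To make approach A) concrete, here is a minimal sketch of how a utility might consult a list of expected duplicates before emitting a warning. The class name `ExpectedDuplicates` and the method `isExpected` are hypothetical and not part of LicenseListPublisher; the sketch assumes the JSON file has already been loaded into groups of license IDs, and it needs Java 9 or later for `List.of` and `Set.of`.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of LicenseListPublisher.
class ExpectedDuplicates {
    // Each set is one group of license IDs that are known to match each other.
    private final List<Set<String>> groups;

    ExpectedDuplicates(List<Set<String>> groups) {
        this.groups = groups;
    }

    // True when both IDs appear in the same expected group; order does not matter.
    boolean isExpected(String idA, String idB) {
        for (Set<String> group : groups) {
            if (group.contains(idA) && group.contains(idB)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ExpectedDuplicates expected = new ExpectedDuplicates(
                List.of(Set.of("LGPL-2.1", "LGPL-2.1-only")));
        // A duplicate warning would be suppressed only when isExpected returns true.
        System.out.println(expected.isExpected("LGPL-2.1-only", "LGPL-2.1"));
        System.out.println(expected.isExpected("LGPL-2.1", "MIT"));
    }
}
```

Because the lookup is by group membership rather than by exact warning text, the file would not need one entry per ordering of the same pair.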

m1kit commented 3 years ago

Oh, I was also writing some similar ideas at the same time😂 Thanks anyway, @goneall !

I have three ideas in my mind.

JSON (Similar to your plan B)

One possible format is JSON like this

[{
  "type": "duplicated-license",
  "license-ids": [
    "LGPL-2.1",
    "LGPL-2.1-only"
  ],
  "prefer": "LGPL-2.1-only"
},
 {
  // more expected warnings here
}]

This format can accommodate future updates (new expected warning types). To keep the publisher simple, we could also add some data like:

[{
  "type": "duplicated-license",
  "license-ids": [
    "LGPL-2.1",
    "LGPL-2.1-only"
  ],
  "warnings": [
    "Duplicates licenses: LGPL-2.1, LGPL-2.1-only",
    "Duplicates licenses: LGPL-2.1-only, LGPL-2.1"
  ],
  "prefer": "LGPL-2.1-only"
}]

It's like a hybrid of your Plan A and B.

CSV (just another format of JSON)

Maybe it is not easy to parse JSON in Java.

We may store data in CSV format like... (but not flexible)

"message","from","to","prefer"
"Duplicates licenses: LGPL-2.1, LGPL-2.1-only","LGPL-2.1","LGPL-2.1-only","LGPL-2.1-only"
"Duplicates licenses: LGPL-2.1-only, LGPL-2.1","LGPL-2.1-only","LGPL-2.1","LGPL-2.1-only"

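As a rough illustration of how little code the CSV variant would need in Java, here is a sketch using only the standard library. The class name `ExpectedWarningsCsv` is hypothetical, and the parser assumes every field is wrapped in double quotes and contains no embedded quotes, which holds for the sample rows above but is not a general CSV parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; handles only fully quoted fields without embedded quotes.
class ExpectedWarningsCsv {
    static List<String[]> parse(String csv) {
        List<String[]> rows = new ArrayList<>();
        String[] lines = csv.split("\\R");
        // Start at 1 to skip the header row.
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.length() < 2) {
                continue; // ignore blank lines
            }
            // Drop the outer quotes, then split on the "," separator between fields.
            // Commas inside a message are followed by a space, so they are not split.
            String inner = line.substring(1, line.length() - 1);
            rows.add(inner.split("\",\"", -1));
        }
        return rows;
    }

    public static void main(String[] args) {
        String csv = "\"message\",\"from\",\"to\",\"prefer\"\n"
                + "\"Duplicates licenses: LGPL-2.1, LGPL-2.1-only\","
                + "\"LGPL-2.1\",\"LGPL-2.1-only\",\"LGPL-2.1-only\"\n";
        for (String[] row : parse(csv)) {
            System.out.println(row[0] + " -> prefer " + row[3]);
        }
    }
}
```
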
XML (Similar to your plan A)

I think the data here is related to obsoletedBys in license-list-XML. I wonder whether we could somehow define the similarity of templates in the XML.

Then we can pull data from XML and generate expected-warnings dynamically in a format specific to LicenseListPublisher.
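
A minimal sketch of that dynamic generation step, assuming the license IDs of one duplicate group have already been read from the XML. The class and method names are hypothetical; the message text follows the "Duplicates licenses: X, Y" form used in the examples in this thread, with one message for each ordering of each pair.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generator, not part of LicenseListPublisher.
class DuplicateWarningGenerator {
    // Builds one expected warning for every ordered pair of distinct IDs in the group.
    static List<String> expectedWarnings(List<String> licenseIds) {
        List<String> warnings = new ArrayList<>();
        for (String first : licenseIds) {
            for (String second : licenseIds) {
                if (!first.equals(second)) {
                    warnings.add("Duplicates licenses: " + first + ", " + second);
                }
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        for (String warning : expectedWarnings(List.of("LGPL-2.1", "LGPL-2.1-only"))) {
            System.out.println(warning);
        }
    }
}
```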

m1kit commented 3 years ago

I forgot to mention my preference.

I think adding some info to the XML is the best, if possible.

Or, we can put a generic list of expected duplicates in a separate file somewhere in license-list-XML and dynamically generate a file for this library.

goneall commented 3 years ago

Or, we can put a generic list of expected duplicates in a separate file somewhere in license-list-XML and dynamically generate a file for this library.

I like this idea as it would make the information more generally accessible and usable. We could replace the current expectedwarnings file with a "KnownDuplicates.xml".

Although I tend to like JSON better than XML due to readability, the fact that the license-list-XML repo is primarily XML format would favor the XML format over JSON.

We can update this library to read the XML file and process it directly.

I'm tempted to just remove the expected warnings functionality since it is only currently used for known duplicates.

I would like the XML to deserialize into a Java object using one of the standard libraries without too much effort. Here's what I'm thinking might work (although I would want to test this out in code before finalizing):


<expectedDuplicates>
    <duplicatedLicenseSet>
        <licenseIds>
            <licenseId>LGPL-2.1</licenseId>
            <licenseId>LGPL-2.1-only</licenseId>
            <licenseId>LGPL-2.1-or-later</licenseId>
        </licenseIds>
        <prefer>LGPL-2.1</prefer>
        <comment>The LGPL-2.1-only should be used if only the 2.1 version of the license is allowed, the LGPL-2.1-or-later should be used if version 2.1 or any later version may be used. If unsure which applies, the LGPL-2.1 identifier should be used</comment>
    </duplicatedLicenseSet>
</expectedDuplicates>
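
As a sketch of reading this structure into Java objects, the following uses the DOM parser that ships with the JDK rather than a binding library. The class names `KnownDuplicatesReader` and `DuplicateSet` are hypothetical, the element names are taken from the proposal above, and the code assumes the `prefer` element is closed with a matching `</prefer>` tag. It is untested against any real file in license-list-XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for the proposed expected-duplicates XML.
class KnownDuplicatesReader {
    static class DuplicateSet {
        final List<String> licenseIds = new ArrayList<>();
        String prefer;  // null when no <prefer> element is present
    }

    static List<DuplicateSet> read(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse expected duplicates XML", e);
        }
        List<DuplicateSet> result = new ArrayList<>();
        NodeList sets = doc.getElementsByTagName("duplicatedLicenseSet");
        for (int i = 0; i < sets.getLength(); i++) {
            Element setElement = (Element) sets.item(i);
            DuplicateSet set = new DuplicateSet();
            NodeList ids = setElement.getElementsByTagName("licenseId");
            for (int j = 0; j < ids.getLength(); j++) {
                set.licenseIds.add(ids.item(j).getTextContent().trim());
            }
            NodeList prefer = setElement.getElementsByTagName("prefer");
            if (prefer.getLength() > 0) {
                set.prefer = prefer.item(0).getTextContent().trim();
            }
            result.add(set);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<expectedDuplicates><duplicatedLicenseSet><licenseIds>"
                + "<licenseId>LGPL-2.1</licenseId><licenseId>LGPL-2.1-only</licenseId>"
                + "</licenseIds><prefer>LGPL-2.1</prefer>"
                + "</duplicatedLicenseSet></expectedDuplicates>";
        for (DuplicateSet set : read(xml)) {
            System.out.println(set.licenseIds + " prefer " + set.prefer);
        }
    }
}
```

A binding library would remove the manual element walking, but the DOM approach keeps the sketch free of extra dependencies.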

m1kit commented 3 years ago

Hi, I agree with "KnownDuplicates.xml" idea.

I'd like to work on this - introduce the file on license-list-XML. I have a few questions about how to do it.

goneall commented 3 years ago

I'd like to work on this - introduce the file on license-list-XML.

That would be great :)

I have a few additional suggestions on the file I've been thinking about - I'll add those as separate comments.

Do I have to write .xsd to define the schema?

A schema would be really nice to have for validating and even generating code.

If so, what is a recommended way to write a .xsd file?

There are a number of ways to create the XSD file. Since we need to change the Java application to use the XSD file, I have a suggested approach:

goneall commented 3 years ago

I would like to suggest we broaden the scope of the XML file to include other potential license issues which generate warnings in the LicenseRDFaGenerator. If we merge in PR #20 , there will be more expected warnings where the OSI approved flag doesn't match the OSI data.

I would like to name the file something different from "expectedwarnings" since I would like the file to be usable for a number of other purposes. Perhaps something like "KnownLicenseIssues.xml"?

goneall commented 3 years ago

I did some quick analysis of warning sources to see if we want to include any additional sections in the XML file for expected license issues.

The only one I think we should add is something to describe a list of license IDs where the OSI Approved flag doesn't match the OSI provided data (see PR #20 for context).

Below are other warnings which can be added as sections, but are not as likely to occur:

goneall commented 1 year ago

@m1kit - It's been a while for this issue - are you still interested in contributing? If not, I'll close the issue.