spdx / tools

SPDX Tools
Apache License 2.0

Field addition on json files #237

Closed tjasmith closed 4 years ago

tjasmith commented 4 years ago

Added the commons-validator dependency to validate URLs. The urlHelper class contains the functions used to determine whether a link is valid and whether it is still live.
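
A minimal sketch of what checks like these could look like, using only the JDK so that it stays self-contained. It is not the urlHelper class from this PR and it does not use commons-validator: the class name UrlCheckSketch, its method names, and the "host must contain a dot" rule (a crude stand-in for a real top-level-domain check) are all assumptions made for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper; a sketch, not the urlHelper class from the PR. */
final class UrlCheckSketch {

    /** Syntactic check only: http(s) scheme plus a host with a domain suffix. */
    static boolean isSyntacticallyValid(String url) {
        if (url == null) {
            return false;
        }
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return false;
            }
            boolean httpScheme = scheme.equalsIgnoreCase("http")
                    || scheme.equalsIgnoreCase("https");
            // Assumption: require a dot inside the host, so "invalidURL" is rejected.
            int lastDot = host.lastIndexOf('.');
            boolean hasDomainSuffix = lastDot > 0 && lastDot < host.length() - 1;
            return httpScheme && hasDomainSuffix;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Live check: needs network access; treats errors and 4xx/5xx as dead. */
    static boolean isDead(String url, int timeoutMillis) {
        if (!isSyntacticallyValid(url)) {
            return true;
        }
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            int status = conn.getResponseCode();
            conn.disconnect();
            return status >= 400;
        } catch (IOException e) {
            return true;
        }
    }
}
```

Under these assumed rules, UrlCheckSketch.isSyntacticallyValid("http://invalidURL") returns false and the landley.net URL returns true. The result of isDead depends on the network at the time of the call, which is why it is kept separate from the syntactic check.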

I tested this locally by installing the tools jar into my local Maven repository with the command: mvn install:install-file -Dfile=spdx-tools-2.2.2-SNAPSHOT-jar-with-dependencies.jar -DgroupId=org.spdx -DartifactId=spdx-tools -Dversion=2.2.2 -Dpackaging=jar

Then I built the LicenseList publisher repo and ran: java -jar licenseListPublisher-2.1.22-jar-with-dependencies.jar LicenseRDFAGenerator /path_to_dir_containing_xml_licenses /path_to_dir_where_generated_data_should_be_saved

Here is an example from the generated JSON files:

  "licenseId": "0BSD",
  "seeAlso": [
    {
      "isValid": true,
      "isWayBackLink": true,
      "extraText": true,
      "isMatch": true,
      "url": "http://landley.net/toybox/license.html",
      "isDead": false
    },
    {
      "isValid": true,
      "isWayBackLink": true,
      "extraText": true,
      "isMatch": true,
      "url": "http://landley.net/toybox/license2.html",
      "isDead": true
    },
    {
      "isValid": false,
      "isWayBackLink": true,
      "extraText": true,
      "isMatch": true,
      "url": "http://invalidURL",
      "isDead": true
    },
    {
      "isValid": true,
      "isWayBackLink": true,
      "extraText": true,
      "isMatch": true,
      "url": "https://threejs.org/examples/webgl_objconvert_test.html",
      "isDead": true
    }
  ],
  "isOsiApproved": true

@goneall Please have a look.

goneall commented 4 years ago

After seeing the code, I think we may end up breaking some of the existing code with the changes to the data structure.

There are a couple of solutions that come to mind. We could keep the existing seeAlso array as is and add a parallel data structure with all of the details. I'm not sure what the best name would be, perhaps seeAlsoUrlDetails. I think it would be OK to duplicate the URL in the details data structure.
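
A rough sketch of the parallel-structure idea, assuming the seeAlsoUrlDetails name floated above. The class names LicenseSketch and SeeAlsoUrlDetailSketch are hypothetical and are not the classes in this repo; the boolean fields are a subset of those in the example JSON earlier in the thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical per-URL record; field names follow the example JSON in this thread. */
final class SeeAlsoUrlDetailSketch {
    final String url;
    final boolean isValid;
    final boolean isDead;
    final boolean isWayBackLink;
    final boolean isMatch;

    SeeAlsoUrlDetailSketch(String url, boolean isValid, boolean isDead,
                           boolean isWayBackLink, boolean isMatch) {
        this.url = url;
        this.isValid = isValid;
        this.isDead = isDead;
        this.isWayBackLink = isWayBackLink;
        this.isMatch = isMatch;
    }
}

/** Hypothetical license holder: the old list is untouched, details sit beside it. */
final class LicenseSketch {
    // Existing shape: plain URL strings, so current readers keep working.
    private final List<String> seeAlso = new ArrayList<>();
    // New parallel structure; the URL is duplicated here on purpose.
    private final List<SeeAlsoUrlDetailSketch> seeAlsoUrlDetails = new ArrayList<>();

    void addSeeAlso(SeeAlsoUrlDetailSketch detail) {
        seeAlso.add(detail.url);
        seeAlsoUrlDetails.add(detail);
    }

    List<String> getSeeAlso() {
        return Collections.unmodifiableList(seeAlso);
    }

    List<SeeAlsoUrlDetailSketch> getSeeAlsoUrlDetails() {
        return Collections.unmodifiableList(seeAlsoUrlDetails);
    }
}
```

Code that only knows about getSeeAlso() still receives a list of strings, while newer code can opt in to getSeeAlsoUrlDetails().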

goneall commented 4 years ago

The SPDX tools should only read the data that is stored in the license list JSON files. Any production of the actual data should be done by the LicenseListPublisher. The LicenseListPublisher will be run on each release of the license list, which should be frequent enough. Since the tools library may be invoked frequently, we want it to just read the stored data produced by the LicenseListPublisher for performance reasons.

We still need to change this library to be able to read the data and store it in the license object, but the code should just be reading the JSON assuming the fields are already there.

The writeLicenseList method will generate the JSON as well as other formats. This would be a good place to create the data and store it in the JSON file.

We can change the writeLicense method in the ILicenseFormatWriter interface to take a data structure with the additional URL details collected.

This would require changes to the other formats (e.g. RDF), but we probably should include the information in those serialization formats anyway.
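
To make the proposed interface change concrete, here is a simplified sketch. The real writeLicense signature in ILicenseFormatWriter is not shown in this thread, so the interface below (LicenseFormatWriterSketch), its parameters, and the hand-rolled JSON output are all assumptions; the point is only that the writer receives details computed earlier by the publisher and serializes them without doing any URL checks itself.

```java
import java.util.List;

/** Hypothetical per-URL details computed by the publisher; not the real SPDX class. */
final class UrlDetailSketch {
    final String url;
    final boolean isValid;
    final boolean isDead;

    UrlDetailSketch(String url, boolean isValid, boolean isDead) {
        this.url = url;
        this.isValid = isValid;
        this.isDead = isDead;
    }
}

/** Simplified stand-in for a format writer; the real interface has a different signature. */
interface LicenseFormatWriterSketch {
    String writeLicense(String licenseId, List<UrlDetailSketch> urlDetails);
}

/** One possible writer: emits seeAlso plus a parallel seeAlsoUrlDetails array. */
final class JsonWriterSketch implements LicenseFormatWriterSketch {
    @Override
    public String writeLicense(String licenseId, List<UrlDetailSketch> urlDetails) {
        StringBuilder seeAlso = new StringBuilder();
        StringBuilder details = new StringBuilder();
        for (int i = 0; i < urlDetails.size(); i++) {
            UrlDetailSketch d = urlDetails.get(i);
            String sep = i == 0 ? "" : ",";
            seeAlso.append(sep).append(quote(d.url));
            details.append(sep)
                    .append("{\"url\":").append(quote(d.url))
                    .append(",\"isValid\":").append(d.isValid)
                    .append(",\"isDead\":").append(d.isDead)
                    .append("}");
        }
        return "{\"licenseId\":" + quote(licenseId)
                + ",\"seeAlso\":[" + seeAlso + "]"
                + ",\"seeAlsoUrlDetails\":[" + details + "]}";
    }

    // Minimal escaping for backslashes and double quotes only.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

A writer for another format, such as RDF, would implement the same method and decide for itself how to represent the details.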

tjasmith commented 4 years ago

@goneall I don't quite get your point.

You want the data structure to be added on the LicenseList publisher repo, and that the spdx-tools just takes into consideration the updated license object and write the result to the format required?

goneall commented 4 years ago

> You want the data structure to be added on the LicenseList publisher repo, and that the spdx-tools just takes into consideration the updated license object and write the result to the format required?

The data structures can be added in the tools and referenced in the LicenseListPublisher, but filling in the values should be moved to the LicenseListPublisher (e.g. the calls to validate the URL and do text matching).

A bit more context: the LicenseListPublisher tool takes as input the data in the License-List-XML repo and generates the files in the license-list-data repo as well as the SPDX license list website. The SPDX tools take this data and provide programmatic access to it for Java tools.

Since we're adding fields to the License class, we'll need to update this repo to allow access to the additional fields. We'll also need to update the classes that read the field information from the License JSON files.

Most of the work, however, will be in the LicenseListPublisher since that is where we want to do all the calculations for the field values. The reason to put this in the LicenseListPublisher rather than the SPDX Tools is that we don't want to pay the performance penalty or require online access when reading the license information.

@tjasmith We can set up a call today or tomorrow if you would like more information.

tjasmith commented 4 years ago

@goneall If there is any other thing I don't understand, I will let you know.