softwaresaved / fuji

FAIRsFAIR Research Data Object Assessment Service
MIT License

[Feature]: [FRSM-15] Does the software source code include licensing information for the software and any bundled external software? #15

Closed karacolada closed 6 months ago

karacolada commented 7 months ago

D5.2 p21, p30

Detailed Description

Clear software licensing enables reuse.

Generic comments

Each community may have different licences that are popular.

It is important that software licences are included with the source code as many tools and processes look for licensing information there to determine licence compatibility.

The SPDX License List is a widely used part of the Software Project Data eXchange (SPDX) open standard. Information about the licence for a piece of software can be provided either as a file in the source code repository, or as a short identifier embedded in the source code files.
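As an illustration of the "short identifier embedded in the source code files" option, here is a minimal sketch that looks for an `SPDX-License-Identifier:` tag near the top of a file. The function name `find_spdx_identifier` and the 30-line window are assumptions made for this example, not part of the metric definition.

```python
import re

# SPDX short-form tag, e.g. "# SPDX-License-Identifier: MIT"
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def find_spdx_identifier(source_text, max_lines=30):
    """Return the SPDX expression declared near the top of a file, or None.

    Hypothetical helper: the name and the line window are illustrative.
    """
    for line in source_text.splitlines()[:max_lines]:
        match = SPDX_TAG.search(line)
        if match is None:
            continue
        value = match.group(1).strip()
        # Drop a trailing block-comment closer if the tag sits inside one.
        for closer in ("*/", "-->"):
            if value.endswith(closer):
                value = value[: -len(closer)].strip()
        return value
    return None
```

The returned string is not validated against the SPDX License List here; that would be a separate step.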

CESSDA comments

CESSDA guidance on licence information is part of the guidelines on Standard Git Repository Contents. Further guidance is provided as part of the guidance on CMA2 - Intellectual Property.

Context

R1.1: Software is given a clear and accessible licence.

Possible Implementation

requirements: software source code, software
method: Check the software and its documentation for the presence of a licence
essential: The software includes its LICENCE file
important: The source code includes licensing information for all components bundled with that software
useful: The software licensing information is in SPDX format

CESSDA

requirements: software source code, software
method: Check that the LICENSE file exists. Check that the source code headers include a licensing statement.
essential: Include a LICENSE.txt file in the root of the repository.
important: Include licensing information in the source code header.
useful: The build script (Maven POM, where used) checks that the standard header is present in all source code files.
karacolada commented 7 months ago

The domain-specific assessment criteria don't match the general ones for "important" and "useful".

karacolada commented 7 months ago

What does "The source code includes licensing information for all components bundled with that software" mean really? Is it that all files should include licensing info or is it really about bundling, dependencies etc? I.e. is this information that I check by crawling multiple files or by reading the one central license file?

karacolada commented 7 months ago

Components = bundled dependencies

karacolada commented 7 months ago

Started developing a GitHub-specific harvester that checks for the license. Modified FAIREvaluatorLicense to consider that information as well.
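For context, a rough sketch of how license information might be read from a GitHub API response. It assumes the harvester has already fetched the JSON returned by GitHub's REST endpoint for a repository's license, which carries a `license` object with an `spdx_id` field. The function name is hypothetical and this is not the harvester's actual code.

```python
def license_from_github_payload(payload):
    """Extract an SPDX identifier from a GitHub license API response dict.

    Hypothetical helper. Returns None when GitHub could not identify the
    license, which it reports with the placeholder value "NOASSERTION".
    """
    info = payload.get("license") or {}
    spdx_id = info.get("spdx_id")
    if not spdx_id or spdx_id == "NOASSERTION":
        return None
    return spdx_id


# Usage with a hand-written sample payload (not real API output):
sample = {"name": "LICENSE", "path": "LICENSE", "license": {"key": "mit", "spdx_id": "MIT"}}
detected = license_from_github_payload(sample)
```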

karacolada commented 7 months ago

Switching between the domain-specific and the "reference"/general tests is done using metric YAMLs.

karacolada commented 6 months ago

Do the tests need to be inclusive downwards? I.e. should general-useful only pass if general-important has passed? Asking because the maturity level for the metric is simply the max.
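To make the difference concrete, a small sketch comparing the two readings. The numeric levels (1 = essential, 2 = important, 3 = useful) and both function names are assumptions for illustration only.

```python
def metric_maturity(passed_by_level):
    """'Simply the max': highest level that has a passing test."""
    return max((lvl for lvl, ok in passed_by_level.items() if ok), default=0)


def inclusive_maturity(passed_by_level):
    """'Inclusive downwards': highest level reached before any failure."""
    maturity = 0
    for level in sorted(passed_by_level):
        if not passed_by_level[level]:
            break
        maturity = level
    return maturity


# Essential and useful pass, important fails.
results = {1: True, 2: False, 3: True}
print(metric_maturity(results), inclusive_maturity(results))
```

With these results the first reading reports maturity 3 and the second reports maturity 1, which is the gap the question is about.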

karacolada commented 6 months ago

The title of the metric includes "bundled external software", but that doesn't match the CESSDA tests.

karacolada commented 6 months ago

Implemented CESSDA-specific essential test. I took it literally, as in it fails if there is no file named LICENSE.txt at the root of the repository. Do we really want this?
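A sketch of the literal reading next to a more lenient one, to show what the strict version rejects. The function name, the `strict` flag, and the set of accepted base names are assumptions, not what is implemented.

```python
from pathlib import PurePosixPath

# Base names accepted by the lenient variant (an illustrative choice).
LENIENT_NAMES = {"LICENSE", "LICENCE", "COPYING"}


def has_license_file(repo_paths, strict=True):
    """Check repository-relative paths for a license file at the root.

    strict=True  -> only a file named exactly LICENSE.txt counts.
    strict=False -> LICENSE, LICENSE.md, LICENCE.txt, COPYING, ... count.
    """
    for path in repo_paths:
        p = PurePosixPath(path)
        if len(p.parts) != 1:
            continue  # ignore anything below the repository root
        if strict:
            if p.name == "LICENSE.txt":
                return True
        elif p.stem.upper() in LENIENT_NAMES:
            return True
    return False
```

A repository with a plain `LICENSE` file, which is what GitHub creates by default, passes only the lenient variant.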

karacolada commented 6 months ago

Implemented CESSDA-specific useful test. It only checks Maven POM files and fails if the build script is not configured to fail on missing headers. Do we want to be this strict, or are we ok if people just use the plugin and (I assume) get warnings if the license headers are not present?

To add other kinds of build scripts, I would need to know more about whether and how they implement license header checking - seems like Pandora's box?
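A sketch of how the strict and lenient readings of the Maven check could be separated. The plugin `artifactId` (`license-maven-plugin`), the `check` goal, and the `failIfMissing` option are assumptions modelled on one commonly used plugin and should be verified against that plugin's documentation. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard Maven POM namespace; POMs without it will not match this sketch.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def pom_checks_license_headers(pom_xml):
    """Return (plugin_present, fails_on_missing) for a POM document string."""
    root = ET.fromstring(pom_xml)
    for plugin in root.iterfind(".//m:plugin", POM_NS):
        artifact = plugin.findtext("m:artifactId", default="", namespaces=POM_NS)
        if artifact.strip() != "license-maven-plugin":  # assumed plugin name
            continue
        goals = [(g.text or "").strip() for g in plugin.iterfind(".//m:goal", POM_NS)]
        # Assumed option name; treated as enabled when not set explicitly.
        flag = plugin.findtext(".//m:failIfMissing", default="true", namespaces=POM_NS)
        strict = "check" in goals and flag.strip().lower() == "true"
        return True, strict
    return False, False
```

The lenient reading would pass on `plugin_present` alone, while the strict reading also requires `fails_on_missing`.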

karacolada commented 6 months ago

Implemented CESSDA-specific important test. It utilises the build script test, assuming that if that passes, all source code files do have license headers.

The harvester checks for the main language and uses the GitHub Search API to look for code in that language. We then store the code of up to 5 of the found files. The license test then checks the region where it would expect the license header (first 30 lines of code) for the word "license". Is this ideal? Should we instead try and look for the license name found in test 1? But that makes the tests dependent on each other...
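The header check described above could be sketched roughly as follows. The function names are hypothetical, and requiring every sampled file to match (rather than any) is an assumption. The `term` parameter shows how the license name from test 1 could be swapped in.

```python
def header_mentions_license(source_text, max_lines=30, term="license"):
    """Case-insensitive search for `term` in the first `max_lines` lines."""
    header = "\n".join(source_text.splitlines()[:max_lines])
    return term.lower() in header.lower()


def sampled_files_have_headers(files, max_lines=30, term="license"):
    """`files` maps file names to their text; an empty sample fails.

    Assumption: all sampled files must match, not just one of them.
    """
    return bool(files) and all(
        header_mentions_license(text, max_lines, term) for text in files.values()
    )
```

Note that a plain substring search for "license" misses headers that use the spelling "licence".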

karacolada commented 6 months ago

Discussion

Some of these questions/discussion points are very specific, but I think talking through them will help guide future development.

  1. The maturity rating for a metric is determined by the highest maturity achieved across all tests. To me, it looks like if a test of maturity 3 has passed, the test for maturity 2 should also have passed. However, tests 2 and 3 of the general metric tests aren't really related. Are we ok with that or should tests be inclusive downwards?
  2. I'm super unclear on how to implement the general metric test 2. Any ideas?
  3. The title of the metric includes "bundled external software", but that doesn't match the CESSDA tests. They do not look into bundled external software.
  4. I find the CESSDA-1 test a bit too strict. GitHub license files don't have a TXT suffix.
  5. Is the CESSDA-3 implementation too strict?
  6. For CESSDA-3: To add other kinds of build scripts, I would need to learn more about whether and how they implement license header checking. Do we want to consider further build tools?
  7. CESSDA-2 looks for the word "license" in the first 30 lines of 5 "sampled" source code files. Is that enough files? Should we look for a different term, i.e. the license name?
karacolada commented 6 months ago
  1. Maturity 3 should mean that all tests of lower maturities have passed - or rather, we would expect this. Something the community needs to decide? Tests do NOT have to be inclusive downwards.
  2. Not easily automatically assessable.
  3. Flag: Is it important that bundled external software licensing is known to be FAIR? Sort of part of the principles maybe, but does this apply to software as well or just generally to digital objects?
  4. Drop it for now and check with CESSDA since it wasn't in the linked documentation.
  5. Read it as just checking, not failing, but flag with CESSDA about how strict they would like this?
  6. Nope. Working for one is good enough as proof of concept.
  7. Number of files: enough for the pilot, assuming that it would scale (without rate limits) by just increasing the number. Leave header check for now, with a note for later. Reasonably meaningful but might need a more explicit definition (and some prior investigation about how it's usually done). License identification is hard!
karacolada commented 6 months ago

Put a warning log into general-2 and check that the rest still passes as expected. Then close!