spdx / license-test-files

Test files which can be used to check license scanners.

Include testing samples in an easily unmarshallable format #7

Open joshdk opened 7 years ago

joshdk commented 7 years ago

When testing tools against the test cases in this repo, it would be very convenient if the individual test case files were written in a format that can be easily unmarshalled, and enriched with relevant classification data.


As an example, maybe the contents of license-test-files/withid/WTFPL.json could look like this:

{
  "name": "MIT License",
  "licenseId": "MIT",
  "content": "/*\nMIT License\n\nCopyright...#include <nothing.h>..."
}

This format would help to echo similar examples, such as those already found in spdx/license-list-data.
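To illustrate, a consumer of such a file might look roughly like the sketch below. The field names `name`, `licenseId`, and `content` come from the example above; the `LicenseSample` class, the `load_sample` helper, and the inline sample text are hypothetical and do not reflect any existing layout in this repo.

```python
import json
from dataclasses import dataclass


@dataclass
class LicenseSample:
    """One test case: license text plus the classification a scanner should report."""
    name: str
    license_id: str
    content: str


def load_sample(raw: str) -> LicenseSample:
    """Parse a JSON test case that uses the field names proposed above."""
    data = json.loads(raw)
    return LicenseSample(
        name=data["name"],
        license_id=data["licenseId"],
        content=data["content"],
    )


# Hypothetical sample; a real test would read this text from a file in the repo.
raw = (
    '{"name": "MIT License", "licenseId": "MIT", '
    '"content": "/*\\nMIT License\\n\\nCopyright (c) 2007 John Doe\\n*/"}'
)
sample = load_sample(raw)
print(sample.license_id)
```

A scanner test could then run detection on `sample.content` and compare the result against `sample.license_id`, without any per-file parsing logic.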


As an extension (and this may be better suited for its own request), it would be helpful to have examples with (post-templated) copyright messages: that is, having Copyright (c) 2007 John Doe somewhere inside of the content body (instead of just Copyright (c) <year> <copyright holders>), and having that message as its own field in the training data for verification.

This would be especially useful if a detection system is attempting to extract attribution data from licenses.

Cheers!

wking commented 7 years ago

On Sun, Dec 03, 2017 at 02:50:34AM -0800, Josh Komoroske wrote:

When testing tools against the test cases in this repo, it would be very convenient if the individual test case files were written in a format that can be easily unmarshalled, and enriched with relevant classification data.

This has come up before. Personally, I'd rather put it off until someone makes an argument for “I'd like to add $METADATA to support $USECASE” [1]. Until we have someone arguing for a particular use case, it seems like more trouble than it's worth.

As an extension (and this may be better suited for its own request), it would be helpful to have examples with (post-templated) copyright messages: that is, having Copyright (c) 2007 John Doe somewhere inside of the content body (instead of just Copyright (c) <year> <copyright holders>), and having that message as its own field in the training data for verification.

This may be one of the metadata use cases. However, do you really need metadata to test this? If we provide one example here with the canonical copyright information and another with an alternative (and similarly for other optional/variable fields), then if you match the license at all, you've presumably captured the right content from those fields.
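A minimal sketch of that idea, assuming a matcher that treats the `<year>` and `<copyright holders>` placeholders from the original request as variable fields. The one-line `TEMPLATE`, the `template_to_regex` helper, and both sample lines are hypothetical; a real matcher would work on full license texts and would need to be more tolerant of whitespace and formatting.

```python
import re

# Hypothetical one-line template; real license templates are much longer.
TEMPLATE = "Copyright (c) <year> <copyright holders>"


def template_to_regex(template: str) -> re.Pattern:
    """Turn <placeholder> fields into named capture groups; escape the rest."""
    parts = re.split(r"<([^<>]+)>", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)  # literal template text
        else:
            group = re.sub(r"\W+", "_", part)  # "copyright holders" -> "copyright_holders"
            pattern += f"(?P<{group}>.+?)"
    return re.compile(pattern)


matcher = template_to_regex(TEMPLATE)

# One canonical example and one alternative, as suggested above.
canonical = matcher.fullmatch("Copyright (c) 2007 John Doe")
alternative = matcher.fullmatch("Copyright (c) 2015-2017 Jane Roe and contributors")

fields_canonical = canonical.groupdict()
fields_alternative = alternative.groupdict()
print(fields_canonical)
print(fields_alternative)
```

Under these assumptions, a successful match already yields the variable fields, so the two samples exercise the extraction without any extra metadata in the test files.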