Open joshdk opened 7 years ago
On Sun, Dec 03, 2017 at 02:50:34AM -0800, Josh Komoroske wrote:
When testing tools against the test cases in this repo, it would be very convenient if the individual test case files were written in a format that can be easily unmarshalled, and enriched with relevant classification data.
This has come up before. Personally, I'd rather put it off until someone makes an argument for “I'd like to add $METADATA to support $USECASE”. Until we have someone arguing for a particular use case, it seems like more trouble than it's worth.
As an extension (and this may be better suited for its own request), it would be helpful to have examples with (post-templated) copyright messages. Having
Copyright (c) 2007 John Doe
somewhere inside the content body (instead of just `Copyright (c)`), and having that message as its own field in the training data for verification.
This may be one of the metadata use cases. However, do you really need metadata to test this? If we provide one example here with the canonical copyright information and another with an alternative (and similarly for other optional/variable fields), then if you match the license at all, you've presumably captured the right content from those fields.
When testing tools against the test cases in this repo, it would be very convenient if the individual test case files were written in a format that can be easily unmarshalled, and enriched with relevant classification data.
As an example, maybe the contents of
license-test-files/withid/WTFPL.json
could look like this:

This format would help to echo similar examples, such as those already found in spdx/license-list-data.
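As a rough sketch of how a tool might consume such a file: the field names `id` and `text` below are assumptions made for illustration only, since no schema has been agreed on here, and they are not taken from this repo or from spdx/license-list-data.

```python
import json

# Hypothetical field names; no schema has been agreed on in this issue.
REQUIRED_FIELDS = {"id", "text"}


def load_test_case(raw: str) -> dict:
    """Parse one JSON test case and check that the assumed fields are present."""
    case = json.loads(raw)
    if not isinstance(case, dict):
        raise ValueError("test case must be a JSON object")
    missing = REQUIRED_FIELDS - case.keys()
    if missing:
        raise ValueError(f"test case is missing fields: {sorted(missing)}")
    return case


# Placeholder content, not the real license text.
raw = json.dumps({"id": "WTFPL", "text": "placeholder license text"})
case = load_test_case(raw)
print(case["id"])
```

The point is only that a structured file lets the expected identifier travel with the text, instead of being inferred from the file name.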
As an extension (and this may be better suited for its own request), it would be helpful to have examples with (post-templated) copyright messages. Having
Copyright (c) 2007 John Doe
somewhere inside the content body (instead of just `Copyright (c) <year> <copyright holders>`), and having that message as its own field in the training data for verification.

This would be especially useful if a detection system is attempting to extract attribution data from licenses.
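To make the verification idea concrete, here is a minimal sketch of a check that pulls a filled-in copyright line out of a license body so it could be compared against an expected value. The regular expression and the `extract_copyright` helper are hypothetical, and they only handle the simple `Copyright (c) <year> <holder>` shape shown above.

```python
import re

# Hypothetical pattern: a line such as "Copyright (c) 2007 John Doe".
# An unfilled template line ("<year> <copyright holders>") does not match,
# because the year must be four digits.
COPYRIGHT_RE = re.compile(
    r"^[ \t]*Copyright \(c\) (?P<year>\d{4}) (?P<holder>.+?)[ \t]*$",
    re.MULTILINE,
)


def extract_copyright(text: str):
    """Return the year and holder of the first filled-in copyright line, or None."""
    match = COPYRIGHT_RE.search(text)
    if match is None:
        return None
    return {"year": int(match.group("year")), "holder": match.group("holder")}


filled = "Some License\n\nCopyright (c) 2007 John Doe\n\nPermission is granted.\n"
template = "Some License\n\nCopyright (c) <year> <copyright holders>\n"

print(extract_copyright(filled))
print(extract_copyright(template))
```

With the expected year and holder stored as their own fields, a test could compare them directly against the result of a tool's extraction step.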
Cheers!