iandotkelly / nlf

Node License Finder
MIT License

README / file checks too loose #27

Closed myndzi closed 8 years ago

myndzi commented 9 years ago

The presence of text such as 'MIT' is a pretty sloppy way to identify a license, since it doesn't actually compare against meaningful content. It wouldn't be as big a problem comparing against a file that is explicitly expected to be a license file, e.g. LICENSE, but it is a problem when comparing against files containing arbitrary text, such as README (I found just now that due to the examples given in my readme, nlf detects about four different licenses, most of them inaccurate).

I do realize that doing a content comparison is a rather trickier thing to accomplish here, but it should be considered and preferably implemented at some point. I believe I have code that may help from last time I was working on this; I was attempting to apply the actual spdx definitions in a generic way. I will see if I can dig it out for possible inclusion.
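To make the failure mode concrete, here is a deliberately simplified sketch of keyword-style detection (this is illustrative only, not nlf's actual code; `looseDetect` and the sample readme are invented):

```javascript
// Hypothetical simplification of a loose keyword check, for illustration only:
// matching a bare license identifier as a word fires on ANY mention of it.
function looseDetect(text) {
  const ids = ['MIT', 'ISC', 'WTFPL', 'GPL'];
  return ids.filter((id) => new RegExp('\\b' + id + '\\b').test(text));
}

// A readme that merely *discusses* licenses triggers multiple detections:
const readme = 'Examples: pass "MIT" or "ISC" as the --licenses flag. ' +
               'This package is NOT released under the WTFPL.';
console.log(looseDetect(readme)); // fires on all three mentions
```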

iandotkelly commented 9 years ago

Duplicate of #7 and #16

Please give examples of your false positives, since it is looking for (for example) 'MIT', capitalized, with surrounding whitespace. It's not foolproof by any means, but a small rate of false positives is better than missing a license entirely.

It's a major undertaking doing this. I do have an intention to try it out, but it would need to implement some sort of sliding search window over candidate files, running a string-difference match, and setting some sort of reliable threshold. I have no idea how performant it would be without running tests.

iandotkelly commented 9 years ago

@myndzi - to answer the question of how quick content comparison is: I ran a sliding window over my readme file, which isn't that long, looking for a typical MIT license text using Levenshtein distance matching. The sliding window advanced 10 characters at a time down the file. It took 8 seconds to process one file against one license. For a project of any decent size, with a decent number of licenses, this will take some time.

The lowest match score was 35 characters... a lower score is better; 35 means 35 characters differ between the sliding window and the license text. Increasing the step of the sliding window to 30 characters raised the distance to 53 characters, but the scan of one file dropped to 2.5 seconds.

If you look at the data, the distance is around ~900 characters for random text against the MIT license, dropping to ~50 when the window reaches the license in my file.

It might be usable - it might not.
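For reference, the experiment above can be sketched roughly like this (my own minimal reconstruction, not the code actually run; `levenshtein` and `bestWindowDistance` are illustrative names):

```javascript
// Rough sketch of the sliding-window experiment: the window advances `step`
// characters at a time, and the lowest Levenshtein distance over all windows
// is taken as the match score (lower is better).

function levenshtein(a, b) {
  // classic two-row dynamic-programming edit distance
  let prev = Array.from({ length: b.length + 1 }, (_, i) => i);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      cur[j] = Math.min(
        prev[j] + 1,                                   // deletion
        cur[j - 1] + 1,                                // insertion
        prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
    prev = cur;
  }
  return prev[b.length];
}

function bestWindowDistance(text, license, step) {
  let best = Infinity;
  for (let pos = 0; pos + license.length <= text.length; pos += step) {
    const window = text.slice(pos, pos + license.length);
    best = Math.min(best, levenshtein(window, license));
  }
  return best;
}
```

A smaller `step` gives a better (lower) best score but costs proportionally more comparisons, which matches the 10-character vs 30-character trade-off described above.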

myndzi commented 9 years ago

Levenshtein distance is not reliable here, either. It is important that the substantive text of a license is matched entirely; a false positive is indeed much worse than a false negative. Consider:

A false negative tells you "I don't know -- you'll have to look yourself". Then you know there may be a problem, you look, and you resolve it.

A false positive tells you "I've determined that this package uses this license". You don't know there's a problem, you don't look, and now you have a problem without knowing it.

On the SPDX site there is a repository with a list of license templates and a set of matching rules that have been considered for legal/substantive value. I encourage you to look into utilizing this for matching, and to remove the non-substantive matching that currently exists.

The SPDX templates have their own set of problems, unfortunately; the project seems to want to define its matching rules in human-friendly, hard-to-program terms rather than curate its template database to be easy to process programmatically. This is the problem I've run into with the license-matching code I've got. For example, the Apache 2.0 license often fails to match because the template in the SPDX database includes a trailer/appendix that is optional in practice but is not marked optional. Their matching rules say something like "ignore this kind of thing", which is unhelpful, since it's difficult to determine programmatically what is safe and meaningless to ignore.

I've been considering writing some sort of manually curated patch to apply over the SPDX database to clean up their license templates into something more useful.

In some testing against the top 6 pages of 'most depended' modules on npm, only one package so far could neither be determined from package.json nor matched against the SPDX templates, and that was the Apache case. I will push my work in progress to a repository so you can examine it if you like.

myndzi commented 9 years ago

To answer your question, here's an example: https://www.npmjs.com/package/node-license-validator

The phrases 'ISC', 'MIT', and 'WTFPL' are all mentioned in the readme - not in the context of "this package is released under this license" - yet all are detected (this is what brought the issue to my attention). At the very least, an approach like node-licensecheck's, where only a section of the readme titled 'License' is examined, is warranted. In the worst case, you could catch text where the author explains why they are NOT using some license, and the package would be detected as using the very license they explicitly rejected.
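A section-scoped check along those lines could look like this (a hypothetical sketch of the idea, not node-licensecheck's actual implementation):

```javascript
// Hypothetical sketch of a section-scoped check: only examine text under a
// markdown heading named "License"/"Licence" (any heading level).
function licenseSection(markdown) {
  const lines = markdown.split('\n');
  const out = [];
  let inSection = false;
  for (const line of lines) {
    const h = /^(#+)\s*(.*)$/.exec(line);
    if (h) {
      // entering a new section; keep collecting only if it is a license heading
      inSection = /^licen[cs]e/i.test(h[2].trim());
      continue;
    }
    if (inSection) out.push(line);
  }
  return out.join('\n').trim();
}

const readme = [
  '# my-package',
  'Pass "MIT" or "ISC" to the validator.',
  '## License',
  'Released under the WTFPL.',
].join('\n');
console.log(licenseSection(readme)); // only the text under "## License"
```

Note the top-of-readme mentions of 'MIT' and 'ISC' are never seen by the matcher, which is exactly the false-positive class this avoids.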

The repo for the license matcher is here: https://github.com/myndzi/spdx-license-matcher

You should be able to npm install in the test directory and run node findem to scan everything in test/node_modules. It will print packages for which it didn't find any license.

There's currently a package.json pre-check that takes precedence over file-matching (I was trying to see how many modules with hard to match licenses were not covered by package.json).

iandotkelly commented 9 years ago

@myndzi

Yeah - it's not as simple as saying "false positives are bad", i.e. flagging licenses that don't apply. The original requirement that drove the development of nlf was to find GPL licenses in a project for a client who would have refused the application if it included that license - therefore more permissive matching was required.

The idea behind nlf is to find the potential files where license information is found, not specifically to be an auditing tool. If you can present a lawyer with a list of files to look at, to help their assessment of a project, that is more useful than an unreliable automated audit.

I'm happy for nlf to have more features to make it more amenable to audit - but I think clear requirements are needed.

iandotkelly commented 9 years ago

You say that distance matching isn't going to be good enough for an automated audit, and you are probably correct... but (as you say) the SPDX matching guidelines are unfortunately not particularly amenable to automation without considerable cleanup and markup of the license text data.

So we are at a bit of an impasse. Your approach - don't report a license until you are totally sure it is there - is not, I believe, achievable with the SPDX data.

My approach - report the likelihood of a license - makes it difficult to use as an auditing tool if the user is looking for the presence of approved licenses rather than the presence of unapproved ones.

I think that reporting file paths and difference-matching scores is the most likely option in the circumstances. The user of an auditing tool can then decide what confidence level is good enough.

myndzi commented 9 years ago

So what I've got as of now matches all the unambiguous license/readme files of the top 180 dependencies on npm. There are 14 ambiguous or freeform references that cannot be easily extracted, but they are covered by package.json declarations.

I've made a few changes to my approach:

If the license content doesn't match, it tries to match known URLs of license references; this covers cases where something like README.md has a section that just says:

```
# License
Copyright 2015 Some guy
http://license-url/
```

It's broader than I'd like: the markdown section-finder falls back to the full markdown text if no license section is found, so if a license URL happens to be mentioned somewhere in a readme that has no license section, it could produce a false positive. With some refactoring I should be able to limit the URL search to apply only to "found" license sections.

I have a list of additional license URLs that aren't included in the SPDX list, but are used in various packages, to augment the url-matching function.
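The URL fallback described above might be sketched like this (the table and function names here are illustrative; the real matcher would use the SPDX URL data plus the additions mentioned):

```javascript
// Hypothetical sketch of URL matching: if no license text matches, look for
// known license URLs in the (ideally license-section-only) text.
// This URL-to-identifier table is illustrative, not the SPDX data set.
const LICENSE_URLS = {
  'opensource.org/licenses/MIT': 'MIT',
  'www.gnu.org/licenses/lgpl': 'LGPL-3.0',
  'www.apache.org/licenses/LICENSE-2.0': 'Apache-2.0',
};

function matchLicenseUrls(text) {
  const found = new Set();
  for (const [url, id] of Object.entries(LICENSE_URLS)) {
    // case-insensitive substring match so http/https and casing don't matter
    if (text.toLowerCase().includes(url.toLowerCase())) found.add(id);
  }
  return [...found];
}
```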

I have included templates for the "short-form" license declaration as given in the end of licenses such as Apache-2.0, which are not included in the SPDX list.

I have included overrides for the BSD-3-Clause and ISC licenses that match alternate phrasings I encountered, particularly in the BSD license where the phrase "neither the name of X nor the names of the contributors" sees wide variance.

I have not (yet?) included an override template for the MIT license, where "ARISING FROM" was left out in one sample. (This sample contained a valid package.json license reference). I'm uncertain if this is because the sample copied a previous version of the text, or if it's a mistake or intentional alteration.

I've verified by sight that the text removed from the header and footer of the license templates is not substantive, with the exception of the APL 1.0, which is an uncertain case. It contains "THE LICENSE TERMS END HERE (OTHER THAN AS SET OUT IN EXHIBIT A)", followed by "EXHIBIT A", which is a form-style layout to, presumably, supply the contributor information. This part would need to be templated/wildcarded out anyway, and does not contain legal text, so I think it's OK to remove.

All in all, I think the results given are strong positives, yet can accommodate a number of variants in formatting, styling, and, to a small extent, content. I'd be interested in cloning the entire npm registry, if I could sort by something like popularity, and running over a larger sample set.

Pushing updated code for spdx-license-matcher shortly.

myndzi commented 9 years ago

Re: clear requirements

I don't think being stricter with the matching of licenses is contradictory to the "find unwanted licenses" approach; in the worst case, there might be more unidentified licenses, which require manual inspection. In the best case, there is no difference.

Compare this with the opposite, where a positive determination is given; this mistaken affirmative could potentially even occur in a project with a "blacklisted" license such as the GPL. It would give a false sense of security and leave the user open to potential problems later down the line.

(Though, I think, you can use GPL code and it won't "infect" your code with its license; it's only modifying the GPL'd code itself that this applies to?)

Either way, I think the only useful clear requirement for a license-finding utility is that it should be confident in the data it reports, or else it should report to you which components it was not confident about and you should manually investigate.

I don't really have a problem if we wind up differing in opinion here; I've pretty much got enough code by now to replace nlf entirely, and if it makes more sense / if you would prefer, I can simply include my license-matching code in my project and remove nlf as a dependency. The main reason I opened an issue here is because nlf seems to be in pretty widespread use, and I feel it would be better overall to improve the existing code base than to split and do my own thing.

myndzi commented 9 years ago

I wonder what happens in the case of a package like 'sugar', where the package.json claims the MIT license, but the LICENSE file contains a modified and possibly substantively different text...

iandotkelly commented 9 years ago

@myndzi I'll give a full answer to this all later, hopefully with a code example of what you might need.

Ideally I'd like to maintain nlf as a key module in the area of license management, and I think it beneficial to both projects if we have a good consistent solution.

myndzi commented 9 years ago

Yeah, sorry for brain dumping here. Just seems relevant and helpful for the time being.

I added another 120 of the top depended-upon packages to my test set and tweaked the BSD license again to match a couple of weird cases. I'm starting to think the best bet would be to simply hand-craft some regular expressions for the common licenses and ditch SPDX altogether. At the rate BSD gets permuted, who knows whether all the other licenses that nobody seems to use will ever match the one time they come up?

I ran a test comparison of claimed package.json licenses vs detected file content licenses with interesting results. Most of the mismatches are:

I found some more insanity cases too though: https://github.com/jindw/xmldom/blob/master/package.json#L21 (what's an array for again?) https://github.com/jrburke/amdefine/blob/master/package.json#L7 (invalid expression - no parens; but why would you 'and' these two licenses anyway?)

The xmldom one is particularly interesting because I got this result:

```
{ searchDir: '~/kris.re/spdx-license-matcher/test/node_modules/xmldom',
  'package.json': [ 'LGPL' ],
  files: [ 'MIT' ] }
```

(Turns out there is no SPDX identifier for "LGPL", so the extra URL I added wasn't being merged into the data set. Fixed, and now it says files: ['LGPL-3.0', 'MIT'].)
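For what it's worth, coping with those package.json shapes can be sketched like this (a hypothetical normalizer written for this discussion, not spdx-license-matcher's actual code; the handled variants are a guess at what appears in the wild):

```javascript
// Hypothetical sketch: normalize the many shapes the license field takes in
// the wild - a plain string, an object with a `type`, or (as in the xmldom
// case above) an array of such objects, under either `license` or the
// deprecated `licenses` key.
function declaredLicenses(pkg) {
  const raw = pkg.license !== undefined ? pkg.license : pkg.licenses;
  const list = Array.isArray(raw) ? raw : [raw];
  return list
    .map((l) => (typeof l === 'string' ? l : l && l.type))
    .filter(Boolean);
}

console.log(declaredLicenses({ license: 'MIT' }));                        // → [ 'MIT' ]
console.log(declaredLicenses({ license: [{ type: 'LGPL', url: '...' }] })); // → [ 'LGPL' ]
```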

P.S. - I got a bite on the SPDX bug tracker to what was originally a request to fix one of the licenses but has turned into maybe a broader discussion about automated license detection and stuff. It seems like manpower may be the driving factor in change there, so there is hope that I / we can see some traction on useful changes in the actual SPDX data and format.

iandotkelly commented 9 years ago

@myndzi

My position is that there is still use for a tool that can find license information - in readme and license files. This use is intended to be permissive and exploratory, but maybe that should be more explicitly documented.

I am also happy to adopt a less permissive viewpoint in an auditing mode. But, as you've pointed out, what is the legal standing of putting "MIT" in the package.json file if you have a completely different license prominently displayed in your readme.md file?

nlf gets about 3,000 downloads a month - so I'd rather it were consistent with what you are doing from an auditing viewpoint; to my mind that means nlf pulling out the data for you to use as you see fit.

myndzi commented 9 years ago

The question here is about the data nlf presents to a consumer. It will say "this file contains an MIT license" when the file contains only the phrase "MIT". That, to me, is not reliable information, and I'd like to see the information that's presented become more reliable. I wonder how many of those downloaders actually understand how flimsy some of the data they're getting may be? nlf is one of the only options for performing this kind of information digging; the other one I found does not seem active and states up front that it's experimental and shouldn't be relied on. Yet it appears to be more reliable than nlf itself, without a significantly larger rate of undetected licenses. If I were judging packages based just on what's presented in the readme, I would pick nlf. Judging by what I know of the internals, I'd pick the other one.

I expect if I complete spdx-license-matcher that it will probably be the best shot on npm at reliable content identification. If you're interested in integrating it, I'll happily work with you as I clean up the code and refactor it into something presentable. If you decide to stick with your current approach, that's fine too.

iandotkelly commented 8 years ago

This isn't really the intention of nlf... it's intended to be a tool for finding potentially licensed or unlicensed products. A fully automated audit of open source seems ambitious until the fields in package.json are compulsory and binding.