jslicense / spdx-expression-parse.js

parse SPDX license expressions
https://npmjs.com/packages/spdx-expression-parse
MIT License
43 stars 23 forks source link

Fails to parse valid license strings. #2

Closed cscott closed 8 years ago

cscott commented 8 years ago

For example:

> p = require('spdx-expression-parse')
[Function]
> p('PHP-3.0')
{ license: 'PHP-3.0' }
> p('PHP-3.01')
Error: Lexical error on line 1. Unrecognized text.
PHP-3.01
-------^

...even though PHP-3.01 is indeed a valid SPDX license. This applies to 20 licenses in total:

> p = require('spdx-expression-parse')
> ids = require('spdx-license-ids')
> s = new Set(ids);
> ids.filter(function(ss) { try { p(ss); } catch (e) { return true; } });
[ 'Artistic-1.0-Perl',
  'Artistic-1.0-cl8',
  'BSD-2-Clause-FreeBSD',
  'BSD-2-Clause-NetBSD',
  'BSD-3-Clause-Clear',
  'BSD-3-Clause-Attribution',
  '0BSD',
  'BSD-4-Clause-UC',
  'CECILL-2.1',
  'CNRI-Python-GPL-Compatible',
  'CrystalStacker',
  'FSFULLR',
  'Interbase-1.0',
  'BSD-3-Clause-LBNL',
  'MPL-2.0-no-copyleft-exception',
  'OLDAP-2.0.1',
  'OLDAP-2.2.1',
  'PHP-3.01',
  'Sendmail',
  'SISSL-1.2' ]
cscott commented 8 years ago

(BTW, the above would make a nice sanity-check addition to the test suite for this package.)

cscott commented 8 years ago

And the licenses which fail to parse are actually present in node_modules/spdx-expression-parse/parser.generated.js so this doesn't seem to be a simple "need to regenerate the parser" issue -- although I will note that this package seems to be quite fragile that way: it only uses spdx-license-ids at parser-generation time; it would be nicer if it used the latest license list at runtime, ie by matching license tokens as [-A-Za-z0-9.]+ and then validating them against spdx-license-ids.

cscott commented 8 years ago

There seems to be a similar issue with exceptions, fwiw.

kemitchell commented 8 years ago

Thanks so much for this issue, as well as #3. Really great stuff.

I will land #3 ASAP. I hope you don't mind if I follow on with a small PR to conform code style before publishing with a semver patch bump.

I also plan to add you as a GitHub collaborator, because you clearly deserve it, and because this is too important for me to do alone. Please create PRs for anything significant; I will do the same from here on out. I've also protected master and gated on Travis CI green, mostly to protect the project from my itchy rebase trigger finger.

As for whether to make the parser itself generic and validate against the ID and exception lists brought in via require at runtime, I'm predisposed against, but am willing to be convinced. If you'd like, please feel free to start a new issue on that topic specifically.

Again, thanks for much!

cscott commented 8 years ago

I have no problem with code style tweaks.

I actually experimented with making the parser generic in the process of writing this patch, but it wasn't entirely clear how best to do so. Jison was rather poorly documented, especially in the way you are using it here, and I couldn't immediately discover a way to add a prologue clause to the generated parser in order to access the lists at runtime. In the end, writing the shortest simplest patch won out.

The parser getting out of sync with the license and exception lists is just something to keep an eye on, I guess. Too bad travis won't rerun the test suite on the master branch from time to time (every week?), that would could be made a way to ensure that things didn't slip.

kemitchell commented 8 years ago

Fortunately, the synchro issue is mostly spdx-license-ids' to deal with. Fortunately, were able to convince the SPDX folks to publish a JSON file on spdx.org that we can scrape:

https://github.com/shinnn/spdx-license-ids/blob/master/build.js#L13

RubyGems, which just landed validation in 2.5.0, is also relying on that JSON list, and has the same synchro problem.

Fortunately, it's not a big problem. While SPDX is very active in considering new licenses, the vast majority of packages in the big open-source repos use one of a very small subset of the overall SPDX list.

cscott commented 8 years ago

I discovered this because npm wouldn't accept PHP-3.01 as a valid SPDX license string on my php-embed package. Presumably if you just do a patch bump, the next release of node will bundle an npm with an updated spdx-expression-parse, as none of the dependencies in the chain npm->validate-npm-package-license->spdx-expression-parse is pinned. But it might be worth checking upstream and making sure there's not a more formal process required.

cscott commented 8 years ago

(Oh, and your defence package is quite nice, I've often been concerned about the examples in my README drifting out-of-date. I'll have to borrow that.)