ugexe / zef

Raku Module Management
Artistic License 2.0

Zef should understand SPDX license identifiers #154

Open samcv opened 7 years ago

samcv commented 7 years ago

Now that the license fields have been clarified and are going to use standardized identifiers, zef's ability to whitelist and blacklist licenses becomes much more useful.

One thing that does not depend on the ecosystem changing, though, is that zef should understand a whitelist or blacklist entry of NONE or NOASSERTION.

@gfldex has suggested that we map the following situations as shown here, though note that this has not yet been codified at all and is merely a suggestion:

This designates NONE. (Not explicit enough: unless NONE is specified, NOASSERTION should be assumed.)

json
"license":   null

This designates NOASSERTION (having no license field at all in the json)

...

Dual licensed projects

Dual licensing is specified as follows (this is part of the SPDX spec): Artistic-1.0-Perl OR GPL-1.0+, for example, is how Perl 5 is licensed. Even though no meta files use this yet, there are modules in the ecosystem that are dual licensed. Example of an eco project which does or should use this: https://github.com/jamesalbert/JSON-WebToken

Projects whose files or sections of source fall separately under different licenses

Artistic-2.0 AND X11. Example of an eco project which does or should use this: https://github.com/slobo/Perl6-X11-Xlib-Raw/pull/5

Thoughts and comments on this are appreciated! Thanks all.

Also, for reference, here is the updated S22 meta license section: https://design.perl6.org/S22.html#license

ugexe commented 7 years ago

I think NONE should be explicitly declared and null should be considered a NOASSERTION as well. Yes there should be a difference between submitting 'null' as an argument to some api and not sending anything at all, but not in the representation of the data itself.

samcv commented 7 years ago

I am inclined to think null is NOASSERTION as well, because I don't think it has enough data in it. And if it was turned into a non-JSON file format... it might not have a null, and then it would appear as an empty field.
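The rule both comments converge on (a missing field, a null, or an empty field all count as NOASSERTION, while NONE has to be spelled out) can be sketched as a small normalizer. This is only an illustration: `effective-license` is a hypothetical helper name, not something that exists in zef, and it assumes the META data has already been decoded into a hash.

```raku
# Hypothetical helper (not part of zef): map a META "license" value to
# the identifier a whitelist or blacklist would be checked against.
sub effective-license(%meta --> Str) {
    my $license = %meta<license>;
    # A missing field and an explicit JSON null both show up as undefined.
    return 'NOASSERTION' unless $license.defined;
    # An empty field (e.g. from a non-JSON format) carries no information either.
    return 'NOASSERTION' unless $license.Str.trim.chars;
    $license.Str.trim
}

say effective-license(%());                          # NOASSERTION
say effective-license(%(license => Any));            # NOASSERTION
say effective-license(%(license => 'NONE'));         # NONE
say effective-license(%(license => 'Artistic-2.0')); # Artistic-2.0
```

With this shape, a blacklist entry of NOASSERTION would catch every distribution that says nothing about its license, while NONE only catches those that declare it explicitly.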

samcv commented 7 years ago

@ugexe also there seems to be a problem where it does not read the license properly.

I tried using zef with this patch: https://github.com/samcv/zef/commit/17e953b645cf1c076ea39b7f8473a23d809736db

And it never actually gets the module's license. I get this output on installing a module:

whitelist: *
blacklist: 
thismodule: (Any)
ugexe commented 7 years ago

Ah yeah I guess that should be $dist.<meta><license> now (although I think what you had works on really old rakudos, since those use a different Distribution object).

I'm also trying to think of a better place to put this filter in zef. Imagine you search for Foo::Bar, which depends on Alpha::Omega. Now consider that two distributions fulfill the dependency on Alpha::Omega (let's say distributions Alpha and AlphaAlternative)... right now the filter does not get applied until after figuring out the build graph, downloading the distributions, etc. Instead the filtering needs to take place when looking up candidates in the first place, such that if Alpha contains a blacklisted license it will not be used to create a build graph. Thus it can be taken into consideration when searching multiple recommendation managers/emulates/supercededs/etc.
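To make the "filter at candidate lookup" idea concrete, here is a rough sketch using plain hashes in place of zef's real candidate and Distribution objects. Everything named here (`license-permitted`, the `@candidates` layout, treating `*` as "allow anything") is an assumption for illustration, loosely based on the `whitelist: *` output shown earlier.

```raku
# Hypothetical sketch: candidates are META-style hashes, not zef objects.
sub license-permitted(%meta, :@whitelist = ['*'], :@blacklist = [] --> Bool) {
    my $license = %meta<license> // 'NOASSERTION';
    return False if $license eq any(@blacklist);   # blacklist always wins
    return True  if '*' eq any(@whitelist);        # '*' is assumed to mean "any license"
    so $license eq any(@whitelist)
}

my @candidates =
    %( name => 'Alpha',            license => 'GPL-3.0' ),
    %( name => 'AlphaAlternative', license => 'Artistic-2.0' );

# Filter before any build graph is created, so Alpha is never considered.
my @usable = @candidates.grep: { license-permitted($_, blacklist => ['GPL-3.0']) };
say @usable.map(*<name>);   # (AlphaAlternative)
```

Applied this early, a rejected Alpha simply never competes with AlphaAlternative for the Alpha::Omega dependency.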

samcv commented 7 years ago
'GPL-1.0+ OR Artistic-1.0-Perl WITH Madeup-exception'.split(' OR ').perl.say
# ("GPL-1.0+", "Artistic-1.0-Perl WITH Madeup-exception").Seq

Unlikely we will have any exceptions because they're usually for compiled libraries and such, but just using split makes it easy to split up things that are OR'd.

In addition, note that GPL-1.0+ means GPL-1.0 and any later version, while just plain GPL-1.0 does not imply or specify that. Example code:

my @things = 'GPL-1.0+ OR Artistic-1.0-Perl WITH Madeup-exception'.split(' OR ');
for @things -> $license is copy {
    my Str $exception;
    my Bool $or-newer-version = False;
    ($license, $exception) = $license.split(' WITH ');
    if $license.ends-with('+') {
        $or-newer-version = True;
    }
    ...
}
ugexe commented 7 years ago

Thanks for the parser logic. I wonder if it is worth trying to treat the version portion as a Version, which would automatically handle the + (although it would also allow *, which I assume is not-a-thing defined by SPDX). This would be nice in that it allows zef to continue handling these versions in a similar fashion to perl6 distributions.

$ perl6 -e 'say v1.1 ~~ v1.0+'
True

$ perl6 -e 'say v1.1 ~~ v1.2+'
False
samcv commented 7 years ago

That sounds like it could be cool, using Version objects. Also note that if the project contains files that are under two separate licenses, then the keyword is AND, which means both of those licenses must be whitelisted for the module to be installed.

See my response here: https://github.com/slobo/Perl6-X11-Xlib-Raw/pull/5

ugexe commented 7 years ago
grammar Grammar::SPDX::Expression {
    regex TOP { <simple-expression> | <compound-expression> }

    token idstring { [<.alpha> | <.digit> | '-' | '.']+ }

    token license-id { <.idstring> }

    token license-exception-id { <.idstring> }

    token license-ref { ['DocumentRef-' <.idstring> ':']? 'LicenseRef-' <.idstring> }

    regex simple-expression {
        | <license-id>
        | <license-id> '+'
        | <license-ref>
    }

    regex compound-expression {
        | <simple-expression>
        | <simple-expression>   ' WITH ' <license-exception-id>
        | <compound-expression> ' AND '  <compound-expression>
        | <compound-expression> ' OR '   <compound-expression>
        | '(' <compound-expression> ')'
    }
}

The above is the ABNF translated from https://spdx.org/spdx-specification-21-web-version, Appendix IV: SPDX License Expressions.

samcv commented 7 years ago

@ugexe

grammar Grammar::SPDX::Expression {
    regex TOP {  <simple-expression> | <compound-expression> }

    token idstring { [<.alpha> | <.digit> | '-' | '.']+ }

    token license-id { <.idstring> }

    token license-exception-id { <.idstring> }

    token license-ref { ['DocumentRef-' <.idstring> ':']? 'LicenseRef-' <.idstring> }

    regex simple-expression {
        | <license-id> '+'?
        | <license-ref>
    }
    proto token complex-expression { * }
    token complex-expression:sym<WITH> { \s+ <( 'WITH' \s+ <license-exception-id> }
    token complex-expression:sym<AND>  { \s+ <( 'AND'  \s+ <simple-expression>    }
    token complex-expression:sym<OR>   { \s+ <( 'OR'   \s+ <simple-expression>    }

    regex compound-expression {
          | <simple-expression>
          | <simple-expression> <complex-expression>+
          | '(' <compound-expression> ')'
    }
}
Grammar::SPDX::Expression.parse('Artistic-2.0 OR blah AND foo').say;
Grammar::SPDX::Expression.parse('Artistic-2.0 OR what').say;
Grammar::SPDX::Expression.parse('Artistic-2.0').say;

Here you go. This works :-)

ugexe commented 7 years ago

I think the original compound expression was supposed to be recursive. Note the current parse tree:

「MIT AND LGPL-2.1+ OR BSD-3-Clause」
 compound-expression => 「MIT AND LGPL-2.1+ OR BSD-3-Clause」
  simple-expression => 「MIT」
   license-id => 「MIT」
  complex-expression => 「AND LGPL-2.1+」
   simple-expression => 「LGPL-2.1+」
    license-id => 「LGPL-2.1」
  complex-expression => 「OR BSD-3-Clause」
   simple-expression => 「BSD-3-Clause」
    license-id => 「BSD-3-Clause」

I'm not sure this allows operator precedence to be easily introduced. For instance, (MIT AND (LGPL-2.1+ OR BSD-3-Clause)) from the SPDX examples doesn't parse, but if it did it would be ideal if the parse tree made precedence ordering simple. So I think I (possibly incorrectly) expect the parse tree to be multi-level for each compound expression. Like:

「(MIT AND (LGPL-2.1+ OR BSD-3-Clause))」
 compound-expression => 「(MIT AND (LGPL-2.1+ OR BSD-3-Clause))」
   simple-expression => 「MIT」
      license-id => 「MIT」
   complex-expression => 「AND (LGPL-2.1+ OR BSD-3-Clause)」
      compound-expression => 「LGPL-2.1+ OR BSD-3-Clause」
         simple-expression => 「LGPL-2.1+」
            license-id => 「LGPL-2.1」
         complex-expression => 「OR BSD-3-Clause」
            simple-expression => 「BSD-3-Clause」
               license-id => 「BSD-3-Clause」
samcv commented 7 years ago

Try number two :-) This one parses these compound expressions (and all the ones my previous attempt did as well)

grammar Grammar::SPDX::Expression {
    regex TOP { \s* <simple-expression> | <compound-expression> \s* }

    token idstring { [<.alpha> | <.digit> | '-' | '.']+ }

    token license-id { <.idstring> }

    token license-exception-id { <.idstring> }

    token license-ref { ['DocumentRef-' <.idstring> ':']? 'LicenseRef-' <.idstring> }

    regex simple-expression {
        | <license-id> '+'?
        | <license-ref>
    }
    proto token complex-expression { * }
    token complex-expression:sym<WITH> { \s+ <( 'WITH' \s+ <license-exception-id> }
    token complex-expression:sym<AND>  { \s+ <( 'AND'  \s+ <compound-expression>    }
    token complex-expression:sym<OR>   { \s+ <( 'OR'   \s+ <compound-expression>    }
    regex paren-expression {
        '(' <compound-expression> ')'
    }
    regex compound-expression {
          | <paren-expression>
          | <simple-expression> [<complex-expression>+]?
    }
}
Grammar::SPDX::Expression.parse('MIT AND (LGPL-2.1+ OR BSD-3-Clause)').say;

This one parses:

「MIT AND (LGPL-2.1+ OR BSD-3-Clause)」
 compound-expression => 「MIT AND (LGPL-2.1+ OR BSD-3-Clause)」
  simple-expression => 「MIT」
   license-id => 「MIT」
  complex-expression => 「AND (LGPL-2.1+ OR BSD-3-Clause)」
   compound-expression => 「(LGPL-2.1+ OR BSD-3-Clause)」
    paren-expression => 「(LGPL-2.1+ OR BSD-3-Clause)」
     compound-expression => 「LGPL-2.1+ OR BSD-3-Clause」
      simple-expression => 「LGPL-2.1+」
       license-id => 「LGPL-2.1」
      complex-expression => 「OR BSD-3-Clause」
       compound-expression => 「BSD-3-Clause」
        simple-expression => 「BSD-3-Clause」
         license-id => 「BSD-3-Clause」

P.S.: I still can't get this to parse though: (MIT AND LGPL-2.1+) OR BSD-3-Clause

ugexe commented 7 years ago

Cool. Is it possible to make paren-expression match before compound-expression so that the first line of this snippet would not be needed?

   compound-expression => 「(LGPL-2.1+ OR BSD-3-Clause)」
    paren-expression => 「(LGPL-2.1+ OR BSD-3-Clause)」
     compound-expression => 「LGPL-2.1+ OR BSD-3-Clause」
samcv commented 7 years ago

OK, this works as you asked, and I eliminated some other unneeded extra tokens too: https://github.com/samcv/zef/blob/spdx-parser/spdxparser.p6

Let me know if I need to make any changes.

「MIT AND (LGPL-2.1+ OR BSD-3-Clause)」
 compound-expression => 「MIT AND (LGPL-2.1+ OR BSD-3-Clause)」
  simple-expression => 「MIT」
   license-id => 「MIT」
  complex-expression => 「AND (LGPL-2.1+ OR BSD-3-Clause)」
   paren-expression => 「(LGPL-2.1+ OR BSD-3-Clause)」
    compound-expression => 「LGPL-2.1+ OR BSD-3-Clause」
     simple-expression => 「LGPL-2.1+」
      license-id => 「LGPL-2.1」
     complex-expression => 「OR BSD-3-Clause」
      simple-expression => 「BSD-3-Clause」
       license-id => 「BSD-3-Clause」
ugexe commented 7 years ago

Awesome. I just need to figure out how to make this work at the search phase now!

I think it'll end up like Zef::Identity and/or Zef::Distribution::DependencySpecification. But I also have to keep in mind how to make it work well for both Zef::Repository::Ecosystems (a local JSON package list of metadata; it doesn't have to be the "ecosystem" ecosystem) and Zef::Repository::MetaCPAN. The first needs to filter the metadata JSON of each distribution itself; the second needs to outsource the filtering to MetaCPAN as a query (and still does some local filtering, but that's irrelevant).

The MetaCPAN one is nice in that it is easier to work with AND/OR, but I'm not sure how to translate + and WITH to an Elasticsearch query... http://hack.p6c.org:5000/v0/release?q=license:unknown AND (license:unknown OR license:unknown)

samcv commented 7 years ago

I put it here: https://github.com/samcv/SPDX-Parser I tried to use a parser class. But this is the first parser class I've ever done (at least other than playing around).

I gave you collaborator rights to the repo, so feel free to work on it there. It has some basic testing. I was thinking of having things end up in an array, so:

"MIT OR GPL-1.0" >>>> [ ['MIT', 'GPL-1.0'], ]
"MIT AND (GPL-1.0 OR Artistic-1.0-Perl)" >>>> [ ['MIT'], ['GPL-1.0', 'Artistic-1.0-Perl'] ]

But this may not make sense. I was just thinking that to check whether we can use a project, we need to match at least one license in each index of the outer array. In the first example (OR) there's an outer array with one inner array in it, so you need to match one of the licenses in there. Since there's only one index in the outer array, if we pass that, then we're fine.

With AND, we have to pass each of the inner arrays, in this case we need to pass MIT, and either GPL-1.0 or Artistic-1.0-Perl.
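As a rough sketch of the check described here, assuming the nested-array shape from the two examples: the outer array is walked as AND, and each inner array as OR. `license-allowed` is a made-up name for illustration and is not part of SPDX-Parser or zef.

```raku
# Hypothetical check over the proposed structure:
# outer array = every group must pass (AND),
# inner array = any one license in the group is enough (OR).
sub license-allowed(@groups, @whitelist --> Bool) {
    my $allowed = @whitelist.Set;
    for @groups -> @alternatives {
        # Fail as soon as one group has no whitelisted license in it.
        return False unless @alternatives.grep({ $allowed{$_} }).elems;
    }
    True
}

my @or-example  = [ ['MIT', 'GPL-1.0'], ];
my @and-example = [ ['MIT'], ['GPL-1.0', 'Artistic-1.0-Perl'] ];

say license-allowed(@or-example,  ['MIT']);                       # True
say license-allowed(@and-example, ['MIT']);                       # False
say license-allowed(@and-example, ['MIT', 'Artistic-1.0-Perl']);  # True
```

Note that this shape can only express an AND of ORs; something like `(MIT AND X11) OR GPL-1.0` would need a different representation.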

Not sure if the objects we want to have in zef will look like this, but it was a conceptual way for me to try and work things out... Feel free to ignore anything I said if it sounds bad.

ugexe commented 7 years ago

I have not forgotten about this. However I do first plan on removing a lot of backwards compatibility cruft so that I can create cleaner Distribution objects, which will then allow easy interfacing with these things at whatever stage you want.

samcv commented 7 years ago

Sounds good. Also for now it's probably good enough to just support 'AND' and 'OR' and then deal with the parser fully later. I've been busy and I remember running into a few problems. Haven't looked at it for at least a week, but that should not prevent us from adding support for more simple license identifiers.
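A minimal sketch of what "just AND and OR" support could look like without the full grammar, assuming flat expressions with no parentheses, no WITH, and no special handling of `+`. It relies on the SPDX rule that AND binds more tightly than OR, so the string is split on ' OR ' first and each alternative on ' AND '. The sub name is hypothetical.

```raku
# Hypothetical stop-gap: flat expressions only (no parentheses, no WITH).
# SPDX gives AND higher precedence than OR, so 'A AND B OR C' reads as
# '(A AND B) OR C'.
sub simple-expression-allowed(Str $expression, @whitelist --> Bool) {
    my $allowed = @whitelist.Set;
    for $expression.split(' OR ') -> $alternative {
        my @required = $alternative.split(' AND ');
        # One fully whitelisted alternative is enough to accept the module.
        return True if @required.grep({ $allowed{$_} }).elems == @required.elems;
    }
    False
}

say simple-expression-allowed('Artistic-1.0-Perl OR GPL-1.0+', ['GPL-1.0+']);  # True
say simple-expression-allowed('Artistic-2.0 AND X11', ['Artistic-2.0']);        # False
say simple-expression-allowed('Artistic-2.0 AND X11', ['Artistic-2.0', 'X11']); # True
```

Identifiers are compared as plain strings here, so `GPL-1.0+` only matches a whitelist entry spelled exactly `GPL-1.0+`.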