jpeddicord / askalono

A tool & library to detect open source licenses from texts
Apache License 2.0

License suite tests #4

Closed: jpeddicord closed this issue 6 years ago

jpeddicord commented 6 years ago

I'd like to set up some tests that verify that askalono doesn't regress in its ability to identify licenses. Basically: a directory of license files, sorted into subdirectories named after their actual license. askalono runs on each file, identifies it, and the test verifies that it got the answer right.

Possible implementation

test-licenses/
  Apache-2.0/
    sample-project-name/
      LICENSE
    other-project/
      COPYING
  BSD-3-Clause/
    and-another-one/
      LICENSE.txt
  ...

Another layout for this is perfectly acceptable; this could also be set up with some metadata files describing what a file should be identified as, with what confidence, etc. That may be overkill for the time being.
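
As a rough illustration of how a test could consume the directory layout above, here is a minimal sketch using only the Rust standard library. It is not askalono's actual test code: `LicenseCase`, `collect_cases`, and `find_mismatches` are hypothetical names, and the `identify` closure is a stand-in for whatever call ends up doing the real detection. It assumes the top-level directory name is the expected license identifier and that every file beneath it is a sample for that license.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// One sample from the test corpus (hypothetical type): the license it
/// should be identified as, taken from the top-level directory name, and
/// the file to analyze.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LicenseCase {
    pub expected: String,
    pub path: PathBuf,
}

/// Recursively collect every regular file under `dir`.
fn collect_files(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_files(&path, out)?;
        } else if path.is_file() {
            out.push(path);
        }
    }
    Ok(())
}

/// Walk `root`, treating each top-level directory name as the expected
/// license identifier for every file beneath it.
pub fn collect_cases(root: &Path) -> io::Result<Vec<LicenseCase>> {
    let mut cases = Vec::new();
    for entry in fs::read_dir(root)? {
        let license_dir = entry?.path();
        if !license_dir.is_dir() {
            continue; // ignore stray files at the root of the corpus
        }
        let expected = match license_dir.file_name().and_then(|n| n.to_str()) {
            Some(name) => name.to_owned(),
            None => continue, // skip directory names that are not valid UTF-8
        };
        let mut files = Vec::new();
        collect_files(&license_dir, &mut files)?;
        for path in files {
            cases.push(LicenseCase { expected: expected.clone(), path });
        }
    }
    // read_dir order is platform-dependent; sort for deterministic output.
    cases.sort_by(|a, b| a.path.cmp(&b.path));
    Ok(cases)
}

/// Run `identify` (a stand-in for the real detector) over every case and
/// return each case whose result differs from the expected license,
/// paired with the name that was actually reported.
/// Note: this reads files as UTF-8 and returns an error otherwise.
pub fn find_mismatches<F>(
    cases: &[LicenseCase],
    mut identify: F,
) -> io::Result<Vec<(LicenseCase, String)>>
where
    F: FnMut(&str) -> String,
{
    let mut mismatches = Vec::new();
    for case in cases {
        let text = fs::read_to_string(&case.path)?;
        let actual = identify(&text);
        if actual != case.expected {
            mismatches.push((case.clone(), actual));
        }
    }
    Ok(mismatches)
}
```

A test would then call `collect_cases(Path::new("test-licenses"))`, pass the result to `find_mismatches` with a closure wrapping the detector, and assert that the returned list is empty. Keeping the detector behind a closure means the crawl logic can be exercised without loading any license data.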

phrohdoh commented 6 years ago

Would you be open to (possibly mentoring) an external contribution for this task?

jpeddicord commented 6 years ago

@Phrohdoh Sure! Give me a little bit to flesh out this issue description (a lot of these issues I initially filed as "notes to self" and their descriptions kinda suck) -- I have a scrappy collection of license files I was originally testing with locally and have a rough idea of a way this could proceed.

jpeddicord commented 6 years ago

@Phrohdoh I clarified the description a bit. If you're still interested in this, I'd recommend playing around with Store as a starting point: have it load a cache file (or load from the SPDX directory; see cli/build.rs for an example of that) and get it to identify a license. examples/basic.rs has some of that as well. The documentation sucks (sorry) but the types involved should be relatively easy to figure out.

I opened up a Gitter room at https://gitter.im/amzn/askalono if you want to ping me for help; you may need to @ me as I'm not entirely sure I've configured notifications correctly 🙃

phrohdoh commented 6 years ago

Great, thanks!

I will take a stab at this over the upcoming weekend.

jpeddicord commented 6 years ago

@Phrohdoh no rush, but let me know if you'd like any extra help here. If you want, feel free to PR the code you showed on Gitter and we can iterate from there.

jpeddicord commented 6 years ago

I've set up the remainder of this infrastructure:

https://github.com/amzn/askalono/blob/master/tests/real_world.rs

This builds on @Phrohdoh's initial work and crawls a directory, parsing out expected license names and thresholds to test licenses in a flexible manner. While I definitely want to add more licenses to this test dataset, I think we can consider this issue resolved. :)