jpeddicord / askalono

A tool & library to detect open source licenses from texts
Apache License 2.0

Binary search for most likely location of license within file #15

Closed jpeddicord closed 6 years ago

jpeddicord commented 6 years ago

So here's the deal:

I've gotten a good bit of feedback that it would be Extra Cool if askalono could detect licenses in source code. It kind of can do this now, because the SPDX dataset has license headers embedded. For example, running it on the Rust sources in this repo tends to output Apache-2.0... but with very low confidence.

This is my plan: if an ID comes up with low confidence (below the 0.8 threshold in the CLI, maybe?), still take the "winner", then repeatedly re-scan the file against that single license to narrow down where it sits. In more detail:

  1. Store the line numbers of the start and end of the normalized text (so 0 and length-1, etc)
    1. Binary search start down through the text.
      1. Re-ID the sub-section of the file.
      2. If the score increased or stayed the same, keep going.
      3. If the score decreased, move the start pointer backwards.
      4. Repeat the search until a local maximum score is found.
    2. Perform the same binary search on end, but up from the end of the text.
  2. Once start/end bounds have been determined, return the result.

The order of start and end doesn't really matter; we'll always do both. It's likely that the start search will end very quickly since license headers tend to be at the start of files.

However, the result is no longer e.g. "askalono identified this file as Apache 2.0", it's "askalono found Apache 2.0 at lines 2-15".

This is likely blocked by #14. Also, I want to think about this a lot more, so assigning to myself.
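For illustration, the narrowing steps above could be sketched roughly as follows. This is not askalono's code: `words`, `score`, and `locate_license` are hypothetical names, and the scorer is a simple word-set Dice coefficient standing in for a real "re-ID against one license" call. The step-halving loop is one possible reading of "binary search until a local maximum is found".

```rust
use std::collections::HashSet;

/// Lowercased words with surrounding punctuation stripped (crude normalization).
fn words(text: &str) -> HashSet<String> {
    text.split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect()
}

/// Hypothetical stand-in scorer: Dice coefficient over word sets, in 0.0..=1.0.
fn score(lines: &[&str], license: &str) -> f32 {
    let found = words(&lines.join(" "));
    let wanted = words(license);
    if found.is_empty() || wanted.is_empty() {
        return 0.0;
    }
    let common = found.intersection(&wanted).count() as f32;
    2.0 * common / (found.len() + wanted.len()) as f32
}

/// Returns the half-open line range `start..end` and its score.
fn locate_license(lines: &[&str], license: &str) -> (usize, usize, f32) {
    let (mut start, mut end) = (0, lines.len());
    let mut best = score(lines, license);

    // Push `start` down through the text, halving the step after each miss.
    let mut step = (end - start) / 2;
    while step > 0 {
        if start + step < end {
            let s = score(&lines[start + step..end], license);
            if s >= best {
                // Score increased or stayed the same: keep going.
                start += step;
                best = s;
                continue;
            }
        }
        // Score decreased: leave the pointer where it was, try a smaller step.
        step /= 2;
    }

    // Same search on `end`, moving up from the end of the text.
    let mut step = (end - start) / 2;
    while step > 0 {
        if end > start + step {
            let s = score(&lines[start..end - step], license);
            if s >= best {
                end -= step;
                best = s;
                continue;
            }
        }
        step /= 2;
    }

    (start, end, best)
}
```

Called on a script whose second through fourth lines hold a three-line header, this returns the range `1..4` (zero-based, half-open), which would be reported as "lines 2-4". Because ties are accepted, a scorer that cannot tell a repeated license line from a blank one could trim too far, so the `>=` comparison is a design choice worth revisiting with a real matcher.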

camillem commented 6 years ago

Hi, have you considered "just" isolating the headers and searching only those? That would make the analysis language-dependent, I guess, but hopefully it wouldn't add enormous complexity. (And thanks for this very promising tool.)

jpeddicord commented 6 years ago

@camillem That's a decent suggestion. askalono is already pretty good at stripping out special characters and whatnot, so:

/* This is my
 * license header
 * Copyright me
 */

turns into:

this is my
license header
copyright me

I'm more keen to rely on this kind of "natural" text matching instead of trying to figure out what comments look like in each kind of language, guessing where a header might start and end, etc.

Plus, on commodity hardware, a license match (against the entire set) takes about 5ms. Matching against only one license repeatedly to narrow that down might add another 3-4ms, tops.

But, this is a great suggestion to think about, and I'll keep mulling this over. :)
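The comment-stripping example above can be approximated with a few lines of Rust. This is only a sketch of the idea, assuming that trimming non-alphanumeric characters from both ends of each line is enough; `normalize` is a hypothetical helper, and askalono's real normalization does more than this.

```rust
/// Crude, language-agnostic normalization: strip leading/trailing symbols
/// from every line, lowercase the rest, and drop lines that end up empty.
fn normalize(text: &str) -> String {
    text.lines()
        .map(|line| {
            line.trim_matches(|c: char| !c.is_alphanumeric())
                .to_lowercase()
        })
        .filter(|line| !line.is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}
```

Nothing here knows what a comment looks like in any particular language, which is the appeal of the "natural" text approach. The trade-off is that trailing punctuation inside the license text is trimmed as well.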

jpeddicord commented 6 years ago

There's a very messy prototype in the processing-wip branch. It does work, or at least the unit tests pass.