Closed: jpeddicord closed this 6 years ago
Hi, have you considered "just" isolating the headers and searching those? That would make the analysis language-dependent, I guess, but I hope it wouldn't bring in enormous additional complexity. (And thanks for this very promising tool.)
@camillem That's a decent suggestion. askalono is already pretty good at stripping out special characters and whatnot, so:
/* This is my
 * license header
 * Copyright me
 */
turns into:
this is my
license header
copyright me
I'm more keen to rely on this kind of "natural" text matching instead of trying to figure out what comments look like in each kind of language, guessing where a header might start and end, etc.
Plus, on commodity hardware, a license match (against the entire set) takes about 5ms. Matching against only one license repeatedly to narrow that down might add another 3-4ms, tops.
But, this is a great suggestion to think about, and I'll keep mulling this over. :)
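To make the stripping step in the comment above concrete, here is a minimal Python sketch. It is not askalono's actual normalizer (that lives in the Rust sources); the function name `normalize` and the ASCII-only character class are assumptions made for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase text and drop comment markers and other punctuation.

    Hypothetical stand-in for askalono's normalization: it keeps only
    ASCII letters, digits, and whitespace, which is a simplification.
    """
    lines = []
    for line in text.lower().splitlines():
        # Replace every non-alphanumeric character (comment markers
        # included) with a space.
        cleaned = re.sub(r"[^a-z0-9\s]", " ", line)
        # Collapse runs of whitespace and trim both ends.
        cleaned = " ".join(cleaned.split())
        if cleaned:  # skip lines that held only punctuation, e.g. " */"
            lines.append(cleaned)
    return "\n".join(lines)


header = """/* This is my
 * license header
 * Copyright me
 */"""

print(normalize(header))
```

Because nothing here knows what a comment looks like in any particular language, the same function handles `#`, `//`, or `--` prefixes equally well, which is the appeal of matching on "natural" text.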
There's a very messy prototype in the processing-wip branch. It does work, or at least the unit tests pass.
So here's the deal:
I've gotten a good bit of feedback that it would be Extra Cool if askalono could detect licenses in source code. It kind of can do this now, because the SPDX dataset has license headers embedded. For example, running it on the Rust sources in this repo tends to output
Apache-2.0
... but with very low confidence.
This is my plan: If an ID comes up with low confidence (less than the 0.8 threshold in the CLI, maybe?) then still take the "winner", and repeatedly re-scan that single license on the file. In more detail:
1. Set pointers at the `start` and `end` of the normalized text (so `0` and `length-1`, etc).
2. Move `start` down through the text.
3. Once the score starts to drop, move the `start` pointer backwards.
4. Do the same for `end`, but up from the end of the text.
The order of `start` and `end` doesn't really matter; we'll always do both. It's likely that the `start` search will end very quickly since license headers tend to be at the start of files.
However, the result is no longer e.g. "askalono identified this file as Apache 2.0", it's "askalono found Apache 2.0 at lines 2-15".
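The two-pointer narrowing described in the plan above could look roughly like the following Python sketch. Everything named here (`bigrams`, `dice`, `narrow`) is hypothetical, and the Sorensen-Dice bigram score is only a stand-in; askalono's real scoring may differ. The sketch also works on word indices, so turning the result into "lines 2-15" would need an extra mapping from words back to source lines.

```python
from collections import Counter


def bigrams(words):
    """Count adjacent word pairs."""
    return Counter(zip(words, words[1:]))


def dice(a, b):
    """Sorensen-Dice coefficient over word bigrams (stand-in score)."""
    ca, cb = bigrams(a), bigrams(b)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 0.0
    overlap = sum((ca & cb).values())  # multiset intersection
    return 2 * overlap / total


def narrow(words, license_words, score=dice):
    """Shrink [start, end] (inclusive) around the best-scoring span."""
    start, end = 0, len(words) - 1
    best = score(words[start:end + 1], license_words)

    # Walk `start` down through the text while the score does not drop.
    while start < end:
        candidate = score(words[start + 1:end + 1], license_words)
        if candidate < best:
            break  # the next step would cut into the license text
        start, best = start + 1, candidate

    # Same idea for `end`, but walking up from the end of the text.
    while end > start:
        candidate = score(words[start:end], license_words)
        if candidate < best:
            break
        end, best = end - 1, candidate

    return start, end, best


license_words = "licensed under the apache license version 2 0".split()
words = "this is my file".split() + license_words + "fn main".split()

print(narrow(words, license_words))
```

This version peeks one step ahead before committing a move, which has the same effect as overshooting and then moving the pointer back. In the CLI it would presumably only run when the initial match scores below the 0.8 threshold.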
This is likely blocked by #14. Also, I want to think about this a lot more, so assigning to myself.