jpeddicord / askalono

A tool & library to detect open source licenses from texts
Apache License 2.0

How much should whitespace matter? #93

Open workingjubilee opened 6 months ago

workingjubilee commented 6 months ago

Deeply open-ended question. The following file is a direct copy of https://spdx.org/licenses/AGPL-1.0.html made "by hand" (right-click, copy, paste), yet askalono id scores it at only 0.999 instead of the 1.0 you get from printing the extract from the JSON: LICENSE-RIGHTCLICK.txt

It's not clear to me which is the canonical version and thus which is (arguably) a license violation. It's also not clear to me that askalono should fudge the line breaks here. It's also not clear to me that askalono should NOT fudge the line breaks here.

workingjubilee commented 6 months ago

I can't find an option in the library API to "massage newline differences like this one". I think such an option might be worth having, on top of the existing behavior where the return value is a ratio reflecting how well the text scores as a match.
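To make the idea concrete, here is a minimal sketch of what such a "massage newlines" step could look like when applied before scoring. It is written in Python and does not use askalono at all: `collapse_whitespace` and `similarity` are hypothetical helpers, `difflib.SequenceMatcher` is only a stand-in for a match ratio and is not askalono's scoring algorithm, and the two sample strings are made up rather than taken from SPDX.

```python
import difflib
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(a: str, b: str) -> float:
    """Stand-in match ratio in [0, 1]; NOT askalono's scoring algorithm."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


# Two made-up renderings of the same text that differ only in line breaks.
from_json = "This License applies to any program\nor other work.\n\nYou may copy it.\n"
from_html = "This License applies to any program or other work.\nYou may copy it.\n"

# Compared as-is, the line-break differences pull the ratio below 1.0.
raw = similarity(from_json, from_html)

# After collapsing whitespace, both renderings become the same string.
normalized = similarity(collapse_whitespace(from_json), collapse_whitespace(from_html))

print(f"raw ratio:        {raw:.3f}")
print(f"normalized ratio: {normalized:.3f}")
```

Making this opt-in, rather than always on, would leave the choice to callers who care about exact line breaks.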

That said, the original issue seems to be a problem in the underlying data: SPDX's HTML and JSON renderings differ subtly in how they emit whitespace.