jpeddicord / askalono

A tool & library to detect open source licenses from texts
Apache License 2.0

weird scoring / wrong identification for Apache-2.0 license text without appendix since spdx-license-list data v3.23 #94

Open decathorpe opened 8 months ago

decathorpe commented 8 months ago

see also https://github.com/spdx/license-list-XML/issues/2418

The spdx-license-list v3.23 update added the "Pixar" license, which is a variant of Apache-2.0.

With this version of the SPDX data, an Apache-2.0 license text without the appendix (like the one from the rust-lang/rust repo) is now a closer match to "Pixar" than to "Apache-2.0", despite being a perfect copy apart from the missing appendix.

Could this be because the appendix, which is marked as optional, is not missing entirely? See https://github.com/spdx/license-list-XML/issues/2418#issuecomment-1995028762
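
To illustrate how a shorter variant can outscore the original, here is a minimal sketch that assumes a Sørensen–Dice style score over word bigrams. It is not askalono's actual code: the `bigrams` and `dice` helpers and all strings are hypothetical toy stand-ins, not real license text. The point is only that when the input lacks a long section that one template includes, the template's extra bigrams lower the score, so a shorter template with a small difference can win.

```rust
use std::collections::HashMap;

/// Count word bigrams after lowercasing and splitting on whitespace.
/// (Hypothetical helper; real normalization would be more involved.)
fn bigrams(text: &str) -> HashMap<(String, String), usize> {
    let words: Vec<String> = text.split_whitespace().map(|w| w.to_lowercase()).collect();
    let mut counts = HashMap::new();
    for pair in words.windows(2) {
        *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
    }
    counts
}

/// Sørensen–Dice coefficient over bigram multisets: 2 * |A ∩ B| / (|A| + |B|).
/// Returns 0.0 when neither text has any bigrams.
fn dice(a: &str, b: &str) -> f64 {
    let (ba, bb) = (bigrams(a), bigrams(b));
    let total: usize = ba.values().sum::<usize>() + bb.values().sum::<usize>();
    if total == 0 {
        return 0.0;
    }
    let shared: usize = ba
        .iter()
        .map(|(k, &n)| n.min(*bb.get(k).unwrap_or(&0)))
        .sum();
    2.0 * shared as f64 / total as f64
}

fn main() {
    // Toy stand-ins for the license body and an optional appendix.
    let body = "permission is granted to use copy and modify this work under these terms";
    let appendix = "appendix how to apply the license to your work attach the following notice";

    // Template that includes the optional appendix (plays the role of "Apache-2.0").
    let full_template = format!("{} {}", body, appendix);
    // Short variant with a small difference and no appendix (plays the role of "Pixar").
    let variant_template = format!("{} except trademarks", body);

    // Input file: the body copied verbatim, appendix left out.
    let input = body;

    println!("vs full template:    {:.3}", dice(input, &full_template));
    println!("vs variant template: {:.3}", dice(input, &variant_template));
}
```

In this toy setup the input scores higher against the short variant than against the full template, which mirrors the misclassification described above. Whether that is the actual cause here depends on how the optional sections are handled before scoring.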

jpeddicord commented 1 month ago

Apologies for the incredibly slow reply here! I'm seeing that SPDX might have split out the optional sections of this which could help. I'm pulling in updates for that now and am encountering other scoring issues (BSD-3-Clause, this time) to debug -- hopefully nothing too crazy.

For what it's worth, regression tests can be added to tests/data/real-licenses; if a particular license causes trouble in the future, a test case there helps narrow down the problem. But because of the way text matching works in this library, there is only so much it can do.

decathorpe commented 1 month ago

Thank you for taking a look! Yeah, I reported this issue to the SPDX people, and they split the optional parts of the appendix further to try to help with this.

But I tried with the latest SPDX license data version, and the issue is still there: this license text (without the appendix but with the "END OF TERMS AND CONDITIONS" line), which many Rust projects use because they just copy the files from the rust-lang/rust repo, is still misclassified as "Pixar":

https://github.com/rust-lang/rust/blob/master/LICENSE-APACHE
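
To check whether a given LICENSE-APACHE file has the same shape (end marker present, appendix absent), something like the following sketch may help. The `TailShape` type and `tail_shape` function are hypothetical names made up for this example; the two marker strings are the ones from the standard Apache-2.0 text.

```rust
/// Which parts of the Apache-2.0 tail a text contains.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
struct TailShape {
    has_end_marker: bool,
    has_appendix: bool,
}

/// Look for the end-of-terms line and the appendix heading,
/// ignoring case and differences in whitespace or line wrapping.
fn tail_shape(text: &str) -> TailShape {
    let normalized = text
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_uppercase();
    TailShape {
        has_end_marker: normalized.contains("END OF TERMS AND CONDITIONS"),
        has_appendix: normalized
            .contains("APPENDIX: HOW TO APPLY THE APACHE LICENSE TO YOUR WORK"),
    }
}

fn main() {
    // Shortened stand-in for a file copied without the appendix.
    let sample = "9. Accepting Warranty or Additional Liability. ...\n\nEND OF TERMS AND CONDITIONS\n";
    let shape = tail_shape(sample);
    // The affected layout is: end marker present, appendix absent.
    println!("{:?}", shape);
}
```

A file for which this reports the end marker but no appendix has the layout that triggers the misclassification in this report.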