scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Correctly document linked libraries #27559

Open stefan6419846 opened 7 months ago

stefan6419846 commented 7 months ago

Describe the issue linked to the documentation

When downloading the current wheel for scikit-learn==1.3.1, the metadata tell me that the package is subject to the terms of BSD-3-Clause. Unfortunately, this only applies to the package itself. Skimming through the distributed files, there are at least two additional cases:

Suggest a potential alternative/fix

It would be great if a full list of the external modules shipped within scikit-learn wheels, together with their copyright information, were provided so that possible license conflicts can be detected early.

adrinjalali commented 7 months ago

Hmm, this is interesting. This basically makes it impossible for people to develop private code using scikit-learn as long as libgomp is bundled inside I think. This seems like an oversight from our side.

cc @scikit-learn/core-devs

stefan6419846 commented 7 months ago

IANAL, but: GCC has the runtime exception which should reduce the general risk (see copyright header as well): https://www.gnu.org/licenses/gcc-exception-3.1.html Nevertheless, if this is clearly documented on the scikit-learn side, this should at least resolve basic confusion.

glemaitre commented 7 months ago

> External code snippets under licenses like MIT, Apache-2.0 and Python-2.0

@stefan6419846 Could you provide the way you found them?

> This basically makes it impossible for people to develop private code using scikit-learn as long as libgomp is bundled inside I think.

Actually, it means that you need to build scikit-learn from source using an OpenMP implementation that is not GPL, because we don't bundle libgomp in the source package, only in the wheel.

I assume that the only way we can work around this is to always use the LLVM compilers with llvm-openmp, as we already do for the macOS wheels. The licence is Apache-2.0 in this case.

stefan6419846 commented 7 months ago

> Could you provide the way you found them?

I used https://github.com/stefan6419846/license_tools, a custom wrapper around https://github.com/nexB/scancode-toolkit/
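For a rough first pass without any scanner installed, the file listing of a wheel can be inspected with the standard library alone. The sketch below is not how license_tools or scancode-toolkit work; it only lists archive members whose names look like bundled native libraries or license files. The name markers and the helper names `summarize_wheel` and `scan_wheel` are assumptions made for illustration, and a name-based heuristic like this will miss license headers embedded in source files.

```python
import zipfile
from pathlib import PurePosixPath

# Heuristic markers (assumptions, not an exhaustive list).
LIB_MARKERS = (".so", ".dylib")
LICENSE_MARKERS = ("license", "licence", "copying", "notice")


def summarize_wheel(names):
    """Split archive member names into likely bundled libraries and license files."""
    libs, licenses = [], []
    for name in names:
        base = PurePosixPath(name).name.lower()
        if any(marker in base for marker in LICENSE_MARKERS):
            licenses.append(name)
        elif base.startswith("lib") and any(marker in base for marker in LIB_MARKERS):
            # Extension modules such as "_foo.cpython-311-...so" do not start
            # with "lib", so only separately bundled libraries land here.
            libs.append(name)
    return libs, licenses


def scan_wheel(path):
    """Open a wheel (a zip archive) and summarize its member names."""
    with zipfile.ZipFile(path) as wheel:
        return summarize_wheel(wheel.namelist())
```

Calling `scan_wheel` on a downloaded wheel file returns two lists. A bundled library with no matching license text in the second list is a candidate for closer review with a real scanner.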

glemaitre commented 7 months ago

@stefan6419846 Thanks. I assume that we should be running such tools and have a proper LICENCE file integrated into the wheels.

stefan6419846 commented 7 months ago

In theory you shouldn't need to run these tools regularly, but perform an initial complete review of the current code base for all external stuff to document it appropriately (and in which cases it is shipped in the official distributions) - this can be assisted by corresponding scanning tools.

Future checks usually can be subject to a general pull request review process, backed by corresponding contribution docs (when and how to include new external code, including indirect dependencies, how to ensure license compatibility ...)

GaelVaroquaux commented 7 months ago

> This basically makes it impossible for people to develop private code using scikit-learn as long as libgomp is bundled inside I think.

I don't believe that this is true.

Still, it would be good from our side to document things better.

stefan6419846 commented 7 months ago

> > This basically makes it impossible for people to develop private code using scikit-learn as long as libgomp is bundled inside I think.
>
> I don't believe that this is true.

License compliance does not really allow for generalization and IANAL, but yes, it depends on how you use/distribute your applications and what your legal department considers appropriate. In general, the GPL being a strong copyleft license can be an issue and a more permissive license (like Apache-2.0) might be desired, but internal use without distribution or SaaS-based usage tends to be fine at least.

lorentzenchr commented 7 months ago

As already pointed out, the GCC RUNTIME LIBRARY EXCEPTION states

1. Grant of Additional Permission.

You have permission to propagate a work of Target Code formed by combining the Runtime Library with Independent Modules, even if such propagation would otherwise violate the terms of GPLv3, provided that all Target Code was generated by Eligible Compilation Processes. You may then convey such a combination under terms of your choice, consistent with the licensing of the Independent Modules.

libgomp has this exception and is only included in our binaries (wheels) when compiling via gcc, isn't it? IANAL, but I don't see a problem here. And I also don't know if it is a good idea to add anything to the docs.

License scanning, on the other hand, is usually a good idea 😏

stefan6419846 commented 7 months ago

> And I also don't know if it is a good idea to add anything to the docs.

This depends on the general perspective you want to take. Yes, in general FOSS, and especially the liability/warranty clauses of most licenses, do not require anyone to provide such information. Such documentation can still serve as a basic indication of the current licensing situation and give short hints about possible issues, while showing that someone is aware of the possible implications.

Given the liability clauses above, I will always have to check for correctness of the statements as well to avoid hidden risks (studies have shown that there are quite some projects which do not correctly document "hidden" licenses). During such a process, I stumbled upon the current documentation limitations and decided to file this issue to further evaluate what a suitable solution could look like.

As some examples, this is how scipy or opencv-python currently handle this: https://github.com/scipy/scipy/blob/main/LICENSES_bundled.txt https://github.com/opencv/opencv-python/blob/4.x/LICENSE-3RD-PARTY.txt

lorentzenchr commented 7 months ago

Scipy really bundles/vendors several whole libraries, i.e., they are included in the scipy source code. The only things we vendor are liblinear and libsvm, plus a few smaller code snippets like those in utils/_pprint.py.

If you think a LICENSES_bundled.txt as in numpy and scipy would help, then PR welcome. This, however, will not solve the (non) issue with libgomp in the wheel.

markdryan commented 1 month ago

> If you think a LICENSES_bundled.txt as in numpy and scipy would help, then PR welcome.

I think something like this is required. Assuming I've identified the correct licenses for liblinear and pprint and their code is included in the binary wheels, their licenses, BSD 3-Clause and PSF, require that their copyright notices and licenses are supplied with the binaries that contain them. As far as I can tell, the scikit-learn wheels do not currently do this.

Regarding libgomp, although the Runtime Exception clause applies to the scikit-learn code, I believe libgomp itself is distributed under the terms of the GPL v3, i.e., the source code from which it was built should be provided or linked to in some way. See the second paragraph of the section entitled "I use a proprietary compiler toolchain without any parts of GCC to compile my program, and link it with libstdc++" in the gcc-exception-3.1-faq. (libstdc++ is also released under the GCC Runtime Library Exception.)

Numpy and scipy have had a similar issue in the past with libgfortran, which is bundled in their binary wheels and is released under the same license as libgomp. When the numpy wheels are built, an OS-specific text file containing the licenses for all the bundled dependencies (including libgfortran) is now appended to the LICENSE.txt file included in the wheel. The entry for libgfortran in the final LICENSE.txt file contains a link to the libgfortran source code, although not, I think, to the exact version from which it was built.
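To make the idea concrete, here is a minimal sketch of such a build step in Python. It is not numpy's actual script: the per-OS file names (`LICENSE_linux.txt` and so on), the separator, and all function names are hypothetical and chosen only for illustration. It appends a platform-specific file of bundled-dependency notices to the project license before the wheel is packed.

```python
import sys
from pathlib import Path


def bundled_license_name(platform=sys.platform):
    """Map a platform string to a hypothetical per-OS bundled-license file name."""
    if platform.startswith("linux"):
        return "LICENSE_linux.txt"
    if platform == "darwin":
        return "LICENSE_macos.txt"
    if platform in ("win32", "cygwin"):
        return "LICENSE_windows.txt"
    raise ValueError(f"no bundled license file known for {platform!r}")


def merge_license_text(project_text, bundled_text):
    """Return the project license followed by the bundled-dependency notices."""
    separator = "\n\n----\n\n"
    return project_text.rstrip("\n") + separator + bundled_text.strip("\n") + "\n"


def append_bundled_licenses(license_path, bundled_dir, platform=sys.platform):
    """Rewrite the license file in place with the per-OS notices appended."""
    license_path = Path(license_path)
    bundled_path = Path(bundled_dir) / bundled_license_name(platform)
    merged = merge_license_text(
        license_path.read_text(encoding="utf-8"),
        bundled_path.read_text(encoding="utf-8"),
    )
    license_path.write_text(merged, encoding="utf-8")
```

A step like this would have to run once per platform build, before the license file is copied into the wheel, so that each wheel carries only the notices for the libraries it actually bundles.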

lorentzenchr commented 1 month ago

@thomasjpfan Could you contribute something similar to https://github.com/numpy/numpy/pull/20102 concerning the licenses?

thomasjpfan commented 1 month ago

Yea, I'll contribute something like https://github.com/numpy/numpy/pull/20102 for scikit-learn.