Open rpavlik opened 3 years ago
The ONNX models are binary protobufs, and there is a model_license field available in the model proto file (see https://github.com/onnx/onnx/blob/master/docs/IR.md#optional-metadata). However, this field is optional, so I don't know how many of the test models it's set on. If it were set for all the models, you'd see it if you ran strings on them.
Thanks for the tip about that field. In my brief perusal of the strings output I didn't see anything like that, so I'm assuming at least most of the models are missing it.
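For anyone wanting to repeat the check discussed above without the strings tool, here is a minimal Python sketch. It only approximates what strings does (runs of four or more printable ASCII bytes) and looks for the model_license key in the raw bytes; it does not parse the protobuf. The function names and the idea of scanning a directory tree are illustrative assumptions, not anything provided by onnxruntime.

```python
import re
from pathlib import Path

# Runs of 4+ printable ASCII bytes, roughly what `strings` prints by default.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def printable_strings(data: bytes) -> list[str]:
    """Return the printable ASCII runs found in a blob of bytes."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


def mentions_model_license(data: bytes) -> bool:
    """True if any printable run contains the metadata key 'model_license'."""
    return any("model_license" in s for s in printable_strings(data))


def models_without_license_key(root: str) -> list[Path]:
    """List .onnx files under root whose bytes never mention 'model_license'.

    Hypothetical helper: a hit only means the key is present, not that the
    value is a usable license statement.
    """
    return sorted(
        p for p in Path(root).rglob("*.onnx")
        if not mentions_model_license(p.read_bytes())
    )
```

Calling models_without_license_key(".") from a checkout would list the models that still need manual review. A model that does mention the key still needs a human to read the value.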
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
This is still an issue.
onnxruntime was rejected from Debian because of the unclear licensing of the models. This issue probably also prevents other distros from providing onnxruntime. It would be nice if we could find a way to sort that out.
rpavlik's proposal to use REUSE makes a lot of sense and would facilitate the redistribution of onnxruntime.
Lack of onnxruntime in Debian means that we can't package any software that uses it either (or we have to turn off any features that use it, because it's not available), so this is blocking further adoption. Debian serves as upstream for Ubuntu: packages that get into Debian end up in Ubuntu (and others) as well.
Background
I'm working on trying to package onnxruntime for Debian, part of which involves making a fairly in-depth inventory of the copyright and license of files in the repo. In a perfect world, all open source projects would follow something like https://reuse.software, which puts unambiguous, machine-readable copyright and license information in every file, or in an adjacent .license file for those files that cannot easily have a license header added.
Most of the source code I can figure out, but where I'm particularly having trouble is with the numerous .onnx model files in the repo (and any other binary files). They aren't plain text, of course, so there's no plaintext header; there's limited license info near them (e.g. there's winml/test/collateral/models/LICENSE.md describing two licenses, but it doesn't say which files those apply to); and there are files that look like they may have come from other projects: the models that, according to their filenames, are more than trivial tests.
My current workaround
Right now, I'm just excluding everything that doesn't appear to be a trivial test or covered by one of the two licenses in the file mentioned above, as "potentially not DFSG-free". But I'm not even confident that things with a trivial-sounding filename are in fact trivial/artificial examples constructed as part of this project and licensed along with it, as opposed to third-party models that just exhibit some important or testable behavior.
My alternative at the moment is to manually review the output of strings run on each model to make sure I only see generic-looking things, and additionally check the git history of each file, which doesn't necessarily give me a lot of data either. I'd like to be able to actually run the tests in the Debian package building process, but with so many mystery files I may just end up excluding all .onnx files from the repacked source to be sure to avoid license problems.
Describe the solution you'd like
It would be really great if someone more familiar with the history of these files, etc. could go through and add .license files next to each .onnx file (as in the REUSE standard) with copyright holder and SPDX-License-Identifier. It would make using this library elsewhere much more feasible. These can be made pretty quickly by hand or by using the reuse project's command line tool.
I'm pretty sure additional information could be added to these files for models from external sources, as long as they start with something parsable as copyright line(s) and SPDX-License-Identifier line(s).
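To make the request concrete, here is a rough Python sketch of what adding the sidecar files could look like. It follows the REUSE convention of naming the sidecar after the full filename plus .license (model.onnx becomes model.onnx.license) and starting it with SPDX-FileCopyrightText and SPDX-License-Identifier lines. The helper names are made up for illustration, and the copyright holder and license values must come from someone who actually knows each model's origin; nothing here can infer them.

```python
from pathlib import Path


def license_sidecar_text(copyright_holder: str, spdx_id: str, note: str = "") -> str:
    """Build the text of a REUSE-style .license sidecar file.

    The copyright and SPDX lines come first; an optional free-form note
    (e.g. where an external model came from) follows them.
    """
    lines = [
        f"SPDX-FileCopyrightText: {copyright_holder}",
        "",
        f"SPDX-License-Identifier: {spdx_id}",
    ]
    if note:
        lines += ["", note]
    return "\n".join(lines) + "\n"


def write_sidecar(model_path: Path, copyright_holder: str, spdx_id: str,
                  note: str = "") -> Path:
    """Write <model filename>.license next to the model and return its path."""
    sidecar = model_path.with_name(model_path.name + ".license")
    sidecar.write_text(
        license_sidecar_text(copyright_holder, spdx_id, note), encoding="utf-8"
    )
    return sidecar


def missing_sidecars(root: str) -> list[Path]:
    """List .onnx files under root that have no adjacent .license file."""
    return sorted(
        p for p in Path(root).rglob("*.onnx")
        if not p.with_name(p.name + ".license").exists()
    )
```

For example, write_sidecar(Path("some_test_model.onnx"), "2021 Example Holder", "MIT") would create some_test_model.onnx.license; the path, holder, and license in that call are placeholders only. missing_sidecars(".") gives a to-do list of models that still lack a sidecar.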
(For an example of one I'm not equipped to figure out: it looks like the original "fast neural style" models might be "free for research and non-commercial use", which isn't DFSG-free or OSI open source. I am not sure which FNS-related models are just conversions of these original non-free models, and which are independent re-implementations based on the paper and thus subject to some other license. I also see some things mentioning BERT, whose original models are Apache-2.0, etc.)
Full REUSE compliance would be even better, but that takes more work, and I can extract or infer license and copyright for most source code files just fine; it's the models I struggle with.
Happy to help or offer advice, I just don't have enough background in ML or this project to answer some of the questions around the origins of these files.
System information