oss-review-toolkit / ort

A suite of tools to automate software compliance checks.
https://oss-review-toolkit.org
Apache License 2.0

Reporter: Why is ORT inventing Copyright (c) holder information? #4294

Closed porsche-rishisaxena closed 3 years ago

porsche-rishisaxena commented 3 years ago

We have encountered an issue where ORT is inventing Copyright (c) holder information by categorizing "Author" (from package.json) and "Copyright (c)" (from the license text) under the same label, i.e. Copyright (c), in the Reporter Web App. What is the legal reasoning behind this?

LourensVeen commented 3 years ago

I'm not an ORT developer, but my understanding of the tool is that it casts a wide net, and leaves the final decisions to the user. So in this case, it's not trying to make a definitive list of copyright owners, it's just scanning through the code for any statements that may indicate that someone is potentially a copyright owner.

Since by the Berne Convention anyone who authors a work automatically owns the copyright on it (work for hire excepted, and this can be overridden contractually of course), the fact that someone is the author of a work at least signals the possibility that they are also the copyright owner. In the absence of any other evidence, starting with an Author field seems reasonable if you're looking for copyright owners.

You could argue that in the presence of a copyright statement which explicitly mentions who owns the copyright, any Author statements elsewhere can be ignored. Practice is never that simple though. What if it says "Copyright (c) 2021 Contributors"? Then maybe you'll need that authors list to figure out who that actually is. Or someone could contribute some code, add their name as an author, but not also to the copyright statement. Then there are several options: maybe they actually are a copyright owner and that's just not been properly administrated, maybe they signed over their copyright in a separate CLA that's not in the source code, or maybe the project uses the Apache License 2.0 and we decide we are willing to rely on its built-in CLA-like clause.

All this stuff is beyond what an automated tool can decide for you. So it just makes a big list of everyone it can find that might own (part of) the copyrights to the dependency, and then it lets the user figure out what to do with it. That's already a big help in my opinion.
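To illustrate the "big list" idea, here is a minimal Python sketch. It is not ORT's implementation: the `Candidate` class, the `collect_candidates` function, the source labels, and the sample names are all made up for the example. It merges every name found anywhere into one list and records where each name came from, so the human reviewer can weigh an Author field differently from an explicit copyright statement.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A name that might own (part of) the copyright, plus where it was seen."""
    name: str
    sources: set[str] = field(default_factory=set)


def collect_candidates(findings):
    """Merge (name, source) pairs into one list, remembering every source.

    Nothing is filtered out here: deciding who really holds the copyright
    is left to the person reviewing the list.
    """
    by_name = {}
    for name, source in findings:
        # Naive normalization (an assumption of this sketch): trim and lowercase.
        key = name.strip().lower()
        candidate = by_name.setdefault(key, Candidate(name=name.strip()))
        candidate.sources.add(source)
    return sorted(by_name.values(), key=lambda c: c.name.lower())


# Hypothetical findings for a single dependency.
findings = [
    ("Jane Doe", "package.json author"),
    ("Example Corp", "LICENSE copyright statement"),
    ("jane doe", "source file header"),
]

candidates = collect_candidates(findings)
for c in candidates:
    print(f"{c.name}: {', '.join(sorted(c.sources))}")
```

Keeping the source next to each name is the point of the design: the list stays wide, but the reviewer can still see which entries are merely authors.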

hyandell commented 3 years ago

I agree with @porsche-rishisaxena - the authors in the metadata files are not intended for use in copyright statements. They are also rarely all the authors of a project, nor are they guaranteed to be the correct copyright-holding entity for the contribution that author provided.

I also think, as a feature, hunting down copyright notices and putting them in a text file is a low priority one. Most open source licenses definitely don't tell you to do this, and even for the ones where folk do it in some cases, it's more out of caution than explicit instruction.

LourensVeen commented 3 years ago

It seems to me that we may be looking to use ORT to solve different problems. I'm thinking of a scenario where my company has made some software, and has used some open source libraries which in turn depend on other libraries and it's libraries all the way down.

Now my company wants to release its software, and so we need to know what we depend on exactly and whether we have copyright permission to release those dependencies along with our software package. To answer that question, we need to determine a few things:

  1. What are our direct and indirect dependencies?
  2. Who owns the copyright on that code?
  3. Has the copyright owner licensed the code, and if so, under what license?
  4. Which constraints do the respective licenses impose on us?

The key issue here is point 2. To answer this question, we need to know the legal entities that actually own the work. In an ideal world, every code base would come with clear, unambiguous information about who exactly owns which part of it, and people would never copy-paste code from elsewhere without meticulously administrating (in a machine-readable way) the source of that code, who owns it, and how it's licensed. People would also never contribute code that they don't own or control the copyright of, and there would never be any ambiguity as to who owns a piece of code or whether a particular person has permission to contribute the code on behalf of the copyright owner (say, their employer).

Unfortunately, people frequently do not carefully administrate copyright ownership, people do sometimes copy-paste code from elsewhere, and they do sometimes contribute code that they wrote as an employee of a company to a project without the company having given permission, and companies don't always have clear policies on whether and how that works. (I work in science in the Netherlands, where code written by employees is owned by the university, universities have no explicit policies or processes for licensing it properly, but scientists do put the code they wrote and that the university owns online under open source licenses. Can a PhD student license code on behalf of the university? Technically, the answer is no. Will you get sued if you use it anyway? Right now, probably not. But those copyrights last for many decades, and maybe if funding dries up the universities will reconsider their IP policies...)

So this feature of ORT tries to find as much information as possible to help answer this question. Unfortunately, in an imperfect world, it cannot be answered perfectly by automated software, and a human will have to take that information, consider the situation, and decide whether any remaining ambiguity and therefore remaining risk is low enough to be acceptable.

Licenses don't mention anything about this because they're in step 3. You need to own something first before you can license it, and who owns what is determined by the law, not by a license. (Although if someone has sold their copyright then contracts come in, but standard open source licenses do not rely on contract law.)

sschuberth commented 3 years ago

> I'm not an ORT developer, but my understanding of the tool is that it casts a wide net, and leaves the final decisions to the user.

As an ORT developer, I can confirm what you wrote, and couldn't have phrased it better. So many thanks for that.

> Since by the Berne Convention anyone who authors a work automatically owns the copyright on it

Note, however, that AFAIK this is a European-centric view of things (which is fine; ORT is mostly developed in Europe 😀 ). In American law, it seems to actually be possible that an author might not (anymore) hold a copyright on the software (s)he wrote, e.g. because the copyright was transferred to the employer. In Germany's "Urheberrecht", by contrast, this is not possible; the author will always keep the "Urheberrecht", and the employer is only granted usage rights.

hyandell commented 3 years ago

> It seems to me that we may be looking to use ORT to solve different problems.

Totally - my context was in generating attribution documentation. I agree that you're looking at a different problem.

> 2. Who owns the copyright on that code?

It's a very interesting question. Through source control logs, metadata, explicit copyright statements, files like AUTHORS, and analysis of issue trackers, I could see a rough view of who owns the copyright on a project being identifiable, together with some form of system that then removes copyright holders whose code has subsequently been removed. It doesn't feel like a simple process though; I don't envy you this work :)
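A very rough Python sketch of that last pruning step, under loud assumptions: the `blame` mapping, the `current_contributors` function, and all names are hypothetical, and "still has surviving lines" is only a crude proxy, not a legal test of ownership. The input imitates what a per-line attribution such as `git blame` reports.

```python
from collections import Counter


def current_contributors(blame, known_holders):
    """Keep only holders who still have lines in the current code base.

    `blame` maps a file path to a list with one contributor name per
    surviving line (roughly what `git blame` reports per line).
    Returns a dict of holder -> number of surviving lines.
    """
    surviving = Counter()
    for lines in blame.values():
        surviving.update(lines)
    return {name: surviving[name] for name in known_holders if surviving[name] > 0}


# Hypothetical data: carol's code has been removed entirely.
blame = {
    "src/a.py": ["alice", "alice", "bob"],
    "src/b.py": ["alice"],
}
known_holders = ["alice", "bob", "carol"]

result = current_contributors(blame, known_holders)
print(result)
```

Even in this toy form the limits are visible: moved or reformatted lines change the attribution, and a rewritten line may still be a derivative of the original author's work, so a human would have to review the outcome.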

pombredanne commented 3 years ago

FWIW, the underlying ScanCode-toolkit does distinguish between copyright statements and authorship statements and reports these separately.

sschuberth commented 3 years ago

As the question has been answered and the conclusion is to make this configurable, I'm closing this in favor of #4314.
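For readers wondering what "configurable" could look like, here is a minimal Python sketch. It is an assumption-laden illustration, not ORT's code or its actual option name: `copyright_entries` and the `add_authors` switch are hypothetical. Copyright statements and authors are kept as separate inputs, and authors are only folded into the copyright list when the user opts in.

```python
def copyright_entries(copyright_statements, authors, add_authors=False):
    """Return the entries to show under the copyright label.

    `add_authors` is a hypothetical switch: when False, only explicit
    copyright statements are shown; when True, authors are appended too.
    """
    entries = list(copyright_statements)
    if add_authors:
        for author in authors:
            if author not in entries:  # avoid listing the same name twice
                entries.append(author)
    return entries


# Hypothetical findings for one package.
statements = ["Example Corp"]
authors = ["Jane Doe", "Example Corp"]

strict = copyright_entries(statements, authors)
wide = copyright_entries(statements, authors, add_authors=True)
print(strict)
print(wide)
```

With the switch off, the output matches the attribution-document use case; with it on, it matches the "wide net" use case discussed earlier in the thread.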