stan-dev / posteriordb

Database with posteriors of interest for Bayesian inference

Handling of posterior licenses #230

Open MansMeg opened 3 years ago

MansMeg commented 3 years ago

Is the intention to accept data with closed licenses? If not, the CONTRIBUTING.md text should be clarified as to what licenses other than BSD-3 are acceptable. If there are multiple licenses, there needs to be a master list, and they all need to be compatible if you're going to put them into the same package, since very few licenses are as compatible with other licenses as BSD-3.

ahartikainen commented 3 years ago

Should all models here use the same permissive license?

MansMeg commented 3 years ago

@avehtari will check if the Aalto data support team can help us solve this, including whether we should have different licences for data and code.

bob-carpenter commented 8 months ago

Was there ever a licensing decision? I want to import the models elsewhere, but can't do that without a license. Also, this repo needs to cite the source and license of any model retrieved from elsewhere. This might be hard as I think some of them came from the example-models directory somewhere on stan-dev, but that was never released with a license as far as I know.

P.S. The reason I'm asking is that we are building out a database of models at Flatiron Institute with just the model implementation, data, and draws, but we can't do that without a license in place. The main motivation is that (a) we can distribute big data sets through our cluster, and (b) we can strip down the complexity of the R and Python packages so that the distribution is just the Stan programs, data, and draws. This is another thing I don't want to do through the Stan project because I don't want to have to compromise with a bunch of people on the goals or contents.

MansMeg commented 8 months ago

So, I opened up this discussion with the Aalto lawyers in 2020, but the pandemic struck, and everything stopped. We need to know how to do this in a good way. As you say, most models are from within the community, and for some data I have gotten an okay to put it in posteriordb, but we didn't discuss the licence.

Do we have someone who knows about licensing of data and this type of thing?

bob-carpenter commented 8 months ago

I know the basics, but we also have access to IP lawyers for free through NumFOCUS if there are more complicated issues.

Copyright is automatically assigned to whoever writes code or text. The author can reassign copyright. For example, most faculty contracts at American universities stipulate that all copyright for code is reassigned to the university, but all text is owned by the author. I have no idea what contracts in Sweden are like.

The copyright owner can choose to distribute their copyrighted work with a license. Once the copyright holder does that, you can use it according to the license. For example, other projects can use stan-dev/stan and stan-dev/math and stan-dev/stanc3 code under the BSD-3 license without further permission from the Stan team. (Our name and logo are trademarked, which is a different branch of IP law.)

When you redistribute the copyrighted works of others, you are legally required to respect the licensing terms. They almost always require you to cite the copyright holder and the license under which the copyrighted work is used.

If you try to combine code with multiple licenses into a single project, there's an issue of license compatibility and copyleft. Some licenses are fundamentally incompatible, like Apache 2 and GPL 2, but others are compatible, like GPL 3 and BSD-3. If all you do is redistribute each contribution under its own license, it makes it harder for people to use the project (they have to scan the license for everything they use), but it's otherwise OK to do that. If you have more complicated questions, we'll have to get help from a real lawyer.

MansMeg commented 8 months ago

So the idea I think @avehtari had was to set the licence for each model and dataset in the database. So maybe the easiest would be to do that. Then you could filter out everything except the licences that you are OK with?
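A minimal sketch of what that filtering could look like, assuming each entry carried a licence field holding an SPDX-style identifier. The `license` field, the entry names, and both helper functions are hypothetical and only for illustration; they are not part of posteriordb's current schema or API.

```python
# Hypothetical per-entry metadata. The "license" field and the entry
# names are illustrative assumptions, not posteriordb's actual schema.
ENTRIES = [
    {"name": "model_a", "license": "BSD-3-Clause"},
    {"name": "model_b", "license": "GPL-3.0-only"},
    {"name": "data_c", "license": None},  # licence not yet clarified
]


def filter_by_license(entries, accepted):
    """Return the entries whose licence is in the accepted set."""
    accepted_set = set(accepted)
    return [e for e in entries if e.get("license") in accepted_set]


def missing_license(entries):
    """Return the entries that have no licence recorded at all."""
    return [e for e in entries if not e.get("license")]


usable = filter_by_license(ENTRIES, ["BSD-3-Clause", "MIT"])
unclear = missing_license(ENTRIES)
print([e["name"] for e in usable])
print([e["name"] for e in unclear])
```

Entries without a recorded licence are never matched by the filter, so a downstream user only gets material whose terms are known.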

bob-carpenter commented 8 months ago

You can create a repo with each component licensed under its own license. There just can't be an overall license unless you find one that's compatible. I'm not sure what that would mean relative to your writing Python or R code that compiles against those models. That's something I'd ask the NumFOCUS IP attorneys.

MansMeg commented 8 months ago

OK. You mean our code? That's mainly written by me, so that's no problem.

So do I understand you correctly that you are happy with clear licences per model/posterior?

bob-carpenter commented 8 months ago

Yes, I mean the R and Python code. It has to be released under some license, but I don't know what the implications would be of it using a bunch of Stan code under different licenses. This is where license compatibility becomes an issue, and also a reading of what it means to be a derivative product. If all of the models you are using are licensed under GPL v3, BSD-3, Apache 2, MIT, or similar, you're OK going with a BSD-3 license for your code. As soon as you try to include something with an incompatible license (e.g., a homebrew "academic only" use license or GPL v2), I would urge you to ask the NumFOCUS lawyers.

avehtari commented 8 months ago

The posteriordb repo has R code only in the tests directory, and it seems that directory could also be removed. The posteriordb repo has Python code for one PyMC model, but that is the model code. So it seems posteriordb just needs to have a clear license per model code and per data. I don't think the posterior draws are under copyright, and a data license for them also seems a bit silly, as they can be regenerated.

There are separate repositories, posteriordb-r and posteriordb-python, that have useful utilities for accessing data, model code, and reference posteriors from anywhere on the internet. The license for that code doesn't need to match the licenses in posteriordb, just as web browser code doesn't need to match the licenses of the material the browser downloads.

The different model codes in posteriordb are not linked together, and it is just a code repository in the same way as CRAN. CRAN contains packages with different licenses, including restrictive licenses that are incompatible with each other, but in general the CRAN packages are not combined, and in the same way the posteriordb models are independent of each other. Naturally, each piece of code needs to have a license that allows us to distribute it via the repository.

We do need to state the license for each code in the repository, and remove those codes for which the licence is not clear and we can't get the original author to license them with something suitable.

bob-carpenter commented 8 months ago

That makes sense, @avehtari. I hadn't realized posteriordb-r and posteriordb-python had been split out. If they're in different repos and don't distribute anyone else's copyrighted code and don't include posteriordb as a submodule, then I think you should be OK.

I think distributing a repo with a bunch of separately licensed codes is OK. I would personally stay away from anything other than the standard open source licenses, but that's your call.

I haven't thought about copyright on draws. You can only copyright things produced by a human, but I don't know the status of things produced by tools run by a human.