License Keyword - Githubissues

ax3l commented 4 years ago

Add a new data license keyword to the openPMD root group (/).

This can be used to express open access (creative commons) licenses et al. and avoids decoupling this important information from the medium the data is stored in.

Short-hand identifiers are defined in SPDX as well: https://spdx.org/licenses/

And we should keep a default/free text option for whatever non-free stuff people come up with.

Proposed / keyword (required):

license
- type: (string)
- description: the data license for this openPMD series; licenses should use the SPDX identifier, other license terms are prefixed with other:
- default: other:unknown
- examples:
- CC-BY-4.0 for the Creative Commons Attribution 4.0 International License
- CC-BY-SA-4.0 for the Creative Commons Attribution Share Alike 4.0 International License
- CC0-1.0 for the Creative Commons CC0 waiver
- other:unknown this value means no information was provided by the data creator(s) about the restrictions or rights to use this data
- advice to data creators: Do I have to select a license? No, but it is highly encouraged to do so if you want to share/publish your data and want it to be reused.
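The value rules proposed above can be sketched in a few lines of Python. This is only an illustration, not part of openPMD or any openPMD library: the function name classify_license and the three-entry set of identifiers are assumptions, and the complete identifier list lives at https://spdx.org/licenses/.

```python
# Hypothetical helper; not part of openPMD or any openPMD library.
# Tiny illustrative subset of SPDX identifiers (the examples from the
# proposal); the full list is at https://spdx.org/licenses/
KNOWN_SPDX_IDS = {"CC-BY-4.0", "CC-BY-SA-4.0", "CC0-1.0"}

DEFAULT_LICENSE = "other:unknown"  # default value from the proposal


def classify_license(value=None):
    """Return (kind, detail) for a proposed `license` attribute value.

    kind is "spdx" for a recognized SPDX identifier, "other" for free-text
    terms prefixed with "other:", and "invalid" for anything else.
    """
    if value is None:  # attribute missing: fall back to the default
        value = DEFAULT_LICENSE
    if value.startswith("other:"):
        return ("other", value[len("other:"):])
    if value in KNOWN_SPDX_IDS:
        return ("spdx", value)
    return ("invalid", value)
```

A missing attribute is treated like the default here, so classify_license() and classify_license("other:unknown") give the same result.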

DavidSagan commented 4 years ago

Hmmm. I'm not sure what the rules are in other countries but in the USA it is not possible to copyright information. Only the expression of that information can be copyrighted.

ax3l commented 4 years ago

(Disclaimer: I am not a lawyer. This is no legal advice. Check for your specific use case.)

Are you sure? What does information include in that case? Figures (2D pixel meshes), maps, databases, etc. can easily contain created data with enough originality that falls under general IP and copyright, as far as I know.

Even if copyright does not apply in a certain country, open data licenses cover more than this: https://en.wikipedia.org/wiki/Open_data Adding an explicit license can clarify the situation for people who try to use the data, e.g. in automated workflows (e.g. data mining) for meta-studies and ML.

Some background:

https://en.wikipedia.org/wiki/International_copyright_treaties "all creative works as soon as they are fixed in a medium"
OSM data base: https://www.openstreetmap.org/copyright
https://www.copyright.gov/reports/db4.pdf p. 10: "independent creation plus a modicum of creativity [...] To be sure, the requisite level of creativity is extremely low; even a slight amount will suffice. The vast majority of works make the grade quite easily, as they possess some creative spark, ‘no matter how crude, humble or obvious’ it might be."

The only discussion I can find in the US is about databases that are a compilation of other works.

DavidSagan commented 4 years ago

Quite sure. For example from https://libguides.library.kent.edu/data-management/copyright :

Data are considered "facts" under U.S. law. They are not copyrightable because they are discovered, not created as original works. However, other intellectual property protections may be utilized to protect your work and ensure proper attribution.

Although data itself cannot be copyrighted, you may be able to own a copyright in the compilation of the data. Creative arrangement, annotation, or selection of data can be protected by copyright. Patent law may apply if your data collection leads to new and useful inventions such as machines, processes, manufactures, or improvements. Your data may be protected by trade secret if your formula, process, design, or method offers a commercial advantage. Keeping in mind that some contracts or grants come with non-disclosure agreements or other conditions requiring secrecy.

The standard example is the case of a telephone directory. You can copyright the layout of the directory but you cannot copyright the data so someone else is free to publish the same data as long as they use a different layout. For openPMD based files the layout is mandated by the openPMD standard so I do not believe anyone would have any copyright ownership on an openPMD based file.

Notice that I am talking about US law only. In the US there are no database rights. In the EU there are.

ax3l commented 4 years ago

Interesting, thanks! Yes, seems very different in other countries.

As found by the copyright.gov link above, "enhancing" a database with copyrighted material is another thing that could trigger copyright. For example, if I carefully pre-select the particles I write to a file, store a post-processed result again in openPMD, etc., this is likely to trigger copyright.

The telephone directory did not source that information, we in most cases do.

Either way, keeping the default other:unspecified is fine but won't improve the situation. Our audience is international anyway. Adding an explicit open data license is very common for scientific data, also in the US. Examples:

https://figshare.com
https://www.data.gov, e.g. https://catalog.data.gov/dataset/federal-student-loan-program-data uses CC0-1.0
https://www.zenodo.org

DavidSagan commented 4 years ago

After mulling it over a bit, I am against having a data license field in openPMD. If a person wants to keep their data private, so be it. But if an openPMD dataset is shared I do not want to have to worry about rights. I do not like the idea that my actions in using an openPMD data set may set me up for being sued.

ax3l commented 4 years ago

A data license, just like open data itself and open source, never implies that one is forced to publish this data, or even any derived works. This is a common misunderstanding of open source and open data and indeed against the freedoms of open-X standards.

if an openPMD dataset is shared I do not want to have to worry about rights.

In most countries, default copyright will apply, as it would with other:unspecified. Indeed, if you share data (privately or publicly) as part of your work you already have to worry about rights, at least with your employer. Default copyright can imply that the receiver is potentially not allowed to read, reproduce, derive from, or do anything with the data unless explicitly permitted. You can still grant that explicit permission when you share/publish the data ("email: here is the data you asked for." implies reading is fine); nothing changes.

Using well-established licenses just simplifies the situation, nobody has to use them. It just makes clear what the situation is - other:unspecified - nothing is specified, one has to think/inquire for oneself. Maybe another attribute value would highlight this better?

DavidSagan commented 4 years ago

A data license, just like open data itself and open source, never implies that one is forced to publish this data.

I never said anything about mandatory sharing.

Using well-established licenses just simplifies the situation, nobody has to use them.

Actually I believe the opposite is true.

ax3l commented 4 years ago

I never said anything about mandatory sharing.

Oh, then I misunderstood. I thought you meant this:

If a person wants to keep their data private so be it.

Which is totally fine, even with a specific license. (License terms can contain MoU, commercial licenses, etc. as well for some users. Whether the writer of the file makes a valid claim is not our business.)

DavidSagan commented 4 years ago

The bottom line for me is still that if someone hands me a datafile I do not want to have to check if a certain field in a data file gives me permission to use it.

ax3l commented 4 years ago

Not a lawyer, but if a collaborator hands you a data file this means you got an implicit, non-exclusive usage right to read it and nothing further, just as one would expect.

They could still keep the data file closed otherwise and license it elsewhere.

The use case is really the opposite, e.g. people that aggregate/crawl/scan data for meta-studies, training, etc. and cannot do a manual contact-and-inquire workflow for many individual data sets.

DavidSagan commented 4 years ago

Well, if you want, the standard could be amended to say that by default any data file that uses the standard has no license restrictions and that if any restrictions are to be placed, the restrictions have to be transmitted externally along with the file. I just do not want to be forced to look in the file itself.

ax3l commented 4 years ago

I just do not want to be forced to look in the file itself.

I see your point. But just as before, we will not force people to make their data essentially public domain.

Well, then it's easy: we limit entries to clearly defined FSF and OSI approved licenses and everything else is other:, which we document to raise a warning in data readers. https://spdx.org/licenses

Data readers can also decide to abort on anything but the former and other:unspecified.
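A reader-side policy along these lines could look roughly like the sketch below. The names check_license and LicenseError are made up for illustration, and the allow-list is only a stand-in (an assumption) for whatever set of approved licenses would actually be documented.

```python
import warnings

# Assumption: an illustrative allow-list standing in for the approved set.
ALLOWED_SPDX_IDS = {"CC-BY-4.0", "CC-BY-SA-4.0", "CC0-1.0"}
UNSPECIFIED = "other:unspecified"


class LicenseError(Exception):
    """Raised in strict mode for licenses the reader refuses to accept."""


def check_license(value, strict=False):
    """Return True if `value` is on the allow-list; warn or abort otherwise.

    In strict mode, anything that is neither on the allow-list nor
    "other:unspecified" aborts the read by raising LicenseError.
    """
    if value in ALLOWED_SPDX_IDS:
        return True
    if strict and value != UNSPECIFIED:
        raise LicenseError(f"refusing to read data licensed as {value!r}")
    warnings.warn(f"non-standard or unspecified data license: {value!r}")
    return False
```

With strict=False every value outside the allow-list only produces a warning, so existing workflows keep running; strict=True is the opt-in "abort" behavior.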

tgamblin commented 4 years ago

Just going to jump in here to clarify some points.

RE: @DavidSagan:

Hmmm. I'm not sure what the rules are in other countries but in the USA it is not possible to copyright information. Only the expression of that information can be copyrighted.

True. However, a file or a specific output is an expression of information.

RE:

But if an openPMD dataset is shared I do not want to have to worry about rights. I do not like the idea that my actions in using an openPMD data set may set me up for being sued.

There seems to be some misunderstanding here. US copyright law is very clear that by default, all rights are reserved on any copyrightable work. Open source licenses are necessary for that reason -- they grant the right to copy copyrighted work, they waive things like implicit warranties, etc. Without a license, someone can sue you for doing anything at all with their data or code, even if they posted it to GitHub or some other public site.

It seems like you're arguing that "raw facts" are not copyrightable, and that since the OpenPMD format is open and OpenPMD data is just physics facts, no OpenPMD files will ever be copyrightable. That's dubious at best -- you can find cases like this one with all kinds of arguments over the copyrightability of output. Given that it takes some serious knowhow to set up an OpenPMD run, I think it would be easy for someone to claim that the output of their particular run is copyrightable. It would take a lot more legal precedent than currently exists to prove that OpenPMD data are "pure facts", and that it doesn't take some ingenuity to select an interesting problem from the entire space of OpenPMD inputs. Moreover, I'm pretty sure OpenPMD outputs are not a pure function of the inputs. The machine and environment very likely matter at least somewhat.

Anyway, the specific arguments don't really matter. Because the default is that all rights are reserved, the burden is on the user to show that OpenPMD files are "facts" they can "just use". So it's the unlabeled case where the IP rights are murky. See the Open Data Commons FAQ:

Do I need this legal stuff, can’t I just post my data online? The simple answer is: no -- and yes you do need this legal stuff. Whether one likes it or not there are a whole bunch of jurisdictions in the world where there are IP rights in data(bases). Thus if you want your data to be open, even if that means public domain, you need to apply a license (or something very like a license).

So, RE:

The bottom line for me is still that if someone hands me a datafile I do not want to have to check if a certain field in a data file gives me permission to use it.

You already have to check. A license as proposed simplifies the process. Without it, you have to check with the author by email or something similarly cumbersome. With it, the rights are clearly enumerated, as with open source software licenses, and all you have to do is look at a familiar SPDX descriptor.

ax3l commented 4 years ago

Thank you for the thoughts, feedback and context.

I updated the proposed text in the description accordingly.

DavidSagan commented 4 years ago

But if an openPMD dataset is shared I do not want to have to worry about rights. I do not like the idea that my actions in using an openPMD data set may set me up for being sued.

There seems to be some misunderstanding here. US copyright law is very clear that by default, all rights are reserved on any copyrightable work...

It seems like you're arguing that "raw facts" are not copyrightable, and that since the OpenPMD format is open and OpenPMD data is just physics facts, no OpenPMD files will ever be copyrightable. That's dubious at best

OK so here we need to separate US law from, say EU law.

For US law indeed facts are not copyrightable. From https://www.copyright.gov/help/faq/faq-protect.html:

"Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed."

The article you cite is about copyrightable expression. It is not about facts.

The openPMD syntax represents expression of facts and so is copyrightable. However, if someone creates an openPMD file, they cannot claim copyright since they do not have a copyright on the openPMD syntax. So in the US, someone who creates and distributes an openPMD file will not be able to claim any rights to the file.

For EU law there is the concept of Database Rights (https://en.wikipedia.org/wiki/Database_right). This is not a copyright but a separate right. As in the US, copyright cannot be claimed on an openPMD file, but a database right can be claimed.

You already have to check. A license as proposed simplifies the process. Without it, you have to check with the author by email or something similarly cumbersome. With it, the rights are clearly enumerated, as with open source software licenses, and all you have to do is look at a familiar SPDX descriptor.

To my way of thinking it is less cumbersome for any database rights notices to be exterior to the data file(s). I don't want my running programs to have to worry about checking for rights. The proper place for notifying people about possible rights problems is well before any programs are run. Also, when someone creates a file I don't want my programs to have to worry about asking what kind of license they want to use.

If you are really serious about database rights, think about the consequences. At least in accelerator physics, data files that are passed around never have rights notices, so if an accelerator physicist were forced to check the rights for every data file they get, this would represent a horrible waste of time and effort. Therefore I am strongly opposed to having a licensing field.

In fact there is a solution where no one has to ever worry about database rights. The solution is to use copyleft and have the openPMD standard mandate that database rights cannot be asserted on openPMD files. Of course this would not prevent someone keeping their data private if they want.

ax3l commented 4 years ago

I fear you still assume that binary files that some of us sometimes call data bases are not subject to copyright. This ain't the case.

For US law indeed facts are not copyrightable.

That is not correct. US law explicitly talks about "Uncreative collections of facts". That is very different from mere facts.

The binary files in question here do not distribute only indices of data (such as library book lists or telephone books) but actual data as well.

JPEG's, movie streams, et al. are all copyrightable data. If I convert a movie to another format, be it different encoding or an HDF5 file, this does not remove the "creative spark" of a filmed scene that triggers copyright. Filming a river will not create "uncreative facts" either.

The same is true if a scientist comes up with a simulation setup or a measurement setup. The output of such a simulation is more than the cited "Uncreative collections of facts" that are excluded in US law. Again, the discriminating point is creativity in the copyright sense, which is intentionally an extremely low bar and is bound neither to the medium that records it nor to anything that would be considered scientifically creative (novelty, variation, etc.).

It's also not our burden to implement a programmatic verification of license meta-data in our I/O routines. A scanner-printer also does not check if I am replicating a copyrighted image. An e-mail program does not check if one sends out a copyrighted .mp3 file. That's the task of users that combine data and applications. If they ignore where the data came from, aka if they do not know the rights or even authorship, then everything stays as is. This does not mean that they had the rights before or that it is good practice to pass around files from unknown sources (for more reasons than copyright). But we can assist them by giving a standardized index to check against, if they choose so.

If we want to improve the situation there is a simple solution: use licenses and code programs to select a data license when creating output. For example, explicitly licensing all created data under the CC0 license is as close as it gets to "don't worry" (inform your users about this in the software license & input). As a meta-data standard, we have no say over the rights to the data, as already argued above. Our meta-data standard is licensed with a permissive CC-BY license, which, simply put, only covers our creative arrangement of the data/text which we specify herein.
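On the writer side, selecting a license when creating output might be as small as the following sketch. Everything here is hypothetical: a plain dict stands in for the attributes of the root group (/), the input key data_license is invented, and defaulting to CC0-1.0 is the assumed policy of an imaginary simulation code, not something the standard prescribes.

```python
# Hypothetical writer-side sketch: a plain dict stands in for the attributes
# of the root group (/) of an openPMD series. No real I/O library is used.
CODE_DEFAULT_LICENSE = "CC0-1.0"  # assumed policy of this imaginary code


def make_root_attributes(config):
    """Build root attributes, always recording an explicit data license.

    `config` is the user's input; its optional "data_license" key (an
    invented name) overrides the code-wide default.
    """
    return {"license": config.get("data_license", CODE_DEFAULT_LICENSE)}
```

A user who wants different terms sets the key in the input, e.g. make_root_attributes({"data_license": "CC-BY-4.0"}); everyone else gets the documented default without having to think about it.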

ax3l commented 4 years ago

Just for your information, the "mere physical facts" collected at CERN are also licensed appropriately: http://opendata.cern.ch/record/201

They have thought out and automated the handling of these unavoidable consequences.

openPMD / openPMD-standard

License Keyword #219