openaddresses / openaddresses-ops

Issues-only repo for discussion of operational considerations for OA

Proposal: license principles #3

Open NelsonMinar opened 9 years ago

NelsonMinar commented 9 years ago

I've written up a proposal on license policy. I'd appreciate comments, edits, etc.

This proposal contains things that are changes or that I suspect are not a consensus view. I've tried to call out the biggest ones with comments at the end. Feel free to disagree with me on the underlying intent of my proposal! In particular I'm arguing we should be promiscuous in accepting any redistributable sources no matter how restrictive their use requirements are. That's a big change but one I'd like to see made, in a way that still makes it easy for users to subset only the data with fewer restrictions.

migurski commented 9 years ago

I agree with basically all of it. A bunch of the recommendations all point in the direction of caching and including the license with the data downloads. Now that we are using .zip files for output, they should include a copy of the license in some text-like format.
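A minimal sketch of bundling a license copy into a download, assuming Python's standard zipfile module. The file names (example-source.csv, LICENSE.txt), the CSV columns and the license text are placeholders for illustration, not the project's actual layout.

```python
import io
import zipfile


def build_source_zip(csv_text: str, license_text: str, source_name: str) -> bytes:
    """Return zip bytes holding the address CSV plus a plain-text license copy."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Hypothetical member names; the thread only asks for "some text-like format".
        archive.writestr(f"{source_name}.csv", csv_text)
        archive.writestr("LICENSE.txt", license_text)
    return buffer.getvalue()


archive_bytes = build_source_zip(
    "LON,LAT,NUMBER,STREET\n-66.64,45.96,10,Main St\n",
    "Example license text cached from the source.\n",
    "example-source",
)
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    print(archive.namelist())
```

Caching the license text at build time means the download still documents its terms if the original license URL later disappears.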

sbma44 commented 9 years ago

My apologies for not seeing this until now. The Mapbox firehose puts hundreds and hundreds of tickets in my notifications tab every day, so when I'm not tagged I sometimes miss things for embarrassingly long amounts of time.

I think this is good, but my opinion does differ on a few things, as you anticipate. There's certainly nothing here I couldn't come to grips with, but let me press my case:

OA will collect any data whose license allows redistribution. Some source data has restrictions such as required attribution, non-commercial use only, or a viral licensing requirement on derived works. We will focus our efforts on collecting data that has few or no restrictions. But in principle anything redistributable could go into OpenAddresses. (This statement is a change. Currently we don't include some data with significant requirements on derived works.)

The big one. My goal remains fixing the OSM geocoding guidance. I think this is important: OSM is where low-level admin boundaries, POIs and tasks like address conflation should live. I think it's quite powerful to be able to point to OA and say "look, this is the open data address project; in fact, the current geocoding guidance has made OSM irrelevant for open address data." Similarly, OA's status as the premier open address data project gives it normative power, so that when people working on open address data come to us with an ODbL dataset we can say sorry but no, that data is not sufficiently open, and you have more work to do. This will have costs in data and hurt feelings but is the way that understanding will eventually spread.

I'd hate to lose these rhetorical tools. And I question the value of the ODbL datasets we might collect. AFAICT it's a few communities in France and Latin America, yes?

With all that said, I understand that I might be in the minority.

The primary publication of OA is the output address CSV files. This collection itself is licensed by CC0.

I think referring to the collection (and specifically the CSVs) as something we have rights to license is probably not a great idea -- it's both legally iffy and confusing to users. We can license the code and the JSON files, though.

We should consider grouping data by general license type. Pure public domain / CC0, requires attribution, non-commercial, and ODbL are four possible groupings.

:+1: -- I'll add that getting attribution strings ready for machine use in every attribution field is also something I'd love to see done. Ideally users would concat those fields, stick the result somewhere appropriate and be in compliance.

Ultimately responsibility for license clearance lies with our users, not us. We may need contractual language to enforce this, in which case CC0 will not be sufficient.

Agreed on both counts. The project materials should use a license with aims in line with BSD/MIT/ISC, but (sigh) it might be desirable to write something custom that specifically disclaims liability for unlicensed use of materials that OA collects. This can be rolled into https://github.com/openaddresses/openaddresses-ops/issues/1

migurski commented 9 years ago

I think referring to the collection (and specifically the CSVs) as something we have rights to license is probably not a great idea -- it's both legally iffy and confusing to users. We can license the code and the JSON files, though.

I interpreted this to mean that the description of the collection rather than its content is what we’re licensing CC0. I think we are all in agreement?

:+1: -- I'll add that getting attribution strings ready for machine use in every attribution field is also something I'd love to see done. Ideally users would concat those fields, stick the result somewhere appropriate and be in compliance.

Ooh yeah. Maybe it’s a question of building out the license metadata: link, rights, attribution, etc.
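The "concat those fields" idea could look roughly like the sketch below. It assumes each source record exposes a machine-ready "attribution" string; the record shape, the source names and the "; " separator are all assumptions made for illustration.

```python
def combined_attribution(sources):
    """Join unique, non-empty attribution strings in first-seen order."""
    seen = []
    for source in sources:
        text = (source.get("attribution") or "").strip()
        if text and text not in seen:
            seen.append(text)
    return "; ".join(seen)


# Hypothetical source records, not real OpenAddresses sources.
sources = [
    {"name": "source-a", "attribution": "Example City Open Data"},
    {"name": "source-b", "attribution": ""},
    {"name": "source-c", "attribution": "Example County GIS"},
    {"name": "source-d", "attribution": "Example City Open Data"},
]
print(combined_attribution(sources))
```

Sources with no attribution requirement contribute nothing, and repeated strings appear once, so the result can be pasted into a credits page as-is.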

NelsonMinar commented 9 years ago

Thanks for the feedback, Tom! I've been a bit bummed not to make any progress on this proposal, so your comments are really great.

I'm sympathetic to the idea that ODbL is not open enough to include in OA. To me the sticking point with the ODbL is the viral requirement on derivative works and what that means for geocoding applications. I understand that's "being clarified" with OSM, but to me any sort of viral license is really not suitable for an open data application. I don't want users of OA data being compelled to license their own stuff in any way at all. (Honestly, even attribution feels like asking a lot.) My initial proposal of multiple sets with different licenses was an attempt to include all possible data while still allowing users who don't want ODbL restrictions to get an ODbL-free version of the data. But maybe it's stronger for us to take a stance that ODbL data isn't worth including at all.

If we exclude ODbL we lose a very few primary sources; all I know about is a few French towns. The main thing we lose is data from OSM, a hypothetical extract of their address points. I'm OK with that, particularly if folks like my issue #2 proposal where OA's purpose is as a repository of sources with specific authority. But if we go that direction then I think we should articulate what "open enough to be in OA" does mean.

migurski has it right on what I meant by "This collection itself is licensed by CC0"; I meant whatever copyright we might have on the collection we produce, rather than the underlying facts. But I'm out of my depth in understanding database copyrights here and will gladly defer to legal expertise. I just want the license for the IP the OA project itself creates to be as permissive as possible.

migurski commented 9 years ago

For myself, I’d prefer to include things and mark them with license terms than to exclude things because they’re virally-licensed.

iandees commented 9 years ago

I agree that whatever data we have generated (the source list is the main thing I'm thinking of) should be licensed CC0.

In the interest of accepting any and all data we possibly can, I would tend to agree with your latest comment, @migurski: we should accept virally-licensed data sources.

I have the "feeling" that the vast majority of current data sources (probably all of them?) are CC-BY at the most, and we should strive to keep our output product at that level by not including virally-licensed data in the default output product.

NelsonMinar commented 9 years ago

I do like the idea of OA making a principled statement that ODbL data is not "open enough" for our goals. If we do decide to include ODbL or other challenging licenses, perhaps we should keep it out of the primary product we publish and just have it as an adjunct? A precedent for this split is Debian and Ubuntu's "non-free" repository, which the user has to manually add to their system.

sbma44 commented 9 years ago

I meant whatever copyright we might have on the collection we produce, rather than the underlying facts. But I'm out of my depth in understanding database copyrights here and will gladly defer to legal expertise. I just want the license for the IP the OA project itself creates to be as permissive as possible.

Got it. I think we're all in agreement in the goal, but I'm still worried that saying "these CSVs are CC0" is leading a lot of people to incorrect conclusions about how they can use them (it certainly confused me when I first started using the project). Maybe we need to just sidestep this and say something like

The OpenAddresses project asserts no copyright, database right or other intellectual property interest in the database(s) it makes available. Depending on the source and your own jurisdiction, your use of the data may be subject to license terms concerning the use of the data itself. OpenAddresses makes no guarantees regarding the accuracy, applicability or legal status of any information served through or organized by the project. Use of the material is at your own risk.

(I can get an actual lawyer to do a version of this in a few weeks)

sbma44 commented 9 years ago

I have the "feeling" that the vast majority of current data sources (probably all of them?) are CC-BY at the most, and we should strive to keep our output product at that level by not including virally-licensed data in the default output product.

Yeah, I feel pretty comfortable affirming this: CC-BY and licenses like it are by far the most popular way of releasing this kind of data, based on the international research I've done. The value to be had by allowing ODbL seems marginal to me--we thought it was freezing us out from France, but that turned out to be essentially a trick by ODbL activists in France. Episodes like that make me think that an aggressive position on acceptable licensing is really important. OpenAddresses is currently one of the most effective arguments for dropping ODbL in OSM and elsewhere. Without projects like this one making a point of the problems, it's too easy to dismiss ODbL criticism as self-interested corporatism (though to be clear I spent time opposing sharealike while at Sunlight, too).

(Also, not the biggest deal, but: I'll be speaking at FOSS4G about this stuff, to an audience of hopefully-receptive Asian geodata people, and it will kind of nerf the talk if I'm obliged to leave ODbL on the menu of licenses they should/can be thinking about.)

With all of that said, the compromise that seems to be emerging certainly strikes me as reasonable (and I appreciate it!). The cost/benefit just seems insufficient.

migurski commented 9 years ago

I do like the idea of OA making a principled statement that ODbL data is not "open enough" for our goals.

For what it’s worth, I don’t share this view. If an authority chooses a license allowing redistribution but stipulates that attribution or share-alike are required, that’s up to them. I’m curious, Ian and Nelson, what it would mean to keep ODbL out of our primary product. That data would still be an individual download, but maybe you mean that it would not be included in grouped extracts?

migurski commented 9 years ago

I like the language in the “asserts no copyright” paragraph you suggest, Tom.

NelsonMinar commented 9 years ago

That data would still be an individual download, but maybe you mean that it would not be included in grouped extracts?

Something like that? In the proposal document I suggest

We should consider grouping data by general license type. Pure public domain / CC0, requires attribution, non-commercial, and ODbL are four possible groupings.

I guess I have in mind we still have individual-source.csv files, one per source. Then we have one giant openaddresses-free.csv that does not include ODbL data, and another openaddresses-odbl.csv that only includes ODbL. And maybe more groups than that, one per license. Proliferation becomes a problem if we also want to publish geographic groups; N regions times M licenses gets to be a lot of files.
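To make the N regions times M licenses concern concrete, here is a small sketch that enumerates one file per pair. The region names and the openaddresses-REGION-GROUP.csv naming pattern are hypothetical; the four license groups follow the groupings floated in the proposal.

```python
from itertools import product


def grouped_filenames(regions, license_groups):
    """List one output file name per (region, license group) pair."""
    return [
        f"openaddresses-{region}-{group}.csv"
        for region, group in product(regions, license_groups)
    ]


regions = ["us-west", "us-east", "europe"]      # hypothetical regions
license_groups = ["free", "by", "nc", "odbl"]   # groupings floated in the proposal
names = grouped_filenames(regions, license_groups)
print(len(names))  # 3 regions x 4 groups = 12 files
```

Even this toy case yields twelve files, which is the proliferation being described.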

migurski commented 9 years ago

Yes, that makes sense. The two responsibilities I’m familiar with, attribution and share-alike, seem to come in BY and BY-SA but rarely SA-only variants, so how about groups like oa-all.csv, oa-by.csv, and oa-by-sa.csv with the last one containing possible ODbL sources?

I see a license terms language hunt in our future.

migurski commented 9 years ago

The optional license tag may need to grow to accommodate more than a string, e.g.:

"license": {
    "url": "http://geonb.snb.ca/downloads/documents/geonb_license_e.pdf",
    "attribution-string": "GeoNB – www.snb.ca/geonb",
    "attribution": true,
    "share-alike": false
}

sbma44 commented 9 years ago

I'd like to lobby hard for not combining into oa-all. This would effectively make the largest and therefore probably most-popular download a sharealike product. That would be a complete surrender on this point. Collecting this stuff and keeping it siloed I can live with, having OA become viral sharealike by default... yikes

migurski commented 9 years ago

Yeah, I like that.

So the default product could have no restrictions, and you could choose to accept additional restrictions by downloading something else? In terms of restrictions, oa.csv < oa-by.csv < oa-by-sa.csv.

In terms of data volume, maybe the most restrictive download also includes the freely-licensed stuff. It’s understood as a signal that the contents might include share-alike terms. So, oa.csv < oa-by.csv < oa-by-sa.csv also.

NelsonMinar commented 9 years ago

I mildly agree with sbma44 that an oa-all is undesirable. Fortunately CSV files can just be concatenated together. To me that suggests having separate disjoint datafiles, oa.csv, oa-by.csv, and oa-by-sa.csv. Users can download whichever ones work for them and combine them.
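A sketch of combining disjoint downloads, assuming every file starts with the same header row (which naive byte-level concatenation would repeat). The column names and rows are placeholders, and concat_csv is a hypothetical helper rather than anything in the OA codebase.

```python
import csv
import io


def concat_csv(texts):
    """Concatenate CSV documents that share a header, keeping the header once."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in texts:
        rows = csv.reader(io.StringIO(text))
        first = next(rows, None)
        if first is None:
            continue  # skip empty files
        if header is None:
            header = first
            writer.writerow(header)
        elif first != header:
            raise ValueError("CSV headers differ; cannot concatenate safely")
        writer.writerows(rows)
    return out.getvalue()


oa = "NUMBER,STREET\n10,Main St\n"       # stands in for oa.csv
oa_by = "NUMBER,STREET\n22,Oak Ave\n"    # stands in for oa-by.csv
combined = concat_csv([oa, oa_by])
print(combined)
```

A user who accepts attribution terms would pass oa.csv and oa-by.csv; leaving oa-by-sa.csv out of the list keeps share-alike data out of the result.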

sbma44 commented 9 years ago

yeah, I think keeping things distinct is preferable.

migurski commented 9 years ago

:+1:

migurski commented 9 years ago

If all goes according to plan, we’re going to be seeing README files with links to licenses in the OA downloads pretty soon. Our next move here would seem to be a classification of these licenses to support the BY/BY-SA download ideas above.
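One way the classification step might look, as a sketch only. It assumes the structured license object suggested earlier in the thread (keys "url", "attribution", "share-alike") and still accepts the older plain-string form; the "unclassified" bucket and both function names are assumptions of this sketch.

```python
def normalize_license(value):
    """Accept a plain string or an object and return a dict of terms."""
    if isinstance(value, str):
        # A bare string carries no machine-readable terms.
        return {"url": value, "attribution": None, "share-alike": None}
    return {
        "url": value.get("url"),
        "attribution": value.get("attribution"),
        "share-alike": value.get("share-alike"),
    }


def download_group(value):
    """Map license terms to one of the download names floated in the thread."""
    terms = normalize_license(value)
    if terms["attribution"] is None or terms["share-alike"] is None:
        return "unclassified"  # needs a human to read the license
    if terms["share-alike"]:
        return "oa-by-sa.csv"
    if terms["attribution"]:
        return "oa-by.csv"
    return "oa.csv"


geonb = {
    "url": "http://geonb.snb.ca/downloads/documents/geonb_license_e.pdf",
    "attribution": True,
    "share-alike": False,
}
print(download_group(geonb))
```

Sources whose terms are unknown stay out of every grouped file until someone reviews them, which matches the cautious default argued for above.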