License terms for scraped addresses

geobrando commented 9 years ago

Per my discussion with @iandees and @slibby in https://github.com/openaddresses/openaddresses/pull/1363#issuecomment-150688521 and as an extension to https://github.com/openaddresses/openaddresses-ops/issues/3, opening this issue to spur discussion about the license terms that should be assumed for addresses scraped from servers when no explicit license or copyright terms exist in the service description.

This is really out of my area of expertise, but I suspect that @sbma44 or @ajturner might have an informed opinion on this. The web scraping Wikipedia page has information on existing case law that might be relevant here. Based on the comment made by @ingalls regarding this Russian source, I suspect that this might need to be a determination made for each country.

sbma44 commented 9 years ago

Unfortunately there is no general rule for scraped data. Under US law you've got a decent chance claiming address data doesn't enjoy copyright protection (or so I've been told by IP lawyers), but the CFAA could still be relevant. And who knows what laws apply in the jurisdiction you're scraping.

In the case of Canberra, the Australian Capital Territory has gone to the trouble of setting up an open data portal. If you can find a blanket open data policy, that might be good enough. I've found some geodata portals that have completely unclear terms but which exist as subdomains under a government website with a lovely CC icon on it -- good enough for me. I went looking for something similar in this case, though, and was unable to find one. I'll also note that I spoke with some ACT personnel about a year ago during my initial Australia research -- at that time, at least, their data had some complicated license connections that they were trying to untangle.

If an open data document can't be located and no one will give clearance via email, I think it's probably fine to collect the data but to note the circumstances in the license field.

iandees commented 9 years ago

None of our sources are really "scraped" in the sense that we're extracting data from a website that doesn't make it readily accessible. We're not downloading HTML and extracting information out of it: we're using an API that the municipality has set up and exposed for programmatic use.

The polite thing to do would be to ask for explicit permission (and heck, while we're at it we could ask for a bulk data download if it would ease their resource burden), but there are readily-documented ways of wrapping access control around these endpoints. Wouldn't they use it if they didn't want it open and available?

sbma44 commented 9 years ago

^^^ this is an important point and one that I shouldn't have glossed over. I don't think that using a mapserver endpoint is quite the same as scraping.

With that said there are nonstandard interfaces that I think we could and maybe should consider scraping, so worth discussing.

geobrando commented 9 years ago

Agreed that there should not be a legal issue extracting data from a web service that was created to allow programmatic access. I suppose the greater question I was posing about licensing applies to any data, whether it be from a service endpoint or a direct download, when we are unable to determine the license terms.

OK, so if there's no general rule we can follow when the licensing is unknown, I guess I'm wondering two things:

Are we comfortable redistributing the data when we're unsure of the licensing?
Along the lines of what @sbma44 wrote, should we be issuing some kind of disclaimer stating that the license terms are unknown along with the data?

sbma44 commented 9 years ago

I think this has to be yes. We make a best effort to establish and communicate terms and we respond quickly to takedown requests. But agencies often have contradictory policies or published statements.
Definitely in favor of clearer statements of licensing policy. Being US-based puts us in a good position though, especially if we're responsive to takedowns.

riordan commented 9 years ago

@sbma44 Given that we'd want to be responsive to takedowns, should the project register an agent with the copyright office to direct notifications to?

As I recall, registering (and being approved for DMCA Safe Harbor) is necessary to receive safe harbor protection.

migurski commented 9 years ago

Yes, we definitely should. We’ve previously discussed incorporating and registering.

sbma44 commented 9 years ago

Our lawyer's been swamped lately but I will inquire about getting this process going.

sbma44 commented 9 years ago

OK, DMCA side of things is pretty straightforward: http://copyright.gov/onlinesp/agent.pdf

I am waiting for an answer on whether we have to be incorporated first or not, but then can proceed with the paperwork (Mapbox can pick up the fee). Who should be the designated agent? @iandees? Happy to do this myself if it's an imposition. AFAIK we have yet to get a single notice since Mapbox registered, so I would expect this to be (very) low-volume.

jharpster commented 7 years ago

Hi All. Sorry to revive an old thread but in trolling the repo(s) this looks like the right place to discuss license issues. Looking through some of the US data we've found in most cases License: Unknown. Digging deeper we have seen that this can be inconsistent, and in at least a couple of cases contradictory to text listed on the source web sites. Has this raised any red flags for anyone? Specifically thinking about the commercial vs non-commercial conditions.

migurski commented 7 years ago

Hey @jharpster, this is the right place to talk! It’s true that a lot of the U.S. data has undefined license terms. We’ve largely moved on the assumption that government-sourced data is free for use and redistribution unless we see otherwise. It’s not raised any flags so far, but it’s good that you’re asking.

The suggestion of a new NC flag for licenses in each source (alongside the existing BY and SA) has been made, but not acted upon. If you have any advice or suggestions, we’d love to hear them.

jharpster commented 7 years ago

Hi @migurski, I think your assumption is the right one to make since publication of the data is the intent of public APIs. Adding an NC flag alongside BY and SA might be a slippery slope towards some really thorny licensing issues beyond the scope of the project.

It might be worth adding a caveat on the download page since much of the data is not, strictly speaking, 'freely shareable'. Because the data is scraped and not user-curated, I suppose any update plans should include updating source links and license terms as well.
