Open geobrando opened 9 years ago
Unfortunately there is no general rule for scraped data. Under US law you've got a decent chance claiming address data doesn't enjoy copyright protection (or so I've been told by IP lawyers), but the CFAA could still be relevant. Or who knows what laws apply in the jurisdiction you're scraping.
In the case of Canberra, the Australian Capital Territory has gone to the trouble of setting up an open data portal. If you can find a blanket open data policy, that might be good enough. I've found some geodata portals that have completely unclear terms but which exist as subdomains under a government website with a lovely CC icon on it -- good enough for me. I went looking for something similar in this case, though, and was unable to find one. I'll also note that I spoke with some ACT personnel about a year ago during my initial Australia research -- at that time, at least, their data had some complicated license connections that they were trying to untangle.
If an open data document can't be located and no one will give clearance via email I think it's probably fine to collect but to note the circumstances in the license field.
None of our sources are really "scraped" in the sense that we're extracting data from a website that doesn't make it readily accessible. We're not downloading HTML and extracting information out of it: we're using an API that the municipality has set up and exposed for programmatic use.
The polite thing to do would be to ask for explicit permission (and heck, while we're at it we could ask for a bulk data download if it would ease their resource burden), but there are readily-documented ways of wrapping access control around these endpoints. Wouldn't they use it if they didn't want it open and available?
^^^ this is an important point and one that I shouldn't have glossed over. I don't think that using a mapserver endpoint is quite the same as scraping.
With that said there are nonstandard interfaces that I think we could and maybe should consider scraping, so worth discussing.
Agreed that there should not be a legal issue extracting data from a web service that was created to allow programmatic access. I suppose the greater question I was posing about licensing applies to any data, whether it be from a service endpoint or a direct download, when we are unable to determine the license terms.
OK, so if there's no general rule we can follow when the licensing is unknown, I guess I'm wondering two things:
@sbma44 Given that we'd want to be responsive to takedowns, should the project register an agent with the copyright office to direct notifications to?
As I recall registering (and being approved for DMCA Safe Harbor) is necessary to receive safe harbor protection.
Yes, we definitely should. We’ve previously discussed incorporating and registering.
Our lawyer's been swamped lately but I will inquire about getting this process going.
OK, DMCA side of things is pretty straightforward: http://copyright.gov/onlinesp/agent.pdf
I am waiting for an answer on whether we have to be incorporated first or not, but then can proceed with the paperwork (Mapbox can pick up the fee). Who should be the designated agent? @iandees? Happy to do this myself if it's an imposition. AFAIK we have yet to get a single notice since Mapbox registered, so I would expect this to be (very) low-volume.
Hi All. Sorry to revive an old thread but in trolling the repo(s) this looks like the right place to discuss license issues. Looking through some of the US data we've found in most cases License: Unknown. Digging deeper we have seen that this can be inconsistent, and in at least a couple of cases contradictory to text listed on the source web sites. Has this raised any red flags for anyone? Specifically thinking about the commercial vs non-commercial conditions.
Hey @jharpster, this is the right place to talk! It’s true that a lot of the U.S. data has undefined license terms. We’ve largely moved on the assumption that government-sourced data is free for use and redistribution unless we see otherwise. It’s not raised any flags so far, but it’s good that you’re asking.
The suggestion of a new NC flag for licenses in each source (alongside the existing BY and SA) has been made, but not acted upon. If you have any advice or suggestions, we’d love to hear them.
Hi @migurski, I think your assumption is the right one to make since publication of the data is the intent of public APIs. Adding an NC flag alongside BY and SA data might be a slippery slope towards some really thorny licensing issues beyond the scope of the project.
It might be worth adding a caveat on the download page since much of the data is not, strictly speaking, 'freely shareable'. Because the data is scraped and not user curated I suppose any update plans should include updating source links and license terms as well.
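For illustration only, here is a rough sketch of how per-source license flags could drive a caveat on the download page. The record layout and every field name (`text`, `attribution`, `share_alike`, `non_commercial`) and source name are assumptions made up for this example, not the project's actual source schema, and the NC flag is only the proposal discussed above.

```python
# Hypothetical source records. Field names are illustrative assumptions,
# not the real OpenAddresses source schema.
SOURCES = [
    {
        "name": "example/unknown-terms",
        "license": {"text": None, "attribution": False,
                    "share_alike": False, "non_commercial": None},
    },
    {
        "name": "example/cc-by",
        "license": {"text": "CC BY 4.0", "attribution": True,
                    "share_alike": False, "non_commercial": False},
    },
]


def license_caveats(source):
    """Return human-readable caveats for one source record."""
    lic = source.get("license") or {}
    notes = []
    # A missing or empty license text is treated as "Unknown".
    if not lic.get("text"):
        notes.append("license terms unknown")
    if lic.get("attribution"):
        notes.append("attribution required (BY)")
    if lic.get("share_alike"):
        notes.append("share-alike (SA)")
    # non_commercial is tri-state: True, False, or None (not determined).
    nc = lic.get("non_commercial")
    if nc is True:
        notes.append("non-commercial use only (NC)")
    elif nc is None:
        notes.append("commercial use not confirmed")
    return notes


if __name__ == "__main__":
    for src in SOURCES:
        print(src["name"], "->", "; ".join(license_caveats(src)) or "no caveats")
```

Keeping `non_commercial` tri-state is a deliberate choice in this sketch: it separates "confirmed fine for commercial use" from "nobody has checked", which is the situation most of the Unknown-license sources are in.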
Per my discussion with @iandees and @slibby in https://github.com/openaddresses/openaddresses/pull/1363#issuecomment-150688521 and as an extension to https://github.com/openaddresses/openaddresses-ops/issues/3, opening this issue to spur discussion about the license terms that should be assumed for addresses scraped from servers when no explicit license or copyright terms exist in the service description.
This is really out of my area of expertise, but I suspect that @sbma44 or @ajturner might have an informed opinion on this. The web scraping Wikipedia page has information on existing case law that might be relevant here. Based on the comment made by @ingalls in regard to this Russian source, I suspect that this might need to be a determination made for each country.