openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Licensing question: is using the OSM-trained model with proprietary data allowed? #291

Closed: antimirov closed this issue 6 years ago

antimirov commented 6 years ago

Hi guys,

I only recently learned about libpostal, and it's a great library. Thank you for it.

I'm not sure if it's okay to post this here as a GitHub issue. I have a question from one of my colleagues. He is concerned about a comment in one of the issues on this GitHub page. It was about training the model on your own data, and the reply from one of the libpostal developers was:

It's not recommended to train on your own dataset, because then the bug reports cannot be accepted (input affects output). And also, I think, it can be non-trivial to pre-process the data yourself and then train properly.

But consider a company that sells a geocoding service (based 100% on its own address+xy data) and uses libpostal for some of the initial parsing/expansion of input addresses. What's the license situation there? Does the license apply to the model files that are downloaded during the 'make' stage? Do the following files, for example, contain OSM data that is under 'CC-BY'?

Where can I read more about legal statuses of NLP models derived from OSM?

Thank you!

albarrentine commented 6 years ago

Hi Eugene - as far as I'm aware, this is the first publicly released NLP model built from OSM, so I'm not sure there's much available on that specific topic. There has, however, been plenty of legal analysis of the ODBL and of which use cases trigger its share-alike obligations (see: http://wiki.openstreetmap.org/wiki/Category:Open_Data_Licence/Community_Guidelines). Also, most parametric machine learning models can generally be thought of as analyses of data, i.e. the original data cannot be reconstructed from the statistics or learned parameters.

As an example, consider a countrywide Census which might collect some personally identifiable information about citizens. While publishing an individual's data (or a strong proxy for it) would violate people's privacy, there'd be no legal issue with publishing a median statistic or other aggregation for a particular area, as long as the population is sufficiently large that the statistics do not directly identify people. Requiring that analyses be released on the same terms as the underlying data would make most research impossible.

In machine learning and statistics, we have the notion of a parametric model vs. a non-parametric model. Sticking with the Census example, a parametric model would be publishing statistics like the mean and median, which are parameters that are fit to the data. A non-parametric model (with the exception of Bayesian non-parametrics, where the term is used slightly differently) would be if we kept every Census record in the training data around and wanted to fit a local "neighborhood" model, where we'd first find the nearest neighbors for a candidate record and then build a local model for those few neighbors. In that case, the model does need to keep data around, and would trigger a share-alike license. However, libpostal, and most supervised machine learning models you'll come across, are parametric, i.e. they don't store the original data, and thus are not subject to the terms of the underlying data.
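To make the Census distinction concrete, here is a minimal Python sketch. The records, the function `fit_parametric`, and the class `NearestNeighborModel` are all invented for illustration and have nothing to do with libpostal's code or any real Census data. The parametric "model" keeps only two summary numbers, while the nearest-neighbor model has to carry every training row with it.

```python
from statistics import mean, median

# Hypothetical toy "census" rows: (age, household_income). Invented for illustration.
records = [(34, 52000), (41, 61000), (29, 48000), (55, 75000), (38, 58000)]


def fit_parametric(rows):
    """Parametric: keep only summary parameters; the rows are not stored."""
    incomes = [income for _, income in rows]
    return {"mean_income": mean(incomes), "median_income": median(incomes)}


class NearestNeighborModel:
    """Non-parametric: every training row must be kept to answer a query."""

    def __init__(self, rows):
        self.rows = list(rows)  # the original data ships with the model

    def predict_income(self, age, k=2):
        # Find the k rows whose age is closest, then average their incomes.
        nearest = sorted(self.rows, key=lambda row: abs(row[0] - age))[:k]
        return mean(income for _, income in nearest)


params = fit_parametric(records)
knn = NearestNeighborModel(records)
```

Here `params` holds two numbers from which the five rows cannot be recovered, whereas `knn.rows` is a full copy of the training data.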

The model files which are downloaded as part of installing libpostal contain only learned representations, and do not trigger the share-alike obligations of the ODBL, simply because we don't store any OSM data, just a learned representation of how various words and phrases relate to address components. There's no way to reconstruct the original OSM addresses from our data, so we're not even in the realm where the ODBL applies, or at the very most libpostal would constitute a Produced Work (see OSM Legal FAQ). It would be as if you read every address in OSM, learned that words ending in "...straat" are usually street names in the Netherlands, and wrote that fact down, not the individual evidence for how you learned it. You can use libpostal freely to parse (and now dedupe) addresses from your proprietary data, commercial or non-commercial.
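The "...straat" analogy can be sketched in a few lines of Python. This is only a toy under stated assumptions: the labeled tokens are invented rather than taken from OSM, and `learn_suffix_counts` and `guess_label` are hypothetical helpers, not libpostal's actual features or training code. The point is that what gets written down is a count per suffix, not the addresses the count came from.

```python
from collections import Counter, defaultdict

# Hypothetical labeled tokens, invented for illustration (not real OSM records).
training_tokens = [
    ("Kalverstraat", "road"),
    ("Damstraat", "road"),
    ("Leidsestraat", "road"),
    ("Amsterdam", "city"),
    ("Rotterdam", "city"),
]


def learn_suffix_counts(tokens, suffix_len=6):
    """Count how often each word suffix co-occurs with each label."""
    counts = defaultdict(Counter)
    for word, label in tokens:
        counts[word[-suffix_len:].lower()][label] += 1
    return counts


def guess_label(counts, word, suffix_len=6):
    """Return the most frequent label for the word's suffix, or None if unseen."""
    seen = counts.get(word[-suffix_len:].lower())
    return seen.most_common(1)[0][0] if seen else None


model = learn_suffix_counts(training_tokens)
```

With this toy data, `guess_label(model, "Prinsenstraat")` yields `"road"` even though that street never appeared in training, and `model` contains only the suffixes `straat` and `terdam` with their counts.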

We've also released the training data itself, which is under ODBL in the case of the OSM data, for machine learning researchers to use. Republishing the training data or some derivative database from it triggers ODBL share-alike. Building a model from it does not.

The comment I've made in several issues (there's only one primary developer for this project, along with some occasional contributors) is that if people train custom versions of libpostal's model on their own proprietary data and the resulting parser is not as accurate as they'd hoped, then a) I cannot diagnose what went wrong because the data's not available to look at without signing an NDA (bleh...), b) it would not be reasonable to expect my free labor in cleaning up everyone's proprietary address data set, and c) even if I were interested in doing it for money, most of our users would be unable to afford the consulting rate of a New York data scientist (which is why open source is great from the user perspective - many more people have access to this work than the relatively small group of companies who could pay to have something similar developed internally, which would never see the light of day).

This is why I suggest that if people want to train a custom model from libpostal, they should make sure at least one person on their team has at least an intermediate understanding of NLP and sequence models, or is willing to invest the time in learning.

That's all to protect my own scarce time, not because of licensing/ODBL concerns.

antimirov commented 6 years ago

Thank you for this comprehensive response!

Erbond12 commented 1 day ago

I'm not sure if this fits here or will even be seen by the right person, but I thought it goes with the topic.

Does libpostal collect any data on each parse/expand call? Sadly, I am not fluent enough in C to check for myself whether there is any code that does this, and I could not find anything about this topic.

This would be good to know so that we can write our legal documentation accordingly. Thank you!