We are introducing inferred addresses to the OA dataset.
Read this blog post for a high level introduction to the subject of address inference.
Inferred addresses will be published as any other address, but they will be marked as such in the address' 'metadata'. The provenance will also show relevant information specific to the inference process.
The first, basic inference algorithm we will use is "multiple source, single edition". It's "multiple source" because it can use addresses from different sources, e.g. Companies House data, the OA website and Sorting Office. It's "single edition" because it refers only to the latest edition of each source, e.g. for the time being we won't use older editions of Companies House data, only the latest.
The algorithm is based on the following assumptions:
And this is the pseudocode built on the above to infer addresses:
for any 2 or more individual numeric PAOs or SAOs in the same street, town and postcode of non-inferred addresses
    strip away the non-numeric parts from the PAOs and SAOs (e.g. 7A becomes 7)
    if all PAOs and SAOs are odd then
        infer all addresses corresponding to the missing odd numbers between the min and max known PAOs or SAOs, excluding both boundaries (e.g. 7 itself is not inferred) and the known house numbers
    else if all PAOs and SAOs are even then
        infer all addresses corresponding to the missing even numbers between the min and max known PAOs or SAOs, excluding both boundaries and the known house numbers
    else
        infer all addresses corresponding to all missing numbers between the min and max known PAOs or SAOs, excluding both boundaries and the known house numbers
    end if
end for
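The pseudocode above can be sketched in Ruby roughly as follows. This is an illustrative sketch, not the actual jess implementation: the hash keys and the `infer_for_group` name are assumptions, and the function handles one street/town/postcode group at a time.

```ruby
# Sketch of the "multiple source, single edition" inference step for one
# group of non-inferred addresses sharing street, town and postcode.
# `addresses` is assumed to be an array of hashes like
# { paon: "7A", street: "HIGH STREET", town: "SOMEWHERETOWN", postcode: "AB1 2CD" }.
def infer_for_group(addresses)
  # Strip the non-numeric parts from the PAONs (e.g. "7A" becomes 7).
  numbers = addresses.map { |a| a[:paon].to_s[/\d+/] }.compact.map(&:to_i).uniq
  return [] if numbers.size < 2

  # If all known numbers are odd (or all even), stay on that side of the
  # street by stepping in twos; otherwise infer every missing number.
  step = (numbers.all?(&:odd?) || numbers.all?(&:even?)) ? 2 : 1

  template = addresses.first
  # Infer every missing number strictly between the known min and max,
  # excluding both boundaries and the known house numbers.
  ((numbers.min + step)...numbers.max).step(step)
    .reject { |n| numbers.include?(n) }
    .map do |n|
      { saon: nil, paon: n, street: template[:street],
        town: template[:town], postcode: template[:postcode] }
    end
end
```

For example, given known PAONs 123 and 131 on the same street and postcode, this yields 125, 127 and 129, matching the High Street example discussed below.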
How do we manage localities? If a locality is specified for two or more of the addresses used to generate inferred addresses, then all inferred addresses belong to that locality, too. Otherwise, the locality field is left empty in the inferred addresses.
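Under one reading of that rule, the locality decision is a small helper like the sketch below (the name and hash shape are assumptions, not the real code):

```ruby
# Returns the locality to assign to inferred addresses: propagate a
# locality only when at least two source addresses specify it and no
# source address names a different one; otherwise leave it empty (nil).
# Illustrative sketch only.
def inferred_locality(source_addresses)
  localities = source_addresses.map { |a| a[:locality] }.compact
  localities.size >= 2 && localities.uniq.size == 1 ? localities.first : nil
end
```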
Inferred addresses are conceptually re-generated at every distillation (this is not an implementation specification!): they are the "result of that distillation", but their URIs should be persistent. In other words, the rationale for inferring address x should be re-verified at every distillation.
If, during a later distillation, a non-inferred address from any of our sources matches a previously inferred address, it should "take its place" (its URI).
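That URI-persistence rule can be sketched as a simple lookup keyed on the address fields; everything here (the function name, the key shape, the `mint_uri` callable) is hypothetical, not the actual distiller:

```ruby
# Sketch of URI persistence across distillations: a newly distilled
# address reuses the URI previously published for the same address,
# whether the earlier record was inferred or sourced; otherwise a new
# URI is minted. `previous_uris` maps an address key to its URI.
def assign_uri(address, previous_uris, mint_uri)
  key = [address[:paon], address[:street], address[:town], address[:postcode]]
  previous_uris[key] || mint_uri.call
end
```

This is how a non-inferred source address "takes the place" of a previously inferred one: it hits the same key, so it inherits the same URI.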
Tested vs the ~1m addresses in the 2014-12-10 edition of the OA dataset, this algorithm infers ~3.5m new addresses.
Note that, as a matter of principle, we're not inferring addresses from other inferred addresses at this stage.
Provenance should point at the inference code that was used and at the addresses that were used as input specifically to infer that address. Of course, provenance should point at the addresses' URIs, not their values. For the same reason, it is useful to keep the inference code very readable and as "self-contained" as possible.
Ideally, the software module running inference as part of the distillation process should be strongly modular, allowing non-OA contributors to easily plug in alternative or additional inference algorithms.
Just to be clear, are we running inference across ALL of the data, or are we only running it across newly added data? If it's newly added data, I'm assuming the flow goes something like this:
If we get ANOTHER address on that road, then I'm guessing we:
Is that right?
Also, as per our discussion about making this into a web service, I was thinking the best solution might be a web service that accepts pre split addresses as JSON, for example:
curl -H "Content-Type: application/json" -d '{"address":{"saon": null, "paon": 123, "street": "High Street", "locality": null, "town": "Somewheretown", "postcode": "ABC 123"}}' http://inference-bot.openaddressesuk.org
If, for example, we find 131 High Street, the service will then return the following:
{
  "addresses": [
    {
      "saon": null,
      "paon": 125,
      "street": "High Street",
      "locality": null,
      "town": "Somewheretown",
      "postcode": "ABC 123"
    },
    {
      "saon": null,
      "paon": 127,
      "street": "High Street",
      "locality": null,
      "town": "Somewheretown",
      "postcode": "ABC 123"
    },
    {
      "saon": null,
      "paon": 129,
      "street": "High Street",
      "locality": null,
      "town": "Somewheretown",
      "postcode": "ABC 123"
    }
  ]
}
Does that sound sensible?
Oh, hang on, you were talking about the API for the inference module; I was talking about the one for our service. I was imagining that the inference module would work independently, polling our service and sending addresses back to us. But you could do it the other way round, as a service that we ping every time there's a change, as you suggest.
I think in that case we should supply the data about all addresses in the street containing the changed address to the inference service.
(We can imagine that, in future, services would be able to subscribe to different types of web hooks, e.g. to get whole localities at a time.)
Cool, I take the point about streets and postcodes. I think you're right, the addresses we use for inference must definitely have the same postcode.
We probably need to talk more about this during standup, but I don't think we'll be in a position to start on this today anyway.
Also, just had a thought. Could we have an inference bot running as an ETL on Turbot?
You mean turbot running a bot that does the inference job? I thought that was the plan. At one point anyway.
It may well have been :smile: In that case my brilliantly original idea was accidentally stolen from the past.
Postponed to sprint >=42, unfortunately. FYI @peterkwells.
So, I have http://jess.openaddressesuk.org/. At the moment, this infers addresses when given a JSON representation of an address. The next thing to do is to return provenance when the URL of an address is posted. I'm thinking that if we post address=https://alpha.openaddressesuk.org/addresses/rYwoGk
for example, we get back the same response as with the JSON, but with a provenance section like so:
{
  "inferred": {
    ... INFERRED ADDRESSES GO HERE ...
  },
  "existing": {
    ... EXISTING ADDRESSES GO HERE ...
  },
  "provenance": {
    "activity": {
      "executed_at": "2015-01-21T16:18:32+00:00",
      "processing_scripts": "https://github.com/OpenAddressesUK/jess",
      "derived_from": [
        {
          "type": "inference",
          "inferred_from": [
            "https://alpha.openaddressesuk.org/addresses/rYwoGk",
            ... URLs of existing addresses go here ...
          ],
          "inferred_at": "2015-01-21T16:18:32+00:00",
          "processing_script": "https://github.com/OpenAddressesUK/jess/blob/5d954baa0b91ed25c42fb060ad659ce68cdd2e45/lib/jess.rb"
        }
      ]
    }
  }
}
As discussed today, the "time decay" component of the heuristic adjustments to the score will temporarily use, as the "age" of the data, the time elapsed since the moment of ingestion. Later, it will be as described here.
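As a rough illustration of that temporary behaviour, a decay weight over the time since ingestion might look like the sketch below. The exponential shape and the half-life value are made-up assumptions for illustration, not the real scorer's parameters:

```ruby
# Illustrative time-decay weight for the confidence heuristic:
# exponential decay over the data's age, temporarily measured from the
# moment of ingestion rather than the data's true age.
# half_life_days is a hypothetical tuning parameter.
def time_decay(ingested_at, now: Time.now, half_life_days: 365.0)
  age_days = (now - ingested_at) / 86_400.0  # Time subtraction yields seconds
  0.5**(age_days / half_life_days)
end
```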
We can now infer from addresses in the database; just POST a token like so:
curl --data "token=rYwoGk" http://jess.openaddressesuk.org/infer
And you'll get back a blob of JSON like so:
{
"addresses": {
"inferred": [
{
"saon": null,
"paon": 6,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 7,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 8,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 9,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 10,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 11,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 12,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 13,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 14,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 15,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 16,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 17,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 18,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 19,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 20,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 21,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 22,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 23,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 24,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 25,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 26,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 27,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 28,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
},
{
"saon": null,
"paon": 29,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE"
}
],
"existing": [
{
"saon": null,
"paon": 30,
"street": "FOGRALEA",
"locality": null,
"town": "SHETLAND",
"postcode": "ZE1 0SE",
"url": "http://alpha.openaddressesuk.org/address/1pyXsJ"
}
]
},
"provenance": {
"activity": {
"executed_at": "2015-02-27T16:50:30.330+00:00",
"processing_scripts": "https://github.com/OpenAddressesUK/jess",
"derived_from": [
{
"type": "inference",
"inferred_from": [
"http://alpha.openaddressesuk.org/address/VDtibW",
"http://alpha.openaddressesuk.org/address/1pyXsJ"
],
"inferred_at": "2015-02-27T16:50:30.330+00:00",
"processing_script": "https://github.com/OpenAddressesUK/jess/blob/535847cbe3cd9212a7743e0cf879f20faa50a114/lib/jess.rb"
}
]
}
}
}
Note that this only infers "upwards", as that was what was in @giacecco's original specs, but it can easily infer both ways; just delete https://github.com/OpenAddressesUK/jess/blob/master/lib/jess.rb#L28
I don't get it: where does the token come from?
I had suggested that @pezholio rename the argument to something like "seed", but yes, it is not self-explanatory. It is an address identifier, as in https://alpha.openaddressesuk.org/addresses/rYwoGk . It's a "seed" in the sense that, by referring to an address, you're telling the inference engine to investigate its street and postcode combination.
In https://github.com/OpenAddressesUK/jess/pull/1 I've changed provenance for token lookups to use the ernest URLs, because that's what it actually derives from, and otherwise we end up with theodolite URLs in the core DB, which we don't want.
A next step on this is to move the inference querying to use the ernest DB directly instead of the published address file, but that needs confidence measures to be there and probably used in distiller.
Currently working on the hello-kitty bot, which will use jess to load inferred addresses into ernest. However, I'm having an issue with the updated_since queries on theodolite, as the index doesn't seem to be right.
@james, re: https://github.com/theodi/shared/issues/504#issuecomment-77171002 , I get your point, you are right.
Just to be safe on the semantics: do ernest URLs today represent addresses "as they were ingested" rather than "as they were distilled"?
For consistency, though, why don't I find a reference to the ernest URL in the distilled URL's provenance? E.g. https://alpha.openaddressesuk.org/addresses/UAdD1o.json does not reference http://ernest.openaddressesuk.org/addresses/2935011 ?
What's the blocker on this?
As far as I can tell, the hello-kitty bot is ready to roll. Just need to let it run.
For OpenAddressesUK/roadmap#32. Will need to queue up based on ETL events.