theodi / shared

Repo that we use for non-repo-specific stories and other shared stuff.

Add inference to distillation process #504

Closed Floppy closed 9 years ago

Floppy commented 9 years ago

For OpenAddressesUK/roadmap#32. Will need to queue up based on ETL events.

giacecco commented 9 years ago

We are introducing inferred addresses to the OA dataset.

Read this blog post for a high level introduction to the subject of address inference.

Inferred addresses will be published as any other address, but they will be marked as such in the address' 'metadata'. The provenance will also show relevant information specific to the inference process.

The first basic inference algorithm we will use is "multiple source, single edition". It's "multiple source" because it can use addresses drawn from different sources, e.g. Companies House's data, the OA website and Sorting Office. It's "single edition" because it uses only the latest edition of each source, e.g. for the time being we won't use different editions of Companies House's data, only the latest.

The algorithm is based on the following assumptions:

  1. If two addresses belong to the same street, town and postcode and have a house number, all other house numbers between those addresses' house numbers belong to the same street, town and postcode, too.
  2. If all known house numbers in the same street, town and postcode are even, it is likely that most of the missing even numbers between the min and max of the known house numbers exist, too.
  3. As for (2) but for odd numbers.
  4. If both odd and even numbers can be found within the known house numbers in the same street, town and postcode, it is likely that most of the missing numbers between the min and max of the known house numbers exist, too.

And this is the pseudocode built on the above to infer addresses:

for any 2 or more individual numeric PAOs or SAOs in the same street, town and postcode of non-inferred addresses
    strip away the non-numeric parts from the PAOs and SAOs (e.g. 7A becomes 7)
    if all PAOs and SAOs are odd then
        infer all addresses corresponding to all missing odd PAOs included within the min and max known PAOs or SAOs, excluding both boundaries (e.g. 7 is not being inferred) and the known house numbers
    else if all PAOs and SAOs are even then
        infer all addresses corresponding to all missing even PAOs included within the min and max known PAOs or SAOs, excluding both boundaries and the known house numbers
    else 
        infer all addresses corresponding to all missing PAOs included within the min and max known PAOs and SAOs, excluding both boundaries and the known house numbers
    end if
end for
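The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the code that ended up in the distiller: the function names `numeric_part` and `infer_house_numbers` are made up for the example, the input is assumed to be the PAO/SAO strings already grouped by street, town and postcode, and "2 or more" is read as "two or more distinct numbers".

```python
import re
from typing import Iterable, List, Optional


def numeric_part(pao: str) -> Optional[int]:
    """Strip the non-numeric part of a house number, e.g. '7A' becomes 7."""
    match = re.match(r"\s*(\d+)", pao)
    return int(match.group(1)) if match else None


def infer_house_numbers(known: Iterable[str]) -> List[int]:
    """Infer missing house numbers for ONE street/town/postcode group.

    `known` holds the PAOs/SAOs of non-inferred addresses only.
    """
    numbers = {n for n in map(numeric_part, known) if n is not None}
    if len(numbers) < 2:
        return []  # assumption: two distinct numbers are needed to infer anything
    low, high = min(numbers), max(numbers)
    if all(n % 2 == 1 for n in numbers):
        step = 2  # all odd: infer the missing odd numbers only
    elif all(n % 2 == 0 for n in numbers):
        step = 2  # all even: infer the missing even numbers only
    else:
        step = 1  # mixed parity: infer every missing number
    # Both boundaries and the known numbers themselves are excluded.
    return [n for n in range(low + step, high, step) if n not in numbers]
```

For instance, the known numbers 7A and 13 on the same street would yield 9 and 11, while 5 and 30 would yield every number from 6 to 29.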

E.g.

How to manage localities? If a locality is specified for two or more addresses used to generate inferred addresses, then all inferred addresses belong to that locality, too. Alternatively, the locality field is left empty in the inferred addresses.
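One possible reading of the locality rule, as a sketch. The function name `inferred_locality` is hypothetical, and the handling of source addresses that disagree on the locality (leave it empty) is an assumption, since the rule above does not cover that case.

```python
from collections import Counter
from typing import Iterable, Optional


def inferred_locality(source_localities: Iterable[Optional[str]]) -> Optional[str]:
    """Pick the locality for addresses inferred from the given source addresses."""
    counts = Counter(loc for loc in source_localities if loc)
    # Assumption: if the sources name more than one locality, stay on the safe side.
    if len(counts) != 1:
        return None
    locality, occurrences = next(iter(counts.items()))
    # The locality is propagated only when two or more sources specify it.
    return locality if occurrences >= 2 else None
```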

Inferred addresses are conceptually re-generated at every distillation (this is not an implementation specification!): they are the "result of that distillation", but their URIs should be persistent. In other words, the rationale for inferring address x should be verified at every distillation.

If, during a later distillation, a non-inferred address from any of our sources matches a previously inferred address, it should "take its place" (its URI).
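The two requirements above (persistent URIs, and a sourced address taking over the URI of a previously inferred one) could be met with a small registry keyed on the address itself. This is only a sketch: `assign_uris`, the `AddressKey` tuple and the `mint_uri` callback are names invented for the example, and how URIs are really minted is outside its scope.

```python
from typing import Callable, Dict, Iterable, Tuple

# (house number, street, town, postcode); an assumed normalised form
AddressKey = Tuple[int, str, str, str]


def assign_uris(
    registry: Dict[AddressKey, str],
    sourced: Iterable[AddressKey],
    inferred: Iterable[AddressKey],
    mint_uri: Callable[[AddressKey], str],
) -> Dict[AddressKey, dict]:
    """Build the result of one distillation, reusing URIs from earlier runs."""
    sourced_keys = set(sourced)
    result = {}
    for key in sourced_keys | set(inferred):
        if key not in registry:
            registry[key] = mint_uri(key)  # first time this address is seen
        result[key] = {
            "uri": registry[key],
            # A sourced address wins over an inferred one but keeps the same URI.
            "inferred": key not in sourced_keys,
        }
    return result
```

An inferred address whose rationale no longer holds simply does not appear in the next result, while its URI stays reserved in the registry.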

Tested vs the ~1m addresses in the 2014-12-10 edition of the OA dataset, this algorithm infers ~3.5m new addresses.

Note that, as a matter of principle, we're not inferring addresses from other inferred addresses at this stage.

Provenance should point at the inference code that was used and the addresses that were used as an input specifically to infer that address. Of course, provenance should point at the addresses' URI, not their value. Also because of this, it is useful to keep the inference code very readable and as "self-contained" as possible.

Ideally, the software module running inference as part of the distillation process should be strongly modular, allowing non-OA contributors to easily plug in alternative or additional inference algorithms.
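As an illustration of what "pluggable" could look like, here is a sketch of a tiny algorithm registry. Everything in it (`ALGORITHMS`, `register`, `run_all`, the `fill-gaps` example) is hypothetical and only meant to show the shape of the idea: each algorithm is a plain function from known house numbers to inferred ones, registered under a name that provenance could later point to.

```python
from typing import Callable, Dict, List

InferenceAlgorithm = Callable[[List[int]], List[int]]

# Name -> algorithm; contributors add entries with the decorator below.
ALGORITHMS: Dict[str, InferenceAlgorithm] = {}


def register(name: str) -> Callable[[InferenceAlgorithm], InferenceAlgorithm]:
    """Decorator that plugs an inference algorithm into the registry."""
    def decorator(func: InferenceAlgorithm) -> InferenceAlgorithm:
        ALGORITHMS[name] = func
        return func
    return decorator


@register("fill-gaps")
def fill_gaps(known: List[int]) -> List[int]:
    """Toy algorithm: every missing number strictly between min and max."""
    if len(known) < 2:
        return []
    return [n for n in range(min(known) + 1, max(known)) if n not in known]


def run_all(known: List[int]) -> Dict[str, List[int]]:
    """Run every registered algorithm and keep results separate per algorithm."""
    return {name: algorithm(known) for name, algorithm in ALGORITHMS.items()}
```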

pezholio commented 9 years ago

Just to be clear, are we running inference across ALL of the data, or are we only running it across newly added data? If it's newly added data, I'm assuming the flow goes something like this:

If we get ANOTHER address on that road, then I'm guessing we:

Is that right?

pezholio commented 9 years ago

Also, as per our discussion about making this into a web service, I was thinking the best solution might be a web service that accepts pre split addresses as JSON, for example:

curl -H "Content-Type: application/json" -d '{"address":{"saon": null, "paon": 123, "street": "High Street", "locality": null, "town":"Somewheretown", "postcode":"ABC 123"}}' http://inference-bot.openaddressesuk.org

If, for example, we find 131 High Street, the service will then return the following:

{
  "addresses": [
    {
      "saon": null,
      "paon": 125,
      "street": "High Street",
      "locality": null,
      "town": "Somewheretown",
      "postcode": "ABC 123"
    },
    {
      "saon": null,
      "paon": 127,
      "street": "High Street",
      "locality": null,
      "town": "Somewheretown",
      "postcode": "ABC 123"
    },
    {
      "saon": null,
      "paon": 129,
      "street": "High Street",
      "locality": null,
      "town": "Somewheretown",
      "postcode": "ABC 123"
    }
  ]
}

Does that sound sensible?
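For reference, the same request could be built from Python with only the standard library. This is a sketch that constructs the request without sending it; `build_request` is a made-up helper, and the URL is simply the one from the curl example above. Note that an empty field is `None` in Python and is serialised as `null` in JSON.

```python
import json
import urllib.request

INFERENCE_URL = "http://inference-bot.openaddressesuk.org"  # taken from the curl example


def build_request(address: dict) -> urllib.request.Request:
    """Wrap a pre-split address in the proposed {"address": {...}} envelope."""
    body = json.dumps({"address": address}).encode("utf-8")
    return urllib.request.Request(
        INFERENCE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request({
    "saon": None,
    "paon": 123,
    "street": "High Street",
    "locality": None,
    "town": "Somewheretown",
    "postcode": "ABC 123",
})
# urllib.request.urlopen(request) would actually send it.
```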

JeniT commented 9 years ago

  1. I think most inference will run across the whole dataset initially, but then need to be triggered by new entries as they come in. We already have the ability to download the whole dataset, so the addition here would be something that provides a feed of new/added/changed addresses.
  2. It would make sense for most inference to happen on a street by street basis, so providing methods to a) get all known streets (including ones without addresses we know about) and b) list known addresses on that street would support both types.
  3. I don't think we should require structured addresses for this API: I think we want to be in control of address parsing and structure and the structured version would limit us.

JeniT commented 9 years ago

Oh hang on, you were talking about the API for the inference module; I was talking about for our service. I was imagining that the inference module would work independently, polling our service and sending addresses back to us. But you could do it the other way round, as a service that we ping every time there's a change, as you suggest.

I think in that case we should supply the data about all addresses in the street containing the changed address to the inference service.

(We can imagine in future that services would be able to subscribe to different types of web hooks eg to get whole localities at a time.)
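A sketch of how the set of addresses to send along with a change might be selected. The names `street_key` and `addresses_for_change` are hypothetical, and grouping on street, town and postcode together (rather than street alone) is an assumption borrowed from the inference spec earlier in this thread.

```python
from typing import Dict, List

Address = Dict[str, object]


def street_key(address: Address) -> tuple:
    """Identify the group an address belongs to: street, town and postcode."""
    return (address["street"], address["town"], address["postcode"])


def addresses_for_change(changed: Address, known: List[Address]) -> List[Address]:
    """Return every known address in the same group as the changed address."""
    key = street_key(changed)
    return [address for address in known if street_key(address) == key]
```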

giacecco commented 9 years ago
pezholio commented 9 years ago

Cool, I take the point about streets and postcodes. I think you're right, the addresses we use for inference must definitely have the same postcode.

We probably need to talk more about this during standup, but I don't think we'll be in a position to start on this today anyway.

pezholio commented 9 years ago

Also, just had a thought. Could we have an inference bot running as an ETL on Turbot?

Floppy commented 9 years ago

You mean turbot running a bot that does the inference job? I thought that was the plan. At one point anyway.

pezholio commented 9 years ago

It may well have been :smile: In that case my brilliantly original idea was accidentally stolen from the past.

giacecco commented 9 years ago

Postponed to sprint >=42, unfortunately. FYI @peterkwells.

pezholio commented 9 years ago

So, I have http://jess.openaddressesuk.org/. At the moment, this infers addresses when given a JSON representation of an address. The next thing to do is to return provenance when the URL of an address is posted. I'm thinking that if we post address=https://alpha.openaddressesuk.org/addresses/rYwoGk, for example, we get back the same response as with the JSON, but with a provenance section like so:

{
  "inferred": {
      ...  INFERRED ADDRESSES GO HERE ...
  },
  "existing": {
      ... EXISTING ADDRESSES GO HERE ...
  },
  "provenance": {
    "activity": {
      "executed_at": "2015-01-21T16:18:32+00:00",
      "processing_scripts": "https://github.com/OpenAddressesUK/jess",
      "derived_from": [
        {
          "type": "inference",
          "inferred_from": [
            "https://alpha.openaddressesuk.org/addresses/rYwoGk",
            ... URLs of existing addresses go here ...
          ],
          "inferred_at": "2015-01-21T16:18:32+00:00",
          "processing_script": "https://github.com/OpenAddressesUK/jess/blob/5d954baa0b91ed25c42fb060ad659ce68cdd2e45/lib/jess.rb"
        }
      ]
    }
  }
}
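To make the shape of the provenance section concrete, here is a sketch of a helper that assembles it. `build_provenance` is an invented name, and the field names are copied from the proposal above rather than from any published schema.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional

JESS_REPO = "https://github.com/OpenAddressesUK/jess"  # as in the proposal above


def build_provenance(
    source_urls: Iterable[str],
    script_url: str,
    now: Optional[datetime] = None,
) -> dict:
    """Assemble the "provenance" section for one inference run."""
    moment = now or datetime.now(timezone.utc)
    timestamp = moment.isoformat(timespec="seconds")
    return {
        "activity": {
            "executed_at": timestamp,
            "processing_scripts": JESS_REPO,
            "derived_from": [
                {
                    "type": "inference",
                    # URLs of the addresses used as input, never their values
                    "inferred_from": list(source_urls),
                    "inferred_at": timestamp,
                    # ideally a link pinned to a commit, so the code stays traceable
                    "processing_script": script_url,
                }
            ],
        }
    }
```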

giacecco commented 9 years ago

As discussed today, the "time decay" component of the heuristic adjustments to the score will temporarily use as "age" of the data the time passed from the moment of ingestion. Later, it will be as described here.

pezholio commented 9 years ago

We can now infer from addresses in the database, just POST a token like so:

curl --data "token=rYwoGk" http://jess.openaddressesuk.org/infer

And you'll get back a blob of JSON like so:

{
  "addresses": {
    "inferred": [
      {
        "saon": null,
        "paon": 6,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 7,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 8,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 9,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 10,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 11,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 12,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 13,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 14,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 15,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 16,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 17,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 18,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 19,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 20,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 21,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 22,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 23,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 24,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 25,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 26,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 27,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 28,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      },
      {
        "saon": null,
        "paon": 29,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE"
      }
    ],
    "existing": [
      {
        "saon": null,
        "paon": 30,
        "street": "FOGRALEA",
        "locality": null,
        "town": "SHETLAND",
        "postcode": "ZE1 0SE",
        "url": "http://alpha.openaddressesuk.org/address/1pyXsJ"
      }
    ]
  },
  "provenance": {
    "activity": {
      "executed_at": "2015-02-27T16:50:30.330+00:00",
      "processing_scripts": "https://github.com/OpenAddressesUK/jess",
      "derived_from": [
        {
          "type": "inference",
          "inferred_from": [
            "http://alpha.openaddressesuk.org/address/VDtibW",
            "http://alpha.openaddressesuk.org/address/1pyXsJ"
          ],
          "inferred_at": "2015-02-27T16:50:30.330+00:00",
          "processing_script": "https://github.com/OpenAddressesUK/jess/blob/535847cbe3cd9212a7743e0cf879f20faa50a114/lib/jess.rb"
        }
      ]
    }
  }
}

Note that this only infers 'upwards', as this was what was in @giacecco's original specs, but it can easily infer both ways: just delete the line at https://github.com/OpenAddressesUK/jess/blob/master/lib/jess.rb#L28

JeniT commented 9 years ago

I don't get it: where does the token come from?

giacecco commented 9 years ago

I had suggested to @pezholio that the argument be renamed to something like "seed", but yes, it is not self-explanatory. It is an address identifier, as in https://alpha.openaddressesuk.org/addresses/rYwoGk . It's a seed because, by referring to an address, you're telling the inference engine to investigate its street and postcode combination.
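To spell out where the token comes from, here is a sketch that pulls the identifier out of an address URL and builds the same form body that `curl --data` sends. `seed_from_url` and `form_body` are made-up helper names, and treating a trailing `.json` as a format suffix is an assumption.

```python
from urllib.parse import urlencode, urlparse


def seed_from_url(address_url: str) -> str:
    """Extract the address identifier (the "token" or seed) from its URL."""
    last_segment = urlparse(address_url).path.rstrip("/").split("/")[-1]
    # Assumption: ".json" is a format suffix, not part of the identifier.
    if last_segment.endswith(".json"):
        last_segment = last_segment[: -len(".json")]
    return last_segment


def form_body(seed: str) -> bytes:
    """Build the form-encoded body, equivalent to curl --data "token=<seed>"."""
    return urlencode({"token": seed}).encode("ascii")
```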

Floppy commented 9 years ago

In https://github.com/OpenAddressesUK/jess/pull/1 I've changed provenance for token lookups to use the ernest URLs, because that's what it actually derives from, and otherwise we end up with theodolite URLs in the core DB, which we don't want.

A next step on this is to move the inference querying to use the ernest DB directly instead of the published address file, but that needs confidence measures to be there and probably used in distiller.

Floppy commented 9 years ago

Currently working on the hello-kitty bot which will use jess to load inferred addresses into ernest. However, having an issue with the updated_since queries on theodolite as the index doesn't seem to be right.

giacecco commented 9 years ago

@james, re: https://github.com/theodi/shared/issues/504#issuecomment-77171002 , I get your point, you are right.

Just to be safe on the semantics: do ernest URLs today represent addresses "as they were ingested" rather than "as they were distilled"?

For consistency, though, why don't I find a reference to the ernest URL in the distilled URL's provenance? E.g. https://alpha.openaddressesuk.org/addresses/UAdD1o.json is not referencing http://ernest.openaddressesuk.org/addresses/2935011 ?

pezholio commented 9 years ago

What's the blocker on this?

Floppy commented 9 years ago

As far as I can tell, the hello-kitty bot is ready to roll. Just need to let it run.