pelias / openaddresses

Pelias import pipeline for OpenAddresses.
MIT License
51 stars 43 forks source link

addendum: add AU G-NAF PID values to addendum where available #490

Closed missinglink closed 2 years ago

missinglink commented 2 years ago

@orangejulius as discussed yesterday, this is a basic implementation which would allow us to record G-NAF PIDs

missinglink commented 2 years ago

updated regex via rebase:

diff --git a/lib/streams/documentStream.js b/lib/streams/documentStream.js
index 8c0229a..70050ee 100644
--- a/lib/streams/documentStream.js
+++ b/lib/streams/documentStream.js
@@ -1,6 +1,8 @@
 const through = require( 'through2' );
 const peliasModel = require( 'pelias-model' );
-const GNAF_PID_PATTERN = /^(GA)(NSW|VIC|QLD|SA|WA|TAS|NT|ACT|OT)(\d+)$/; // example: GAACT718519668
+
+// examples: GAACT718519668, GASA_424005553
+const GNAF_PID_PATTERN = /^(GA)(NSW|VIC|QLD|SA_|WA_|TAS|NT_|ACT|OT_)([0-9]{9})$/;

 /*
  * Create a stream of Documents from valid, cleaned CSV records
orangejulius commented 2 years ago

Cool, this is going to be really good.

I took a look at how many records are taking advantage of "index time deduping" (records that have the same calculated hash and thus end up as only one document in Elasticsearch) by running a build for all of Australia.

It looks like the value is pretty low: using all the statewide and the countrywide CSV files from OpenAddresses, the total number of addresses came to 27 million, but there were only 175,000 deleted documents (I think these essentially count the number of times an Elasticsearch document is updated, as we don't issue any actual delete calls while generating an index). The countrywide file itself, which comes from the G-NAF dataset, has about 15 million addresses. Here's what it looks like:

INDICES
health status index                              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   pelias-australia-2021.10.12-154316 k1aRtFfRTIWtaSBjPuN8xA   3   0   27369952       175286     10.8gb         10.8gb

With that in mind I do think it would be really cool to have the GNAF PID be used as the id component of the Pelias GID, since this would allow lookups via the /v1/place endpoint by GID.

orangejulius commented 2 years ago

Another thought. We might want to do something about query time deduping as well. Since it looks like a build with all the OpenAddresses data for Australia would only be about 50% G-NAF data, we'll probably need a way to ensure the API deduplication logic prefers G-NAF records.

I'm assuming that today address results for Australia today will end up being a relatively even split of G-NAF and non-G-NAF addresses, since I don't think there's anything overall that the API query time deduplication would use to prefer one data source or the other.

So maybe we keep the addendum concordance info and then allow the query time deduplication to consider that field?

missinglink commented 2 years ago

I'm currently :-1: on using the G-NAF PID as the GID for several reasons which I'm happy to discuss on another thread dedicated to discussing that idea of using upstream source IDs in place of amalgamation IDs.

This functionality (specific to recording the concordance) should be able to get merged independently of that discussion?

missinglink commented 2 years ago

Hmm.. actually this PR should probably be accompanied by some unit tests.

Ideally we would break out this functionality into a dedicated 'addendum mapper' as per other repos which makes it easier to test.

missinglink commented 2 years ago

added unit tests for this feature.