openaddresses / batch

OpenAddresses/Machine based AWS Batch based ETL Processing
https://batch.openaddresses.io/
MIT License

Document hash field #341

Open pnoll1 opened 1 year ago

pnoll1 commented 1 year ago

Is your feature request related to a problem? Please describe.
The hash field is critical to data users, but the only documentation is in a 5-year-old issue thread and does not appear to be completely accurate: the hash doesn't stay the same between runs even when the content does.

https://github.com/openaddresses/machine/issues/683 says "The hash value is calculated as a content hash, and it can be used to determine that two addresses are identical between different runs of a single source."

https://github.com/pelias/openaddresses/pull/442 says "It turns out that the existing HASH column generated by the OA team is seeded with a random number, so even if the underlying data remains the same, the hash value will change with each rebuild of the OA file."
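
To make the difference between the two descriptions concrete, here is a minimal sketch. It is not the actual OpenAddresses implementation: the function names, the choice of SHA-1, the 16-character truncation, and the sample row are all assumptions made for illustration. It only shows why a pure content hash is stable across runs while a hash mixed with a per-run random seed is not.

```python
import hashlib
import json
import secrets


def _serialize(row: dict) -> bytes:
    # Sort keys so that field order does not affect the digest.
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def content_hash(row: dict) -> str:
    """Hypothetical content hash: depends only on the row's fields."""
    return hashlib.sha1(_serialize(row)).hexdigest()[:16]


def seeded_hash(row: dict, seed: bytes) -> str:
    """Hypothetical seeded hash: the seed is mixed in ahead of the content."""
    return hashlib.sha1(seed + _serialize(row)).hexdigest()[:16]


# Illustrative row; the field names are assumptions, not the OA schema.
row = {"number": "123", "street": "Main St", "city": "Springfield"}

# Two "runs" over identical data, each drawing a fresh random seed.
run1_seed = secrets.token_bytes(8)
run2_seed = secrets.token_bytes(8)

print(content_hash(row) == content_hash(row))                      # same content, same hash
print(seeded_hash(row, run1_seed) == seeded_hash(row, run2_seed))  # differs unless the seeds collide
```

If the hash behaves like `seeded_hash`, it can only identify duplicate rows within a single run; comparing addresses between runs would need something like `content_hash`.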

Describe the solution you'd like
A description of how the hash is created, its gotchas, and an example use case.

Describe alternatives you've considered