sharedstreets / sharedstreets-ref-system

Making maps connectable: stable, non-proprietary IDs and data standards for streets
MIT License
186 stars 17 forks source link

reversibility of MD5 hashes in SharedStreets IDs #25

Closed kpwebb closed 5 years ago

kpwebb commented 5 years ago

A city has raised concerns about privacy issues (see quoted excerpt below) that they view as inherent to the use of the SharedStreets referencing system. The critique questions whether SharedStreets reference data is cryptographically secure, and that SharedStreets referencing system “creates an illusion of privacy” due to how SharedStreets reference IDs are generated.

This misunderstands how the referencing system is used, and conflates the concept of the referencing system with the data aggregation processes we use to anonymize sensitive data, like citizen-generated trip information.

The referencing system is a way to refer to a street segment using a common, map-independent language. It builds from already-public map data, derived from OpenStreetMap or other public GIS data sources. The reference IDs and the data they’re derived from are intentionally public, and there is no attempt made by SharedStreets to encrypt or obfuscate underlying map data, by design.

The transparency of the referencing system makes our aggregation methodology possible. We group data using public street references to allow individual behavior to be combined to provide aggregate insight into travel behavior and use of street space. The reversibility of a street reference is intentional, and has no relationship to the reversibility of aggregated movement data.

Screen Shot 2019-06-19 at 2 45 57 PM Screen Shot 2019-06-19 at 2 59 33 PM

However, this critique raises an important, broader point about the philosophy behind SharedStreets. We don’t view cryptography or any other data security solution as an appropriate way of protecting sensitive citizen-generated data. Breaches happen, quite often in spite of employing state-of-the-art cryptographic methods. Further, cryptography does not protect data against intentional internal misuse, or disclosure as compelled by law enforcement.

To overcome these limits our work instead focuses on data minimization. By not keeping sensitive citizen-generated information, and translating data into useful, non-reversible aggregates the harm resulting from a data breach or compelled disclosure is significantly reduced. There’s no way you can lose control of data you don’t have. This approach is widely considered to be the best and most viable privacy strategy. The goal of our work is to help cities make the shift away from data security, toward data minimization as a privacy strategy.

In summary:

Two specific technical points raised are addressed here:

“First, MD5 hashing is considered "cryptographically broken and unsuitable for further use. Moreover, the US National Institute of Standards and Technology (“NIST”) never felt it was secure enough for adoption, with MD5 being considered "nearly broken" back in 1995 and fully broken by 2004.”

The referenced NIST recommendations are applicable to MD5 in the context of encryption. We are using MD5 for public street referencing and hashing existing, already-public data, not for the encryption of sensitive data.

A point not raised here that is relevant to the efficacy of using MD5 for SharedStreets reference IDs is the potential for hash collisions (i.e. two different streets generating identical IDs). MD5 generates a 128bit hash, which is smaller than more modern cryptographic hash methods. As a matter of probability the smaller the resulting hash the more likely that two different inputs can result in identical hashes (further, the weakness in the MD5 algorithm slightly increase the likelihood of a collision). However, that possibility is exceeding unlikely and would require many quadrillions of street segments before two accidentally overlapping IDs were generated.

We considered both hash collisions and cryptographic weakness carefully when designing the SharedStreets referencing system, and compared a variety hash algorithms. (In fact we’ve put quite a bit of research into hashing algorithms that can be more easily reversed, to further improve on the goal of building truly interoperable map data.) Given that cryptographic concerns don’t apply to our work and the very low risk of duplicate IDs, we found MD5 to be a appropriate and widely available option for this application.

“Second, rounding to 5 decimal places means that locations are rounded to about 1 meter grids.”

The decimal rounding used in the generation of SharedStreets References is a design feature. Latitude and longitude are rounded to generate stable IDs across small changes in underlying geometry and to overcome floating point precision issues. One meter resolution is more than sufficient to uniquely identify street segments and intersections.

Quoted excerpt from the memo we received:

“The SharedStreets algorithm for generating references consists of the following two activities: Create the string in the form of "Intersection XX.XXXXX YY.YYYYY" with latitude and longitude rounded to 5 decimal places;1 and Generate an MD5 hash of the string.

This algorithm approach creates an illusion of privacy, but has two significant technical issues.

First, MD5 hashing is considered "cryptographically broken and unsuitable for further use." Moreover, the US National Institute of Standards and Technology (“NIST”) never felt it was secure enough for adoption, with MD5 being considered "nearly broken" back in 1995 and fully broken by 2004.

Second, rounding to 5 decimal places means that locations are rounded to about 1 meter grids. The XXXXX metropolitan area has a total area of 1X,XXX km2 meaning there are about 12.5 billion unique, hashable 1 meter locations. Since the MD5 sum and SharedStreets reference ID are both 128 bits of data, if every single unique, hashable location of the XXXX metropolitan area was pre-computed, the total dataset would be less than 200GB, allowing for an easy brute-force attack even if a stronger hashing algorithm were used.“

migurski commented 5 years ago

Thanks for publishing this note and your response; I agree with your observation that cryptographic security is not a meaningful concern for ShSt features and references. We’re just looking for a deterministic hash with low probability of collision.

kpwebb commented 5 years ago

@migurski thanks! And as always I really appreciate your help thinking through these issues.

One thing not addressed above, future versions of SharedStreets referencing system can use different hashing methods. I'm still hoping that someone comes up with a localized hash algorithm that works on 2+ dimensional data. As we discussed early on in the design phase, we considered geohashes precisely because of their reversibility and sortability, but didn't think they offered any real value with highly-dimensional data like the References. The methods we're aware of work for intersections, but that's just a point. A reference is 2+ points and vector and can't be reduced to useful terms via a hash.

If folks know of better techniques we'll gladly update the referencing system to include them!

johnclary commented 4 years ago

yes, it's always been quite obvious from the project docs that MD5 is used merely to generate reference ID's.

thanks @kpwebb for the explainer on decimal rounding. i was wondering how minor edits to source data were handled and probably missed it in the docs.