koush / PushSms


Potential side information risks #2

Open pulser opened 11 years ago

pulser commented 11 years ago

I was reviewing the implementation so far, and while I'd agree with the other comment re: using OTR rather than hand-rolled RSA-2048, that's an aside.

This concern may not be valid or relevant depending on the use-case, but I will put it on the record anyway in case it's worth consideration, given the large and diverse userbase CM has.

Currently, this system makes it possible to determine, from a mobile number, who uses the "PushSMS" secure system. This is similar to existing PGP/GPG keyservers and other implementations keyed on email address.

My concern is that the presence of a user's number on the cmmessaging appspot server would act as side information revealing that the user uses such encrypted services. That may not be a safe admission to make in some territories.

As it stands, I can use PGP email without my keys being listed on any key-server, and therefore a government or other organisation (say an employer) cannot tell that I use PGP. If they wanted to determine that, they'd need to find my email address, and then snoop on each and every mail going in or out, to see what's happening. Sure, that is not as convenient for someone wanting to send me a secure message, but I can give them my (countersigned) key in person or through another means, or via a trusted third party, where the counter-signatures verify it's genuinely mine.

Imagine the situation in an oppressive nation where a user of this kind of encrypted service may face consequences simply through advertising their use of it. All telecoms providers know the phone numbers they have issued, and thus could compile a list of all users making use of this service, and hand it over to a totalitarian state or surveillance body, marking these users out as "trouble".

With a regular PGP/GPG/OTR style implementation, this isn't possible short of monitoring all traffic to find the encrypted traffic, and profiling it. This system though has a central location which is aware of the telephone number of every user making use of the service, as well as a token tying it to a Google account (from what I can see), which would be of considerable value to a surveillance operative.

Since this is hosted on appspot (a Google service), is this data being held securely? While it's only public keys being held there, I suggest many users may not be comfortable with the data being held by Google, particularly in light of the recent events which led to this situation (as stated in the original G+ post). Given the dependency on C2DM/GCM for transmission, perhaps another worthwhile project would be an API-compatible, open-source cloud-to-device implementation that doesn't rely on any external third parties.

A bit of "legal pressure" could easily see this centralised service disabled by the big G, if it was making strong encryption too readily available to people.

tl;dr: this is a great idea and badly needed. However, there may be concerns about the central storage of phone numbers and Google tokens, given Google's involvement in assisting surveillance: this could produce lists of users which, with the cooperation of telecoms providers (via secret court proceedings), would identify the users of such services. This is not a new concern, but it matters for something that would be so popular and available to so many millions of users in countries with different political situations.

koush commented 11 years ago

@pulser It's not possible to determine who is using the system from just their number. You must have their number and email. This is actually how I got around the number verification issue. The user verifies their email, and then "claims" they own a number.

For example, suppose my email is foo@gmail.com. I log into the server and "claim" the number 555-1234567 (which isn't even a real number). The server hashes my email and stores it alongside the number I claim to have. When you try to send a message to 555-1234567, the app looks in your address book for that number. It finds all emails associated with that number, and then hashes the emails. It sends the email hashes and the number up to the server and asks for a registration id. If the server finds a matching email-hash/number pair, it will return the registration id for that contact to you. If you don't have a contact with the number 555-1234567 and email foo@gmail.com in your address book, the server won't be able to find your information.

Basically, you need to know an email and the phone number of a contact you wish to find.
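Roughly, in Java-ish sketch form (not the actual server code, just the shape of the pairing and lookup, collapsed into one class for brevity; in the real flow the client sends the hash and number up):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the email-hash/number pairing described above.
// Class names, the hash choice, and the storage layout are all hypothetical.
public class RegistrationLookup {
    // Server side: maps "emailHash|number" -> GCM registration id.
    private final Map<String, String> registrations = new HashMap<>();

    static String sha256Hex(String input) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Called when a user "claims" a number after verifying their email.
    public void claim(String email, String number, String registrationId) throws Exception {
        registrations.put(sha256Hex(email) + "|" + number, registrationId);
    }

    // Called on behalf of a sender who has both the number and an email for
    // the contact in their address book. Returns null if no pair matches.
    public String lookup(String contactEmail, String number) throws Exception {
        return registrations.get(sha256Hex(contactEmail) + "|" + number);
    }
}
```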

Granted, you bring up a fair point about the numbers themselves: if the server were compromised, they could be used to see whether a user was using secure messaging.

To that end, I'm already investigating ways to embed cleartext onto the end of normal text messages to silently notify peers that they are capable of secure messaging, similar to OTR. The problem is that if the key or registration exchange happens over SMS, that's certainly all being logged somewhere, so there is still proof. But yeah, I'm looking at ways to remove the server component; I'm not a huge fan of it.
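Conceptually something like this, though the marker string and detection logic here are made up and nothing about the format is decided:

```java
// Placeholder sketch of tagging outgoing SMS with a short cleartext marker,
// roughly in the spirit of OTR's whitespace tag. The marker is hypothetical;
// in practice it would need to be unobtrusive and survive SMS encoding.
public class CapabilityTag {
    private static final String MARKER = " ~psms~"; // hypothetical marker string

    // Append the marker to an outgoing message if it isn't already present.
    public static String tagOutgoing(String body) {
        return body.endsWith(MARKER) ? body : body + MARKER;
    }

    // Detect whether an incoming message advertises secure-messaging support.
    public static boolean peerAdvertisesSecureMessaging(String incomingBody) {
        return incomingBody.endsWith(MARKER);
    }
}
```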

pulser commented 11 years ago

That's actually a very interesting concept, and I see a possible way to move on from there to avoid the issue.

To avoid the server component, have you considered the use of a distributed hash table (DHT)? I'm envisaging a modified DHT, whereby the user combines his number and email (as you do in your example) and stores the result alongside a key.

The difference here, though, is that we do not actually need to know it in plaintext. To continue my example: say you wanted to add me, bar@gmail.com, phone +1-555-555-5678. You'd take these two values and combine them using a cryptographically secure hash (e.g. SHA-256 or something similar).

Since anyone can read the DHT, the data is now "public". But do you fancy trying to brute force possible email addresses for the user, based on their phone number? (even with a list of all valid phone numbers for a country, you're still guessing at emails fairly heavily). Sure, if you are the telecoms company, you may have my email address from packet inspection, or from registering for online billing, but at that point, you now have ALL the information which we can possibly use to realistically mark users.

The DHT would store the user's crypto-hash as the identifier (you can look me up by repeating the hash, since you know the data). The DHT in itself though, to an outside inspector, contains just a pile of SHA-256 hashes, and associated public keys. It wouldn't be too onerous to also fill the DHT up with some "seed" data routinely, just to put a little bit of "noise" into it. It wouldn't ever affect an actual client, as they wouldn't ever request one of these hashes, but it would mean that there were a large number of "non-valid" entries within the hash table, just to throw off an attacker.
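Very roughly, the idea would look something like this (a HashMap stands in for the DHT, and the key format is just one possibility):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed DHT lookup: the identifier is a hash of
// (email + number), the value is the contact's public key. The real DHT is
// replaced by a HashMap purely for illustration.
public class DhtDirectory {
    private final Map<String, byte[]> dht = new HashMap<>(); // hash -> encoded public key

    static String identifierFor(String email, String e164Number) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest((email + "|" + e164Number).getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Publish my public key under the opaque identifier.
    public void publish(String email, String e164Number, byte[] encodedPublicKey) throws Exception {
        dht.put(identifierFor(email, e164Number), encodedPublicKey);
    }

    // A sender who already knows both the email and the number can repeat the
    // hash; an outside observer sees only opaque identifiers and public keys.
    public byte[] lookup(String email, String e164Number) throws Exception {
        return dht.get(identifierFor(email, e164Number));
    }
}
```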

If you want to protect the DHT entries from future modification, you could require the public key that's uploaded to be signed by a private key derived deterministically from the user's device configuration plus their own password. This is not a "fully worked out" solution - how do we handle a user who loses his old phone and moves to a new one? Perhaps a deterministic algorithm using only user-provided data to generate the key? This would just prevent an unauthorised party from editing a DHT entry, since the signature wouldn't match the old one, and nobody on the network would allow or accept the update.
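As a very rough sketch of the deterministic-key idea, assuming a password-based KDF and Ed25519 via BouncyCastle (none of this is a vetted design, and the inputs and KDF parameters are illustrative):

```java
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import java.nio.charset.StandardCharsets;
import org.bouncycastle.crypto.params.Ed25519PrivateKeyParameters;
import org.bouncycastle.crypto.params.Ed25519PublicKeyParameters;
import org.bouncycastle.crypto.signers.Ed25519Signer;

// Hypothetical deterministic signing key: the same user data + password
// regenerates the same key pair on a new device.
public class DeterministicSigner {

    // Derive a 32-byte Ed25519 seed from user-provided data plus a password.
    static byte[] deriveSeed(String email, String e164Number, char[] password) throws Exception {
        byte[] salt = (email + "|" + e164Number).getBytes(StandardCharsets.UTF_8);
        PBEKeySpec spec = new PBEKeySpec(password, salt, 100_000, 256);
        return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec).getEncoded();
    }

    // The signing public key would be published alongside the DHT entry.
    static Ed25519PublicKeyParameters signingPublicKey(byte[] seed) {
        return new Ed25519PrivateKeyParameters(seed, 0).generatePublicKey();
    }

    // Sign the DHT entry (e.g. the encoded messaging public key).
    static byte[] sign(byte[] seed, byte[] dhtEntry) {
        Ed25519Signer signer = new Ed25519Signer();
        signer.init(true, new Ed25519PrivateKeyParameters(seed, 0));
        signer.update(dhtEntry, 0, dhtEntry.length);
        return signer.generateSignature();
    }

    // Peers reject an update whose signature doesn't verify against the
    // signing key already associated with that entry.
    static boolean verify(Ed25519PublicKeyParameters signingKey, byte[] dhtEntry, byte[] signature) {
        Ed25519Signer verifier = new Ed25519Signer();
        verifier.init(false, signingKey);
        verifier.update(dhtEntry, 0, dhtEntry.length);
        return verifier.verifySignature(signature);
    }
}
```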

Re: the OTR part, indeed the issue is that OTR requires a Diffie-Hellman exchange to occur, meaning both clients must be present at once. That's the tricky part: both must be online at the same time to do it.

koush commented 11 years ago

From my G+ comment:

> Also, it's funny that you mention DHT. I JUST built a DHT client for Android a month ago, and was musing with @cyanogen yesterday that it could be used as the basis for peer-to-peer lookup. (In fact, the bencoding classes I use in PushSMS are pulled from said torrent client :)

There are a lot of problems that arise, mostly that numbers aren't necessarily canonical:

5551235678 vs 1-555-123-5678 vs 1-555-1235678, etc.

Furthermore, GCM registration ids are mutable. I think the idea is good, but it may be over-engineered. A DHT also has the downside that building the peer table takes a minute or so, and you need to stay connected to the swarm.

koush commented 11 years ago

Gonna run into the office now, just brain dumped my basic thoughts a bit haphazardly there.

pulser commented 11 years ago

Sure. Indeed, there are issues with how numbers are displayed, although I believe there is a standardised form. I'd suggest that to do it "right", numbers should always be taken in international (E.164) format. Email addresses are less of an issue. Personally, I'd argue that if you're dealing with non-canonical numbers, you have the same issue in plaintext anyway: surely the service or client needs to strip hyphens and convert to international form (if that's desired) regardless?
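For example, canonicalisation could lean on Google's libphonenumber library (assumed here as a dependency) rather than hand-rolling it:

```java
import com.google.i18n.phonenumbers.NumberParseException;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;

// Normalise any user-visible number form to E.164 before hashing or lookup.
public class NumberCanonicalizer {
    public static String toE164(String raw, String defaultRegion) {
        try {
            PhoneNumberUtil util = PhoneNumberUtil.getInstance();
            PhoneNumber parsed = util.parse(raw, defaultRegion);
            return util.format(parsed, PhoneNumberFormat.E164);
        } catch (NumberParseException e) {
            return null; // caller decides how to handle unparseable input
        }
    }
}
```

So "1-555-123-5678", "5551235678", and "1-555-1235678" (with "US" as the default region) would all map to the same canonical string before being hashed.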

Indeed, the GCM IDs being mutable could be a concern. I was considering a DHT only as a way to securely get keys tbh. This would add an element of trust as no one party could ever change a key, unless he/she knew the "deterministic" key that was used to sign the original key.

We may well be entering over-engineered territory here. I guess we need to pick a "worst case" scenario that this solution initially aims to solve, and expand from there. In the first instance, that may well be simpler, where we can trust the central server...

Re: the DHT, it could be an open server grid that anyone can join by connecting up and signing up for it. Then we give the client a hardcoded list of the available servers. The actual client doesn't need to connect to the DHT if the user is willing to trust the pool of servers on it; they can query, say, 3 of the 30 servers. That gives them a reasonable degree of certainty that a result is unmodified and authentic, and it means the client doesn't have to join the DHT (unless they really want to, and are paranoid), while retaining the decentralised system :)
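Roughly, the "query a few servers and compare" part could look like this (all names hypothetical, and the transport is abstracted away):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of querying a handful of pool servers and accepting a key only if a
// majority return the same answer.
public class PooledLookup {
    interface DirectoryServer {
        byte[] lookup(String identifier); // returns an encoded public key, or null
    }

    public static byte[] lookupWithAgreement(List<DirectoryServer> servers, String identifier) {
        Map<String, Integer> votes = new HashMap<>();
        Map<String, byte[]> byFingerprint = new HashMap<>();
        for (DirectoryServer server : servers) {
            byte[] key = server.lookup(identifier);
            if (key == null) continue;
            String fingerprint = Arrays.toString(key);
            votes.merge(fingerprint, 1, Integer::sum);
            byFingerprint.put(fingerprint, key);
        }
        // Require strict majority agreement among the queried servers.
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            if (entry.getValue() > servers.size() / 2) {
                return byFingerprint.get(entry.getKey());
            }
        }
        return null; // no majority; treat the result as untrusted
    }
}
```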