Security discussion for requested legacy ID lookup in GlobalRegistry

sgoendoer commented 7 years ago

As requested by @pchainho in Lisbon, I started working on the implementation of the legacy ID lookup in the Global Registry as discussed with @rjflp. Anyhow, I found some security flaw in the system. Hence, I stopped the development for now until the issue is resolved.

Long story short: I would say we drop the feature of looking up users via alternative identifiers in the GReg. Looking up users via other identifiers is THE use-case for the discovery @ingofriese. Here is why:

Security of the dataset:

The assumption here is that an attacker controls at least one server in the Global Registry. If servers are only run by trusted parties, this is rather unlikely to happen, still this is the basic assumption for all of the following attacks. Having control over one of the nodes allows an attacker to effectively disable all security checks server side, allowing him to write any kind of data to the DHT, especially overwriting all existing data.

Dataset security: The integrity of the dataset is based on the combination of the key, the salt, and the GUID - and ultimately the digital signature. The GUID is used as the lookup key, meaning resolving a GUID gives you the dataset (with the public key and the salt).

Attack scenarios: -The attacker exchanges data in a dataset and overwrites the legitimate dataset with a fake one.

The dataset is digitally signed with the user's private key. Any change of the contents result in an invalid signature, which is easily to be detected client-side.

The attacker exchanges the keypair to create a valid signature.

With the new keypair, the attacker can create a valid signature for the altered dataset. The dataset appears valid. Anyhow, exchanging the key results in either a mismatch of GUID and key/salt or a new GUID. Such a mismatch is easily detected on client-side (assuming no further checks are performed by the GReg nodes). If the attacker updates the GUID to prevent an integrity violation, the attack is automatically deflected as requests to the "fake" dataset has to be requested by the new (fake) GUID.
The attacker tries to find a collision for the GUID he wants to attack

Here, the attacker has to find a collision for PBKDF#2 using SHA256 with 10000 iterations. Simply trying to brute-force the salt is very unlikely as the salt is limited in length (16 chars).
The attacker overwrites the dataset contents with fake data to prevent access to the dataset.

This resembles a DoS attack and only works under the assumption that all GReg Nodes overwrite existing data without any further checks.
The attacker knows an (outdated) version of the dataset and uses it to overwrite a newer version

As signature and GUID are valid, this looks fine at first. Anyhow, all datasets have a specified timeout, and also works only under the assumption that all GReg Nodes overwrite existing data without any further checks. Depending on the "lifetime" of a dataset as specified per "timeout", this attack is only able to block access to "new" data in the dataset, but does not allow the attacker to add or alter data.

Hence, actually taking over a dataset requires either knowledge of the private key or the ability to recreate a valid signature.

Based on this concept, we discussed adding lookup functionality for datasets via other identifiers, e.g. a telephone number. The basic design goes as follows:

A new type of dataset is introduced: the "reference"
A reference is very similar to the "normal" dataset: It is a JWT with a JSONObject as payload data. The reference-JWT is signed using the user's private key (from the regular dataset)
The reference-dataset comprises an ID (e.g. the telephone number or email-address), the GUID of the user, a timestamp of creation, and a timeout (which should be the same as the timeout of the regular dataset). I.e., the reference-dataset "points" to the regular dataset via the comprised GUID.
The (hashed) ID (telephone number) is used as the lookup key for a reference dataset.
Each user can have [0 ... n] reference datasets pointing to his regular dataset.
Optional: The regular dataset has a list of references, functioning as pointers back to the references.

This way, basically an extended version of the regular dataset is used, just in a "separated" way where a part of it (the reference-dataset) is stored under a different lookup key.

The lookup of a dataset via a reference-ID would look like this:

The requesting user sends a GET request to the GReg to a new endpoint, e.g. /ref/:id
The node retrieves the reference-dataset from the DHT and extracts the GUID.
The node retrieves the regular dataset for this GUID from the DHT.
The node verifies the signature and structure of the regular dataset
With the key from the regular dataset, the node verifies the signature of the reference-dataset
Optional: The node verifies that the regular dataset has the reference-ID in it's list of references.
The node returns the jwt of the regular dataset to the requesting user.

Reasons for this design:

If there is no digital signature, anyone can publish and overwrite any other reference-dataset.
As the reference-ID is not created by hashing the user's key, the key is not bound to the lookup key. An attacker could easily exchange the key and create valid signatures.
Alternatively - or additionally - the regular dataset stores a "link back" to the reference-dataset.

Anyhow, there is one MAJOR problem: Anyone can create reference-datasets for any reference-id - even with valid signatures. Also, overwriting reference-datasets without this being detectable is possible: Assume, Alice with GUIDA has a reference with ID RIDA to her dataset. Bob wants to "steal" her ID and creates a reference-dataset with the same ID RIDA. The reference-dataset is then signed with Bob's private key - or even the key from a made-up dataset. This is not detectable by checking the signature or the dataset's formats.

Possible work-around: Reference-datasets need to be signed by the original issuer of the id. E.g. the email provider in case it is an email address, or the carrier in case it is a telephone number. Anyhow, this would require all possible providers to "hand out" signatures for these IDs. Also, signatures from a ReThink trusted CA can be applied. But then, we would have a centralized service component again and we would need to come up with a credible idea how to verify that a user X who requests a reference-dataset to be signed actually owns that ID.

As of now, I don't see any "secure" ways to work around this. Are there any other ideas for realizing this feature without allowing attackers to take over the referencing datasets?

rjflp commented 7 years ago

@sgoendoer Two questions/comments about our post:

Having control over one of the nodes allows an attacker to effectively disable all security checks server side, allowing him to write any kind of data to the DHT, especially overwriting all existing data.

This is true for every data stored into the DHT, even GUID. We are fighting this in two ways:

Provide way to check that the retrieved data is valid (done for the GUID, a method is needed for the userID)
Limiting the participation in the DHT to trusted partners. He currently have two methods CA and Blockchain and evaluation both on planet lab.

However, an attacker can still cause a DoS by replacing a valid entry with garbage. At least for the values stored in his nodes. The good news is that such attacker would be discovered and could be banned from the DHT.

The attacker knows an (outdated) version of the dataset and uses it to overwrite a newer version

Is a timeout used or a timestamp? A timestamp would solve the issue as nodes would refuse to replace an entry with older data.

rjflp commented 7 years ago

I agree with @sgoendoer: I also see no way of doing this for "legacy" identifiers. However, after discussing this with @pchainho and @Ricardo-Chaves I think we may have a chance of doing this for rethink identities only.

What we need as a way of verifying the authenticity of the new entry (userId->GUID), the same way we currently can verify that an GUID->userIds@SPs entry is valid. The IdP can provide assistance there. Each rethink UserId will have an associated IdP.

IdPs provide the following service:

The user can ask the IdP to create an assertion. He gives the IdP a value.
Upon successful authentication, the IdP will hold the value and return a key (assertion)
Anyone can present that key (assertion) to the IdP and receive the value back from the IdP

What I propose is to put the following in the Global Registry DHT is: UserId -> (GUID, assertion)

So, for inserting an entry into the DHT, the user would:

Contact the IdP associated with the userId and create an assertion with the value of the GUID
Write the entry into the DHT: UserId -> (GUID, assertion) The DHT can check (see below) before accepting the write.

When someone else performs a read using a userId, it would get back the GUID and assertion. In order to check its validity, it would:

Contact the IdP associated with the userId, giving it the assertion
In return, the IdP will have to return the same GUID. If not, the entry is fake

In order for this to work, the IdP would have to keep the assertion for a long time. How long? Months, years? Does this even make sense?

For the Global Registry nodes to be able to talk to the IdPs, they need to be able to load the adequate protostub, thus require a runtime. Including the runtime in the current implementation of the Global Registry is probably too complex. An alternative would be to pair each node with a trusted runtime it could use for this verification.

jmcrom commented 7 years ago

odd things happen here: we began designing legacy ID lookup in GlobalRegistry and we end up with non- legacy ID lookup in GlobalRegistry

more seriously, both @sgoendoer workaround and @rjflp solution rely on the same principe: the id issuer provides something that help the GReg to verify these new datasets

another way could be: having these datasets signed by the user's private key (so attached to the GUID)

sbecot commented 7 years ago

Well, besides the discussion, I don't even see any justification on using the Greg instead of the discovery service. Would it be possible to have a valid use case?

For the Global Registry nodes to be able to talk to the IdPs, they need to be able to load the adequate protostub, thus require a runtime.

The IdPProxy is something that has been normally designed before (and out of) rethink and should not depend of the runtime. What you are talking about is the IdPStub provided in the catalogue, which is a wrapper of the primary concept. Anyway, I fully disagree of including a runtime in the Greg. Including such a dependency would increase complexity and go dramatically against the first requirement:
NF1: The reTHINK defined framework aims more agility in service development by reducing the dependencies possible between modules (loosely coupled architecture)

sgoendoer commented 7 years ago

Thank you @rjflp, @jmcrom, and @sbecot for your comments

This is true for every data stored into the DHT, even GUID.

True. The approach of a limitation to trusted nodes allows this.

Is a timeout used or a timestamp? A timestamp would solve the issue as nodes would refuse to replace an entry with older data.

A timestamp is used. The idea is/was that the nodes check wether the datasets already timed out and refuse to deliver/handle them if so.

IdPs provide the following service... [...] Write the entry into the DHT: UserId -> (GUID, assertion)

This is doable. Anyhow, this requires the IdP to provide services and interfaces of creating assertions as well as verifying them. The overall lookup protocol would then look like this:

Alice has Bobs UserID Bob@x.com
Alice resolves Bobs UserID via the GReg and gets a dataset comprising Bob's GUID and an assertion
Alice looks up Bob's IdP to verify the assertion
With the verified information, Alice resolves the GUID to Bobs dataset (With Bob's UserIDs etc)
Alice contacts (one of) Bob's CSP Domain Registries.
...

In this scenario, there's a question:

If Alice already knows Bob's UserID, why wouldn't she contact the CSPs DomainRegistry directly? This obviously only works if the UserID comprises the CSP's domain name.
If said UserID is not bound to a CSP domain - but is just using a domain name of an IdP - wouldn't it be MUCH more comfortable, easy, and simple if this IdP would just "hand out" the GUID directly? It's a trusted entity after all. Ultimately, if we do this already, why not use the IdPs to lookup the endpoints right away? Said IdPs would directly resolve the UserID to the CSP's domain. Then anyhow, we would not need the GReg anymore and abandoned domain-independent identifiers altogether.

rjflp commented 7 years ago

@sgoendoer

If Alice already knows Bob's UserID, why wouldn't she contact the CSPs DomainRegistry directly? This obviously only works if the UserID comprises the CSP's domain name.

The user is free to use the same IdP with different CSPs. If a UserId was only associated with an CSP the problem would not exist.

If said UserID is not bound to a CSP domain - but is just using a domain name of an IdP - wouldn't it be MUCH more comfortable, easy, and simple if this IdP would just "hand out" the GUID directly? It's a trusted entity after all. Ultimately, if we do this already, why not use the IdPs to lookup the endpoints right away? Said IdPs would directly resolve the UserID to the CSP's domain. Then anyhow, we would not need the GReg anymore and abandoned domain-independent identifiers altogether.

I guess we are using IdP technology that is already standardized and can't be changed.

@pchainho can you please provide us with the motivation/use cases for this?

sgoendoer commented 7 years ago

The user is free to use the same IdP with different CSPs. If a UserId was only associated with an CSP the problem would not exist.

True. Anyhow, if we wanted to build the lookup of users in ReThink around IdP-IDs, we would have / should have created something different than a DHT-based registry. Then, we probably would have/should have specified that the IdPs provide some kind of directory service interface pointing to either the domain registry or BE the domain registry altogether.

@pchainho A usecase why we need this and in which way and scenarios this would be used would be helpful

pchainho commented 7 years ago

In terms of use cases, I think we have already discussed in several issues (eg this) but in general they are the ones implemented in our scenarios and demos, driven by WP1 use cases.

sgoendoer commented 7 years ago

Yes, but this is THE use-case for the discovery, isn't it?

sbecot commented 7 years ago

I agree with @sgoendoer, there is no justification in here for your demand, and more, @rjflp comment is in contradiction. The discovery service can allow to retreive a GUID, this GUID can be used with the Global Registry to find the domains where to reach the user, then you can use the domain registry to reach him/her. It's 3 steps, no need for a direct search in the global registry with something else.

pchainho commented 7 years ago

yes, this is discovery that only requires the usage of the Global registry. I just want to discover hyperties per user Identifier managed by an IdP.

I don't think T-Labs discovery service was designed for this. I understand it more like a Search Engine (similar to Google search or Microsoft Bing) to be used by humans that anyone can implement and provide as a Service / App on top of reTHINK framework.

Why performing 3 steps if 2 are enough?

rjflp commented 7 years ago

@sbecot I don't think my comment is contradictory. It is precisely because I see the Discovery Service much like a human usable search engine, just like google or bing or yahoo, that I concede that if we indeed need a service, usable by the runtime, for providing a UserId->GUID mapping, the Global Registry could be a valid option.

I've always seen the Discovery Service like a place were companies could innovate and compete, e.g. by combining data with other sources (such as Facebook, webpages, etc) not restricted to rethink. Binding the Discovery to a service to the runtime will hinder its innovation.

Another issue is: which Discovery Service to use? In my view, the one being build in the project is not "the" Discovery, but "a" discovery service. Hopefully, there will be others. Each will have its own view of the world a be able to provide different results.

Let me provide an analogy: to access another service, software uses DNS (registry). Humans use search engines (discovery).

sbecot commented 7 years ago

@rjflp I agree with the view of the Discovery service, but nothing prevent to use it with automatic searches. @rjflp : is it possible to lookup for a name of the contractor or an IP in the DNS? You usually lookup the DNS with the fqdn. @pchainho: You are performing 3 steps only the first time, if/when you don't have the GUID. Then you have it and can only use 2 steps. It is the same when you use the public directory to look for a phone number and including it in your own directory.

pchainho commented 7 years ago

By my "own directory" you mean the Graph Connector?

I was not thinking about having all people / things I communicate to, to be stored in the Graph Connector.

On the other hand, as soon as I have retrieved from the Global Registry the GUID and associated dataset. I can directly perform a query to the Domain Registry. Ie I would have 2 steps for the 1st time and then just 1 step.

rjflp commented 7 years ago

@sbecot You can perform a DNS search for an IP address, which returns the canonical name of the host. It is called a reverse DNS search. You can do it with dig, e.g. "dig -x 185.63.192.20". It is implemented the same way as regular DNS names, be using a special TLD (.in-addr.arpa) and reversing the IP octet order. E.g. 185.63.192.20 is represented as 20.192.63.185.in-addr.arpa.

What do you mean by contractor?

jmcrom commented 7 years ago

@pchainho 3 cases for Alice to call Bob:

Alice does not know much about Bob but his name and some assocciated elements (city, work, etc.): she uses the Discovery service to get the GUID, then the Global Registry to get a list of CSPs and finally a CSP to get contact endpoints.
Alice has already tried to reach Bob and she has his (permanent) GUID (e.g. stored in address book or call log, got from Bob thru some barcode): she only has to go to the Global Registry then to the CSP.
Alice knows which CSP to contact (e.g. a rendezvouz has been scheduled): she only needs to go to the CSP.

so sometimes 1 step is enough, sometimes she needs 2, sometimes 3

ingofriese commented 7 years ago

To me the ID lookup is not something that belongs to the Global Registry The whole process encompasses 3 modules:

Domain registry for listing of all current Hyperty URLs of a certain uID in this domain at this time
Global Registry is a one to many mapping. But one-way....for good reasons. (If a user has 2 GUIDs ...one he wants to keep private and ...one public. When both entries have a entry sip: abc ...then both of its GUIDS could be seen if also a backward mapping userID to GUID were possible) That’s a privacy breach.
Discovery Service is an any to any mapping but under fine-grained user control. This is the place where to put context, personal info, geographic information and also other identity information. Here the use-case "I have sip address and I want to contact this guy with reTHINK framework" is possible. Backward resolving uiserID to GUID is here possible if the user allows for. @Ricardo-Chaves This is the difference btw gReg an Discover (one to many - one-way vs. many to many -one-way or not is defined by the user)

@pchainho The use-case you want to implement can be done already with the services we have in place What I don't understand is your argumentation that discovery service is not really a reTHINK service because some else can build the same and make business out of it. Correct me if I'm wrong but everybody can also setup eg. DHTs and build GUIDs. I don't understand why you give discovery service another status than other services. I give you an example why all 3 steps have the same "core-level" I can call a hyperty just by knowing the Hyperty URL. That’s it. Someone else thus can implement something similar like Domain Registry, Global Registry and Discovery. Of course Domain Registry is build in. But no-one is obliged to use it. Maybe he can build e.g a more sophisticated or optimized process. So to me all 3 resolving/discovery steps a complentary building blocks. And depending on the use-case you need 1, 2 or three of them (see @jmcrom examples)

sbecot commented 7 years ago

@rjflp true, but the dig is another service and is not the nominal case. More: if you do a reverse search you can find a different answer (fqdn1->IP->fqdn2). By contractor I meant the one who subscribed to the DNS entry. Anyway we derive from the original case, and I think the case have been solved during the WP4 call.

pchainho commented 7 years ago

@jmcrom

so sometimes 1 step is enough, sometimes she needs 2, sometimes 3

True, but in your examples, you don't have the original use case where discovery is performed for a certain IdP identifier selected by the user when Hyperties are loaded. Also, as soon as you have retrieved for the 1st time the GUID and associated dataset from the Global Reg, next time I can directly perform a query to the Domain Registries ie only one step is needed not two. In case there are Domains with no entries for the user, this means your local registry data is outdated and you have to query the Global Registry again but I don't think this will happen very often ie user changing domains.

@ingofriese

What I don't understand is your argumentation that discovery service is not really a reTHINK service because some else can build the same and make business out of it. Correct me if I'm wrong but everybody can also setup eg. DHTs and build GUIDs. I don't understand why you give discovery service another status than other services.

Sorry, perhaps I didn't express well. What I've tried to say is that basic global discovery queries done with IdP Identifiers used by reTHINK users should be very performant and the Global Discovery should be enough. I like very much "your" discovery service and I think it provides lots of added value on top of the core functionalities to provide more advanced search queries that could even use AI for natural language queries. But this kind of service can be very different, including the user experience, depending on its provider, the market you want to target, the innovation and skills you apply on it, etc. For example, the user profile model can be very different using different types of info or media about the user. At the end, User Profiles can't be standardised. I also think that we don't have to force the user to create a user profile in order to be able to user reTHINK services. On the other hand, Domain Registry and Global Registry data model have to be standardised and there is not much room for differentiation in terms of functionalities (perhaps more in terms of non-functionalities). At this point, I don't see a business stakeholder making money from providing Global Registry services but something more similar to ICANN for DNS ie independent operators that have no commercial goals.

Regarding, user privacy, I think this is well addressed having different Identities from different IDPs together with policies, all under full control of the user.

reTHINK-project / dev-registry-global

Security discussion for requested legacy ID lookup in GlobalRegistry #22