Discovery of Users - Githubissues

srkunze commented 9 years ago

I could not find anything useful on this topic, thus this thread.

Recently, I introduced Tox to my girlfriend and sent her my ToxID. She tried it out on her tablet. She was even more surprised by the long identifier when she discovered that she needed to scroll to the right to see it in full.

Long story short, she liked the encryption, group, voice and video features, but those IDs... she compared them to "useless ICQ numbers". I know technically that is not quite the same, but from what I know Tox aims to be as user-friendly as it can be.

User discovery is an important part here as well. Is there a way to search by user name? I know user names are not unique, but it would be a start, and if there are two users with the same username, e.g. the last two digits of the ToxID could be used (and exchanged manually) to distinguish the buddy from other guys.

aaannndddyyy commented 9 years ago

@tttom : how can they get your number? it's hashed. that cannot be reverted.

I don't think free text and complicated parsing should take place. There should only be a few possibilities, and those should be in a standardized format, e.g. numbers with or without country code yield totally different hashes. So you take your number, hash it and announce your tox id together with that hash. Additionally, if you want, you hash myNick or name,lastname,city and only/also announce your id with that hash. That's it. Anton Announcer hashes "anton,announcer," (he didn't provide a city) and announces the resulting hash together with his tox id.

Susan Seeker looks for Anton Announcer by hashing his name or number (in the case of the number she won't find anything) and querying for the associated record. She gets a tox id and sends a friend request, which is then accepted.

Now she can start chatting. But since Susan is not sure whether there are maybe three people with that name, or whether there's an impostor, she authenticates him. She could use a QR code, but then she wouldn't need the lookup service, so she chooses to ask Anton a private question whose answer only he can know. For people you don't know, you'd rather use the DNS system of the organization that person is affiliated with. That would be a trusted party then.
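
A minimal sketch of the announce-and-lookup scheme described above. Everything here is an assumption made for illustration: SHA-256 as the hash, lower-casing as the normalization step, an in-memory dictionary standing in for the lookup service, and the helper names `record_key`, `announce` and `lookup`. The Tox ID is a placeholder string, not a real one.

```python
import hashlib

def record_key(kind: str, value: str) -> str:
    """Hash a normalized search attribute into a lookup key."""
    # Assumption: normalization is just trimming and lower-casing.
    normalized = value.strip().lower()
    return hashlib.sha256(f"{kind}:{normalized}".encode("utf-8")).hexdigest()

# Hypothetical in-memory stand-in for the lookup service.
directory: dict[str, list[str]] = {}

def announce(kind: str, value: str, tox_id: str) -> None:
    """Publish a tox id under the hash of one attribute."""
    directory.setdefault(record_key(kind, value), []).append(tox_id)

def lookup(kind: str, value: str) -> list[str]:
    """Return every tox id announced under the hash of this attribute."""
    return directory.get(record_key(kind, value), [])

ANTON_ID = "A" * 76  # placeholder, not a real Tox ID

# Anton announces by name only (no city, no phone number).
announce("name", "anton,announcer,", ANTON_ID)

# Susan finds him by name, but a lookup by phone number returns nothing.
found = lookup("name", "Anton,Announcer,")
missing = lookup("phone", "+000000000")
```

Several people may announce under the same name, which is why `lookup` returns a list. The service only maps hash to id, so holding the id gives no direct way back to the attribute.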

srkunze commented 9 years ago

so you take your number, hash it and announce your tox id together with that hash.

Signing might be helpful, too, right?

tttom commented 9 years ago

Re:'how can they get your number? it's hashed. that cannot be reverted.'

They would use your number, email, or name to get your ToxID. They just hash either and do a lookup.

I agree that parsing should be minimal and input standardized somehow, e.g. international phone number as I already suggested.

aaannndddyyy commented 9 years ago

@tttom my reply was to your "spammers may get your phone number or e-mail easily unless you hash a password with it.". So, only those who already have your number can also get your tox id, if you choose this method. If you choose to be findable by name, they can find your tox id without knowing your phone number, but they still won't find your phone number that way.

aaannndddyyy commented 9 years ago

@srkunze Signing? Why? The receiver doesn't know the key, so can't verify. Unless he knows your PGP key or so. Then you could also publish your toxid linked to your public key, and sign the record, yes.

srkunze commented 9 years ago

ToxID != public key?

aaannndddyyy commented 9 years ago

You gain no security by signing your record with a random tox public key. An impostor can sign his record too. Or simply another user with the same name. It's about which keys you trust.

srkunze commented 9 years ago

Yeah, sure. I mean, just in case some nodes might think to tinker with your search data. They could only remove it then.

aaannndddyyy commented 9 years ago

ah, yes. only owner of the toxid could then publish for that toxid. would be a good feature to have.

ArchangeGabriel commented 9 years ago

About phone number hashing, I must disagree: this is reversible, because building a rainbow table for a fixed, digits-only format is really easy. Adding a salt will shrink the issue a lot (slowing down spam bots considerably), but it still remains (as a user who knows only the name and location of someone else, I could still get the number). So, it seems to me that if you want to avoid this, the number must be the only information hashed, which means people have to know your number to find your ID. That, or the number should not be included.
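
To illustrate why a hashed phone number is easy to reverse, here is a small sketch that simply enumerates a fixed-format number space. The hash (SHA-256), the made-up number and the five-digit search space are assumptions chosen so the example runs quickly. A real national numbering plan is larger, but still small enough to enumerate.

```python
import hashlib

def phone_hash(number: str, salt: str = "") -> str:
    """Hash a phone number, optionally prefixed with a salt."""
    return hashlib.sha256((salt + number).encode("utf-8")).hexdigest()

def recover_number(target_hash: str, prefix: str, digits: int, salt: str = ""):
    """Try every number of the form prefix + `digits` decimal digits."""
    for n in range(10 ** digits):
        candidate = f"{prefix}{n:0{digits}d}"
        if phone_hash(candidate, salt) == target_hash:
            return candidate
    return None

secret = "+33600012345"          # made-up number, for illustration only
published = phone_hash(secret)   # what a lookup record would contain

# An attacker who knows only the format recovers the number by enumeration.
recovered = recover_number(published, "+336000", 5)
```

A salt only helps while it is unknown to the attacker. Once the salt is known, the same loop works with `salt=` set accordingly.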

So, I’m not sure what information should be included in the hash fields. I would go for user name, first and last name, location, email.

But remember, this must be opt-out, I might not want my ToxID to be linked to my real identity in any way.

aaannndddyyy commented 9 years ago

@ArchangeGabriel Of course optional. I think I said that above. If not, then I only forgot to write it. But yes, just like the current user@toxme.se is optional. Maybe I don't want to be found. Or my id is an anon one. Re hashing, the way I described it means only the number gets hashed. People who have your number in their phone can then hash the number and find you. If you don't want to be found by your number, then don't publish it. There should be standards. You can either hash your number, or your nickname, or your full name in a given format, or your full name and city. Not necessarily all of that.

ArchangeGabriel commented 9 years ago

Yes, you said it. I just wanted to bring it up one more time.

Maybe there is something I don’t understand about your hashing proposal. Suppose I want people who know my number to be able to find me using it. And the same for, say, my e-mail. And I want them to be able to find me even if they know only one of these two. But then, I don’t want them to be able to get the one they don’t know this way. Reversing a hashed email is hard, but a phone number for which you know the format is easy.

aaannndddyyy commented 9 years ago

You understand it correctly. If you provide all the info, it would be possible to get the number. I had thought of it rather as alternatives. So you provide either your number OR your name OR your address. So you either want to be found by number, like in WhatsApp, or by name, like in Skype. If you want to publish both, in order to be findable by either, and want your number to remain unlinked to your name, that would be hard to accomplish. They would be different records, and a reverse lookup from id to number should not be possible. So the stalker cannot first look up your id using your name and then look up your phone number using your id. The one providing the lookup service could, however, correlate the two. But then again there are many people in public phone books too. And whoever does not want to be, simply does not publish that info.

ArchangeGabriel commented 9 years ago

OK, I agree with the XOR version.

Dirius77 commented 9 years ago

I think that if the values were stored as hashes, then someone finding one value given the other wouldn't be that much of a problem: they would have to break the hash. This is also simply dealt with by leaving out a feature: we would leave out the ability to search from a Tox ID, which isn't that important anyway.

The issue with phone numbers is that they are so small and follow a unique format, so hashing them still doesn't help much, we're making them longer without making them any more unique. This makes it easier for a spammer to guess your number and get your ID, which could be bad. The only solution I see to this would be to salt the phone number, or in some other way make it more unique, but that would remove the ability to find someone given their phone-number. I personally would suggest a username:number format, as that means that simply guessing a number won't get you the correct hash, they have to already know that the number belongs to YOU.

The issue lies in phone numbers themselves: without adding more data there isn't anything we can do to make them more complex. I don't think it is Tox's job to find a way to make a phone number unique; someone could just as easily make a bot to text and spam phone numbers as they could one to look up numbers on the Tox network. We deal with phone spammers and phishing schemes all the time; blocking someone on Tox because they guessed our number and we don't know them is no different.

ArchangeGabriel commented 9 years ago

Indeed, I was editing my above post about the intrinsic weakness of phone numbers in this situation: knowing the Tox ID of a person searchable by phone number, you will be able to learn their phone number even if that was not intended (e.g. I want my friends to be able to find me with it, but I also gave my ToxID to other people who I don’t want to have my phone number, while still wanting to be able to Tox with them). However, that is kind of solved with multiple ToxIDs, as is or is going to be (still catching up with my emails and notifications) implemented in qTox.

aaannndddyyy commented 9 years ago

The reason why numbers would be cool is THE killer argument for things like WhatsApp: you install it and have your friends already there. No need to ask them for their username and add them one by one. And numbers are unique already, although unique does not mean authenticated. If spambots really become a problem, you can still change your nospam value, thus invalidating your phone number record.

aaannndddyyy commented 9 years ago

@ArchangeGabriel How do you get my number if you only know my tox id?

ArchangeGabriel commented 9 years ago

From what I understand, if you added your number (hashed or not, it’s the same) to a record linking it to your ToxID for people knowing your number to be able to find you, then I can get the record with your ToxID in it and get the number. Am I wrong?

aaannndddyyy commented 9 years ago

there is no reverse lookup.
name -> id
number -> id

not: id -> number

ArchangeGabriel commented 9 years ago

Is that possible? I mean, I don’t say that reverse lookup is implemented, but can’t you manually do this? I believe that if the association exists, you can do it in both ways.

aaannndddyyy commented 9 years ago

How? You know a name, query the lookup server with the hash of that name, and it returns an id. You query it with that id, and it returns nothing, or thinks it is a hash of something and maybe gives you a totally unrelated id back. Still no number. You could now query ALL possible numbers of a given country - given that you know the country the person lives in. That is the only way i see.

aaannndddyyy commented 9 years ago

And again, who does not want his number published, does not publish it at all. Millions of people are in phone books and other millions are not.

srkunze commented 9 years ago

But remember, this must be opt-out, I might not want my ToxID to be linked to my real identity in any way.

Not an opt-out. An opt-in!

srkunze commented 9 years ago

@ArchangeGabriel Nice you considered my issue "Multiple ToxIDs" here, too. I hope I can make Tox a bit better as I am personally really satisfied with it. Mainstream support is important; otherwise it makes no sense to use it.

srkunze commented 9 years ago

You could now query ALL possible numbers of a given country - given that you know the country the person lives in. That is the only way i see.

That is true. Well, I am no cryptographer but from this point of view, is hashing phone numbers sensible then? When trying out all phone numbers on earth, it basically makes hashing senseless.

However, I see value for other attributes.

srkunze commented 9 years ago

Would this help?

rehashing search attributes once in a while

ArchangeGabriel commented 9 years ago

I meant opt-in indeed, since you need to provide input anyway to enable this.

@aaannndddyyy When you say server, you mean the DHT, right? Otherwise someone with access to the servers can do this easily. And it’s centralized. And while home numbers are listed publicly (opt-out in France), there are no public listings for mobile numbers (at least here in France again). So the “WhatsApp killer feature” still makes sense. But as you’ve said, I consider the exhaustive search a real threat. To get around that and @srkunze’s remarks, I think a salt might help.

aaannndddyyy commented 9 years ago

What's important is that it gets implemented, not whether it is opt-in or opt-out. But as it was mentioned now, here's what I think:

opt-in means off by default. If the user does not find the setting to enable it, and we want that feature for a WhatsApp-like out-of-the-box experience with regard to discoverability, then we have already failed.
opt-out means on by default, possibly without the user knowing it. If a user then finds out his number got automatically published by his mobile Tox client, this might be a bad surprise for him.

Those defaults are there to make things easier and spare the user some thinking. But we cannot take all decisions for the user. And this is a decision he should take consciously, and therefore should be prompted a question or checkbox right when he creates an account, just like Antox does already for registration with toxme.

aaannndddyyy commented 9 years ago

@ArchangeGabriel current whatsapp users trust the servers with the numbers too. The hashing would be no real obstacle, but it is not expensive to hash them, so why not. At least it somewhat obscures things.

So if we had this scheme on a server, the user either trusts it, just like WhatsApp, or he does not, in which case he does not publish it. The DHT has the same issues, just that it is a bit more difficult. But not that much: simply get a node id in the vicinity of the hashes you want to harvest. Also I think tox devs don't want to store much in the DHT.

I don't understand how you can salt the numbers while keeping them discoverable. I'm all ears.

srkunze commented 9 years ago

And this is a decision he should take consciously, and therefore should be prompted a question or checkbox right when he creates an account, just like Antox does already for registration with toxme

Agreed.

srkunze commented 9 years ago

I don't understand how you can salt the numbers while keeping them discoverable. I'm all ears.

I am no expert in decentralized discovery. I guess the data will be stored on those nodes being in close proximity to the client which needs to be discovered. So, discovery works like routing to the client in question.

Or does every node hold all discovery data of the whole network? I doubt that.

Dirius77 commented 9 years ago

I'd assume that nodes would hold data the way discovery does, and that when they can't find the answer in their own stores they go looking for it. And storing hashed emails or numbers shouldn't be much of a burden. Say we're using a 512-bit hash. That means each field (say Email, Phone Number, Secret Phrase, Name) is 64 bytes of data. The ID is 32 bytes of data. So each entry on the node takes up (64*4)+32, or 288 bytes. Say Tox becomes huge and suddenly each node needs to hold 500 people's data. That is still only 144 KB. The logs from my chats are probably bigger than that by now. And this scales very nicely: 5,000 people is only 1.44 MB, and 50,000 is 14.4 MB, at which point it is starting to become large enough that searching through it might become a bit slow. But that is still a huge number for scalability. It also means dedicated nodes, if we had them (I personally would like to set one up), would be able to serve upwards of half a million clients without using more memory than the average flash drive can hold. I don't believe storing the data on the DHT is the issue.
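
The storage estimate above can be checked with a few lines of arithmetic. The constant names are made up for this sketch. The figures (a 512-bit hash, four hashed fields, a 32-byte ID) are the ones given in the comment, and megabytes are taken as 10^6 bytes.

```python
HASH_BYTES = 512 // 8   # a 512-bit hash is 64 bytes
FIELDS = 4              # email, phone number, secret phrase, name
ID_BYTES = 32           # the ID size used in the estimate above

# One entry: four hashed fields plus the ID.
ENTRY_BYTES = HASH_BYTES * FIELDS + ID_BYTES

def storage_bytes(users: int) -> int:
    """Bytes a node needs to hold one full entry per user."""
    return users * ENTRY_BYTES

for users in (500, 5_000, 50_000, 500_000):
    print(f"{users:>7} users: {storage_bytes(users) / 1e6:.3f} MB")
```

Half a million entries come to 144 MB under these assumptions, which matches the flash drive comparison.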

aaannndddyyy commented 9 years ago

The question was about salting the numbers, not about storage

srkunze commented 9 years ago

@Dirius77 Good analysis, thanks.

srkunze commented 9 years ago

@aaannndddyyy Sure. However, the point is whether you know the salt or not.

If each node can provide all the data, clients can go through all the search data themselves and re-hash the known search data and compare it to the data given by the network.

If clients cannot get their hands on all the discovery data, the only way (at least from what I can see) is to use the same salt (known in advance, like current date+hour) for all the data. As said before, I am no cryptographer, so I don't know whether that makes sense at all from a security point of view.

Dirius77 commented 9 years ago

To use a salt it would have to be something: a) Public. and b) NOT based on the phone number.

This removes a lot of possibilities from the salt options, but it may be possible to have the salt generated on a per-node basis? Ex: Request a search. Node responds saying it will allow a search. Ask for the node's salt. Node sends you the salt. Hash the data with the salt. Send it to the node. Node returns the Tox id.

The issue with this is that the data should be stored on more than one node. Or if that node does not know the data, the salt is worthless and makes it unsearchable in the rest of the network.

aaannndddyyy commented 9 years ago

I don't get it. You want to prevent the nodes that store the data from correlating a name-id entry with a phoneNumber-id entry. If that node decides on the salt, the salt is worthless against this. If the salt is public, the node can get that public salt too. If the salt is secret, a user knowing only a number cannot find you, which entirely defeats the purpose of having the number there. There is no way around that. Storing records in a distributed manner is a pretty good countermeasure, but it is not 100% (cf. Sybil attack). That is why number records, just like any other records, are optional, and users must be aware that the possibility exists, at least in theory, to link a name to a number as in a classic telephone book, if both name and number records exist for that user.

srkunze commented 9 years ago

What is wrong with a public salt changing every hour?

How is discovery working anyway?

1) My node compiles search query "{phone: hash}".
2) My node asks other nodes for ToxIDs matching my search query.
3) .... now what? How does the network know which nodes to ask efficiently?

aaannndddyyy commented 9 years ago

A time-limited, public salt is still public. And the attacker could also just hash all possible number combinations for this hour, and the next hour ... The salt is publicly known. The attacker knows a name and wants a number. He looks up the name-id record and now has the id. Once he has all the hashes, he queries them through his Sybil nodes, checking whether any returned id matches the id he already has. When he finds a matching record, he knows the number.

Don't get me wrong, I'm not trying to dissuade you. Rather pointing out that there is no 100% solution. But since the user decides, and since this attack is not trivial, I think it is OK.
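
The attack described above can be sketched as follows. All names here (`hourly_salt`, `salted_key`, `publish`, `find_number`) and the salt format are assumptions made for illustration, and a plain dictionary stands in for the records the attacker can query. The point is only that a salt everyone can compute does not stop enumeration.

```python
import hashlib

def hourly_salt(hour_index: int) -> str:
    # Assumption: the salt is derived from a public clock value.
    return f"hour-{hour_index}"

def salted_key(number: str, salt: str) -> str:
    """Hash a phone number together with the public salt."""
    return hashlib.sha256(f"{salt}|{number}".encode("utf-8")).hexdigest()

def publish(records: dict, number: str, tox_id: str, hour_index: int) -> None:
    """Store number -> id under the salted hash for the given hour."""
    records[salted_key(number, hourly_salt(hour_index))] = tox_id

def find_number(records: dict, target_id: str, candidates, hour_index: int):
    """Attacker: hash every candidate with the public salt, compare ids."""
    salt = hourly_salt(hour_index)
    for number in candidates:
        if records.get(salted_key(number, salt)) == target_id:
            return number
    return None

# Toy number space of 100 made-up numbers in a fixed format.
candidates = [f"+1{n:010d}" for n in range(100)]
target_number = candidates[42]

records = {}
publish(records, target_number, "ID_OF_TARGET", hour_index=7)

# The attacker already has the id from a name lookup and wants the number.
hit = find_number(records, "ID_OF_TARGET", candidates, hour_index=7)
```

Changing the salt every hour only means the attacker repeats the loop every hour, at the same cost the honest clients pay to re-publish.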

srkunze commented 9 years ago

Alright. :) I just thought that salting is better than nothing.

About my second question...

Dirius77 commented 9 years ago

I'd assume discovery would work the same way it does when you add someone. Otherwise it could just be a broadcast request: you ask a node, and if it doesn't know, it either sends you a list of nodes to ask, or asks them itself and relays the info back to you.

aaannndddyyy commented 9 years ago

Standard DHT algorithms; there are many different DHTs out in the wild. The basic idea is that a node has an id. You ask the node that is closest to the key you are looking for, and you also publish your keys to the nodes closest to that key. If the queried node does not know the answer, it forwards the query to the closest node in keyspace that it knows. This does not take Tox's onion routing into account; it is just the general DHT idea.
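
A toy sketch of the general DHT idea described above, not of Tox's actual DHT. The XOR distance metric, the SHA-256 derived ids, the `Node` class and the `route`, `publish` and `lookup` helpers are all assumptions for illustration. For simplicity every node knows every other node, whereas a real DHT keeps only a partial routing table.

```python
import hashlib

def key_of(text: str) -> int:
    """Map arbitrary text into the keyspace (here: 256-bit integers)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest(), "big")

def distance(a: int, b: int) -> int:
    # Assumption: XOR metric, as used by Kademlia-style DHTs.
    return a ^ b

class Node:
    def __init__(self, name: str):
        self.name = name
        self.node_id = key_of("node:" + name)
        self.peers = []   # other nodes this node knows about
        self.store = {}   # key -> value records held by this node

    def closest_known(self, key: int) -> "Node":
        """Closest node to `key` among this node and its peers."""
        return min([self] + self.peers, key=lambda n: distance(n.node_id, key))

def route(start: Node, key: int) -> Node:
    """Forward hop by hop until no known node is closer to the key."""
    node = start
    while True:
        nxt = node.closest_known(key)
        if nxt is node:
            return node
        node = nxt

def publish(start: Node, key: int, value: str) -> None:
    route(start, key).store[key] = value

def lookup(start: Node, key: int):
    return route(start, key).store.get(key)

# Build a small fully connected network.
nodes = [Node(f"n{i}") for i in range(8)]
for n in nodes:
    n.peers = [p for p in nodes if p is not n]

key = key_of("name:anton,announcer,")
publish(nodes[0], key, "TOX_ID_PLACEHOLDER")
result = lookup(nodes[5], key)  # a different entry point finds the same record
```

Because publisher and searcher both route toward the same key, they end up at the same node without either of them knowing in advance which node that is.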

srkunze commented 9 years ago

Okay. So, how do we proceed from here?

tttom commented 9 years ago

I would strongly recommend using the DHT that is already there, maintaining two DHTs would just add even more overhead. In general I like the idea put forward in this thread, my only concern being the extra load on the DHT. Many, if not most, users connect via a smartphone to messenger applications. As @Dirius77 pointed out, the issue is not storage. Though certainly network traffic will go up. Every keyword/phone number/e-mail/ToxDNS listing will be like an additional ToxID on the DHT. So there will be perhaps a ten-fold increase in storage requests per node, on average, and I expect a large standard deviation too. Lookup requests will probably not increase that much though.

How to proceed? What about what I proposed last week? Search: "Could this be implemented as follows:"... So far I didn't see any arguments why this wouldn't work, but I am all ears. I even think that most of it can be implemented on the client side, though changes to the core would be required to limit the load and distribute it better over the DHT (as last week, I am referring to network traffic, not storage). Those changes would probably also harden the DHT against potential abuse.

fcore117 commented 9 years ago

tttom: I agree, I do not want more overhead that floods routers and consumes mobile internet bandwidth

srkunze commented 9 years ago

@fcore117 I agree.

@tttom Sounds interesting. I guess those numerous sets of hashed keys are for increasing performance, right? I would say, go for it. We can improve upon it later anyway. :)

aaannndddyyy commented 9 years ago

You would not be permanently querying the DHT. You would periodically recheck. Those who you had not found before may have gotten Tox now, the ones you have may have changed. Though the latter would be less likely if you can back up your profile and have multi-device support. For reduced traffic you'd increase the time the DHT keeps a record.

I am against the free text record. Free text means that it has no standard. Is it first your surname, then given name? Or the other way round? Comma-separated, with semicolons, or with spaces? ... Just stick with a few strictly defined formats. The username one being already very free. Basically this is free text, but people use it as username, so they know that changing the order results in a different name. If you do without hashing, you could even do a fuzzy search. But you are less hidden.

Biggest issue I see is that users would expect the returned id to be correct. But they cannot, as there are no checks. I can publish my tox id under your name and number. If I knew you well, I might even imitate your way of writing and preferred topics. The publishing otoh would be more continuous.

srkunze commented 9 years ago

@aaannndddyyy I agree. So, we could start with the following attributes:

username: .+
full name: .+
location: .+
phone: \+[0-9]+
email: [a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+

Hashing of the data is optional to enable public/private searches.
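
A minimal sketch of how a client could validate these attributes before hashing or publishing them. The `PATTERNS` table and `is_valid` helper are hypothetical names. The patterns are written in Python `re` syntax, so the literal plus sign in the phone pattern is escaped and the letter ranges are spelled `a-zA-Z`.

```python
import re

# One pattern per attribute from the list above.
PATTERNS = {
    "username": re.compile(r".+"),
    "full name": re.compile(r".+"),
    "location": re.compile(r".+"),
    "phone": re.compile(r"\+[0-9]+"),
    "email": re.compile(r"[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+"),
}

def is_valid(attribute: str, value: str) -> bool:
    """True if the attribute is known and the whole value matches its pattern."""
    pattern = PATTERNS.get(attribute)
    return pattern is not None and pattern.fullmatch(value) is not None

print(is_valid("phone", "+491234567"))      # leading plus, digits only
print(is_valid("phone", "0049 1234567"))    # no leading plus, contains a space
print(is_valid("email", "someone@example.org"))
```

Using `fullmatch` rather than `search` matters here: two values that differ only in trailing characters would otherwise both pass validation and still hash to different keys.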

aaannndddyyy commented 9 years ago

what about

full name no place
full name and place

I may only know the name of old school friends but not their usernames. There may be many people with the same name, so location as a discriminator would be helpful. Otoh, do you know where all of your old pals live now?

irungentoo / toxcore

Discovery of Users #1222