minimalparts / PeARS

Archive repository for the PeARS project. Please head over to https://github.com/PeARSearch/PeARS-orchard for the latest version.
MIT License

Prevention from poisoning, forgery, and other attacks #42

Open dumblob opened 8 years ago

dumblob commented 8 years ago

Mainly based on the article http://aurelieherbelot.net/pears/ I suspect PeARS is quite prone to e.g. poisoning of search results across the distributed network.

Also forgery (e.g. "let's promote my web page") seems to be a big topic in this area of distributed "value offerings". Imagine that, thanks to IoT, anyone can be running thousands of virtual machines visiting only certain web pages to heavily influence the queries of others.

Do you have any plans to cope with security and attacks?

So far the solution seems totally open and therefore highly prone to basically any attack, which in turn makes it unusable in the real world :cry:. I really don't want anyone to be able to influence our society (think of e.g. politics) by such trivial technical means.

minimalparts commented 8 years ago

Hi, thanks for the comment! Yes, this is definitely a problem we have to tackle. The way we intend to solve this is very much related to the basic architecture of the system. I'll try to explain the plan... feel free to give feedback :)

One way to build a distributed search engine is to make some nodes 'responsible' for some particular sites or topics, where the allocation is randomly decided. This ensures the coherence of each node, but it is indeed easy to create a new peer that will heavily advertise a particular product or opinion.

The way PeARS works is by having a node signature that reflects the browsing history of a particular (real) user -- or at least the parts of their history that they are happy to share with the rest of the network. Those node signatures are constructed in such a way that -- at least in theory -- they can be checked for 'human-likeness'. So the first thing we would try in order to prevent forgery is to build a simple classifier that checks out new nodes and sends some kind of warning if they don't look like what you'd expect from a human contributor.
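As a rough illustration of the idea (this is not the PeARS implementation -- all function names and the nearest-centroid approach are assumptions for the sketch), such a 'human-likeness' check could be as simple as comparing a new node's signature against the centroid of signatures contributed by known human users:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(vec[i] for vec in vectors) / n for i in range(len(vectors[0]))]

def looks_human(signature, known_human_signatures, threshold=0.5):
    """Flag a new node if its signature sits far from the centroid
    of signatures contributed by known human users."""
    c = centroid(known_human_signatures)
    return cosine(signature, c) >= threshold
```

A real system would presumably use a trained classifier over the full 400-dimensional signatures rather than a single centroid, but the warning mechanism would be analogous: accept nodes above the threshold, warn about the rest.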

This strategy is not foolproof because you could be very clever and spread out your advertising material over thousands of nodes which look otherwise normal, but a) I think the objective would be partly defeated (by having the nasty material drowned in other things); b) additional checks can be performed at the network level to identify weird node distributions.
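To make point (b) concrete, here is a minimal sketch of one possible network-level check (the function name, thresholds, and the near-duplicate heuristic are my assumptions, not anything PeARS ships): if one actor spreads material over many nodes, those nodes may end up with suspiciously similar signatures, which a naive pairwise scan can surface.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def suspicious_clusters(signatures, sim_threshold=0.99, min_cluster=5):
    """Flag nodes that belong to a group of near-identical signatures.
    A large cluster of near-duplicates may indicate one actor
    operating many nodes (a Sybil-style pattern)."""
    flagged = []
    for i, s in enumerate(signatures):
        near = sum(1 for j, t in enumerate(signatures)
                   if j != i and cosine(s, t) >= sim_threshold)
        if near + 1 >= min_cluster:
            flagged.append(i)
    return flagged
```

This O(n²) scan is only illustrative; at network scale one would use approximate nearest-neighbour indexing, but the principle of flagging 'weird node distributions' is the same.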

I don't think the system will ever be 100% secure from this point of view, but the plan is basically to make each user agent as savvy as a person would be in the real world (i.e. you might distrust someone who constantly talks about their own company and nothing else).

dumblob commented 8 years ago

Thank you for such a prompt answer!

The proposed classifier solution is still easily forgeable (also because the classifier will be public and therefore preparing optimal input shouldn't be that hard - it's just plain classifier training :)).

the plan is basically to make each user agent as savvy as a person would be in the real world (i.e. you might distrust someone who constantly talks about their own company and nothing else).

Ok, in this light the proposed classifier solution makes total sense. On the other hand it's a pity that PeARS would then become a very good "replica" of mass media (which are strongly influenced by wealthy actors all over the world), so it wouldn't be useful for general searching, but rather for "social" searching heavily focused on people's attitudes.

Generally the solution might be to make it work e.g. like Wikipedia, i.e. having a minimal base of "trusted people" (selected e.g. on the basis of personal acquaintance, and at the same time on the technical abilities and equipment needed to ensure their systems can't be easily compromised) and then a ranking system of contributors, with control exercised by 3+ highest-ranked contributors chosen randomly. Or something similar -- it's worth studying the ideas behind https://www.ethereum.org/ (especially the principles of applications built on Ethereum at http://dapps.ethercasts.com/). Perhaps it would even make sense to use Ethereum (or an Ethereum-based application) as the basis for trust in PeARS.
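The "3+ highest-ranked contributors chosen randomly" part of the suggestion could be sketched as follows (entirely hypothetical names and scheme -- just a reading of the proposal, not an existing mechanism): restrict the draw to a pool of top-ranked contributors, then sample the review panel at random from that pool.

```python
import random

def review_panel(contributors, ranks, panel_size=3, pool_size=10, seed=None):
    """Pick panel_size reviewers at random from the pool of the
    highest-ranked contributors, so that no single high-ranked
    contributor can predict or guarantee their own selection."""
    rng = random.Random(seed)
    ranked = sorted(contributors, key=lambda c: ranks[c], reverse=True)
    pool = ranked[:pool_size]
    return rng.sample(pool, min(panel_size, len(pool)))
```

Random selection among the top-ranked is what makes collusion harder: an attacker would need to control a large fraction of the pool, not just bribe three known reviewers.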

minimalparts commented 8 years ago

Thanks for the interesting discussion! May I ask for some clarifications? (We don't currently have any security buff on the team, so please forgive me if I say anything stupid...)

The proposed classifier solution is still easily forgeable (also because the classifier will be public and therefore preparing optimal input shouldn't be that hard - it's just plain classifier training :)).

I'm not sure in which way it is that easily forgeable. Each node is basically represented as a 400-dimensional vector, which is a compressed representation of the topics it holds. If the classifier is public, we know in which area of the space that vector should sit to 'look good', but making it look good involves having the right mix of documents on that node anyway. You can of course forge that vector from scratch, but it won't match the document vectors behind it, so the system should send warnings at that point. (Or am I missing something fundamental about the way you would forge this stuff?)
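The consistency check described here could look something like the following sketch (assuming, as the text suggests, that a node's signature is meant to summarise its document vectors -- the aggregation by mean and the threshold are my assumptions, not the actual PeARS code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(doc_vectors):
    """Component-wise mean of the document vectors held by a node."""
    n = len(doc_vectors)
    return [sum(d[i] for d in doc_vectors) / n for i in range(len(doc_vectors[0]))]

def signature_matches_documents(node_vector, doc_vectors, threshold=0.9):
    """Warn (return False) when a claimed node signature does not
    match the aggregate of the document vectors it is supposed to
    summarise -- i.e. when the signature looks forged from scratch."""
    return cosine(node_vector, mean_vector(doc_vectors)) >= threshold
```

The point of the argument is visible in the check: a forged signature placed to 'look good' for the classifier will still fail this comparison unless the attacker also hosts documents whose vectors actually average out to it.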

PeARS will become a very good "replica" of mass media (which are strongly influenced by rich subjects all over the world)

Well, ideally, PeARS should be a replica of what individuals are interested in, and give them equal representation on the network. I agree that a transition process would be needed: if most people hang about on FB and only on FB right now, their nodes will reflect that. But I'm kind of relying on the fact that early adopters might be the people who use more than 3 sites in their daily life, and would therefore positively bias the network, i.e. 'new' people will be led to less corporate sites by the initial crowd and will click on links on those sites. Perhaps I'm over-optimistic, but I think a lot of the problems related to the Web monoculture come from search engines themselves: people click on the first few links in search results and always end up on the same sites. Having a better and unbiased (as in, not biased by corporate interests) search algorithm would counteract this.

Thanks for the ideas and the links, we'll check them out! I was also recently talking to a developer on the Solid project (https://solid.mit.edu/) and they are building decentralised tools to implement trust levels. We're just not quite there yet and haven't given enough consideration to this. But definitely food for thought!

(Btw, unrelated comment, but the 'official' repo for PeARS has now been moved to a new organisation: https://github.com/PeARSearch, so in case you'd like to bring up other issues/comments, could you do it there? Sorry this is a little confusing :-S)