mlpoll / machinematch

Machine learning algorithm to connect anonymous accounts to real names

Should this even be released? Please vote. #1

Open mlpoll opened 7 years ago

mlpoll commented 7 years ago

This repository is currently without any code for a reason. Please comment and vote with thumbs up/down.

What is it?

Anonymous text series in (such as a Reddit or Amazon review account); author's real name out.

MachineMatch utilizes deep learning techniques to analyze blog posts, articles, papers and comments where the identity is known.

The same analysis is used on anonymous posts, such as Reddit comments, (fake) Amazon reviews and anonymous blog posts.

The resulting text analysis is then used to identify who wrote the anonymous post. The principle is similar to that of identifying plagiarism, but with more advanced deep learning techniques.
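
Since the repository is code-free, purely for illustration: a classical stylometry baseline in the same spirit (an assumed stand-in, not MachineMatch's actual architecture) is character n-gram TF-IDF features feeding a linear classifier over the known authors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: a few documents per known author (placeholder data).
known_texts = [
    "I reckon the weather up north has been dreadful lately.",
    "Dreadful weather again. Reckon I'll stay in and read.",
    "lol ok so basically the new patch is totally broken??",
    "ok the update is broken again lol, totally unplayable",
]
known_authors = ["alice", "alice", "bob", "bob"]

# Character n-grams capture punctuation, casing and spelling habits.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4), lowercase=False),
    LogisticRegression(max_iter=1000),
)
model.fit(known_texts, known_authors)

# Rank candidate authors for an anonymous post.
anonymous_post = "reckon the weather will be dreadful again lol"
probs = model.predict_proba([anonymous_post])[0]
print(sorted(zip(model.classes_, probs), key=lambda p: -p[1]))
```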

How good is it?

With a well-trained network, the accuracy is remarkable: >95% on my test input. Even when people write a bit differently when posting anonymously, the matching is very accurate if enough text is provided (longtime redditors especially leak enough information about themselves to make manual verification quite easy!).

However

MachineMatch, if ever released, should not be used to expose people. Even though it's accurate, it's not 100%. I'm not even sure there are any ethical use cases. One exception might be to identify fake reviews, which is a nasty business these days. Feedback appreciated.

And, bah, why are you posting this anonymously? MachineMatch will have you!

Actually no, because I have no online presence using my real name. At all.

Why release

It's just a matter of time before someone releases such a tool anyway. Law enforcement probably possess such tools already.

It's not that hard to do and it's potentially a very popular service. By being first, you can at least try to set the standard in terms of ethics.

The very existence of such a tool may also discourage trolls and encourage people to think twice before posting fake reviews and libel.

Why not release

The flip side is of course that anonymity is a democratic tool.

I hope, given the existence of such a tool, someone makes an "identity obfuscation tool" to rephrase, depunctuate, reword and remove identifying information enough to significantly reduce the accuracy of matching tools. The optimal scenario is someone implementing such a tool before releasing MachineMatch.

This is actually a very hard problem, as people make a lot of posts under the same anonymous account handle. They're bound to leak information: areas of interest, location/weather, events happening around them, time zone, daily routines, recurring spelling and grammar mistakes, family matters, and the list goes on and on. Digital fingerprinting is easy.
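
For what it's worth, the surface-level half of such an obfuscation tool is easy to sketch (a toy normalizer for illustration, not an existing project). It can strip punctuation and spelling tells, but the content leakage described above is the part no rewording fixes.

```python
import re

# Example misspelling fixes; a real tool would use a proper spellchecker.
COMMON_FIXES = {"teh": "the", "recieve": "receive", "alot": "a lot"}

def deauthor(text: str) -> str:
    """Normalize away surface-level stylometric tells."""
    text = text.lower()                    # erase capitalization habits
    text = re.sub(r"[!?]+", ".", text)     # flatten emphatic punctuation
    text = re.sub(r"\.{2,}", ".", text)    # collapse ellipses
    text = re.sub(r"[;:]", ",", text)      # flatten clause punctuation
    words = [COMMON_FIXES.get(w, w) for w in text.split()]
    return " ".join(words)                 # normalized whitespace too
```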

Reubend commented 7 years ago

I think that you should release it, and my reasoning is based on this: the code itself is not as important as the idea, which you have already published. The concept of identifying an anonymous writer using deep learning is both frightening and fascinating, and I'm glad that you posted this. However, a malicious person with enough time could easily build an alternative implementation given the description you posted here.

krisnova commented 7 years ago

I think this should be released. It will encourage engineers to be better, whether they seek true anonymity or seek out their own algorithms for improving their product. Also, as an avid open source software engineer, I think something like this could be valuable to the community.

Bottom line: always open-source potentially dangerous code. It makes the internet better.

michaeljs1990 commented 7 years ago

All moral issues aside, if you don't release it now, someone else will in the future. It would also be a good way to improve security for people who want to stay anonymous, since they could check whether something they posted secretly tracks back to them.

rbong commented 7 years ago

YES, this should be released. The sooner it is released, the sooner people can start finding ways to protect themselves.

leon-wbr commented 7 years ago

Could this be turned around to seek out the parts of your text which are identifiable?

mlpoll commented 7 years ago

Could this be turned around to seek out the parts of your text which are identifiable?

This is a very insightful question. It's not trivial. Part of how the machine learning algorithm works is to leave it to the network to figure out the metrics. The problem is similar to the question "which metric does the Facebook trained network use to identify faces?" The answer is, they don't know.
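
That said, a black-box workaround exists even without knowing the learned metrics: ablate one span at a time and watch the model's confidence move. A minimal sketch, where `predict_author_proba` is a hypothetical wrapper around whatever trained model you have:

```python
def identifying_sentences(text, author, predict_author_proba):
    """Rank sentences by how much removing each one lowers the
    model's confidence that `author` wrote `text`."""
    sentences = text.split(". ")
    baseline = predict_author_proba(text, author)
    scores = []
    for i in range(len(sentences)):
        ablated = ". ".join(sentences[:i] + sentences[i + 1:])
        drop = baseline - predict_author_proba(ablated, author)
        scores.append((drop, sentences[i]))  # big drop => identifying
    return sorted(scores, reverse=True)
```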

mikehearn commented 7 years ago

OK, I'll play the devil's advocate. The arguments so far are:

  1. Do it because someone else will do it anyway, so it is inevitable.
  2. Do it because people will be able to find ways to protect themselves.

I find neither argument persuasive.

The first is a common fallacy amongst engineers, which is to assume an infinite quantity of infinitely skilled and infinitely motivated attackers, thus making any imaginable attack inevitable and unavoidable. This is what I call the God threat model, and it's not actually reasonable. I used to work for Google on anti-spam and anti-hacking. This was an eye-opening experience. Our biggest wins often came from exploiting simple mistakes by attackers that advanced programmers would not have made, but that was OK because virtually all adversaries were neither especially advanced nor especially motivated. Simple tricks that other engineers wrote off as "that will last five minutes" ended up lasting years ... or more. That's because the people with the advanced skills could get well-paying, morally satisfying jobs in industry. There was no need for them to achieve fame or money by dicking over other people.

So. The number of programmers in the world is quite large. The number of programmers in the world with cutting-edge machine learning and big data experience is not large at all, hence the current hiring frenzy in the Valley to try and acquire them all. One day writing an app like this might be as trivial as writing "My first to-do list", but we're not there yet, not even with the latest APIs. Until that day comes, the number of people who combine those rare skills with the motivation to build something like this and then give it away for free is likely to be much smaller than you might think. If you're talking about a user-friendly version that anyone could use, shrink the candidate pool by another 10 or 100x.

There is no guarantee that if you pass on releasing the code, an identical or even better version will appear tomorrow. That might happen, but equally it might not. So why have it on your shoulders?

The second argument is weaker still, for three reasons:

  1. Hardly anyone will know this tool exists.
  2. Of those who know, most will not know how to adjust their writing to evade it, and that is unlikely to be easy to learn, or for a different tool to come along and help with.
  3. Of those who somehow overcome those two barriers, most will not be able to find and edit all their old posts to re-scramble them.

Thus releasing your tool is unlikely to have any real protective effect even amongst people who know of its existence, which 99.99% of people on the internet won't.

Hence, you face a simple moral choice: release a tool which you admit has no ethical uses, and be That Guy who enables a whole lot of drama and potentially even misery. Or don't, and let someone else be That Guy... which may or may not ever happen.

EthanoicPromethium commented 7 years ago

Releasing it is the better decision. Here are my reasons:

If you oppose open-sourcing this, please explain the threat it would pose to citizens and journalists. Keep in mind, though, that most state agencies will already have such a system in place.

manly commented 7 years ago

Say you have made the first OCR, ever. Many people see a use for it. But then someone realizes that if, instead of training it on characters, you train it on faces, you've just released code for a crude face recognition algorithm. It all depends on what you train it with.

I think you're framing what it can do too strictly. The way I see it, you have a clustering algorithm that works on text ensembles. I can imagine many uses for that. For example, it could be used for:

In any case, the part about training on documents from identified people is kind of arbitrary. The core is just a document classifier, and hardly the first one. Karpathy has released source for searching across arXiv and finding nearest neighbors by similarity, which sounds a whole lot like what you have here. It's all about what you train it on; in this case, multiple documents loosely define a style to learn, which you use to extract identity. Train it on something else, and it's perfectly within good uses, with no bending of ethics or morals.

edit: https://github.com/karpathy/arxiv-sanity-preserver finds similar documents by tf-idf; that's essentially a classifier, which is what you have here.

edit2: upon further thought, you could also implement the same algorithm with minimal change starting from Bayesian spam filtering code. A Bayesian filter could easily be reworked into a classifier that has more than 2 outputs (typically: spam/not spam), giving you a most likely candidate after applying a softmax() to the resulting outputs.
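
For the curious, that rework is nearly a one-liner with scikit-learn (a sketch of the suggestion above, not MachineMatch itself). `predict_proba` already returns normalized per-author probabilities, so no explicit softmax step is even needed:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same shape of data as the earlier sketch: labeled texts per author.
known_texts = ["reckon the weather is dreadful", "lol the patch is broken"]
known_authors = ["alice", "bob"]

nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(known_texts, known_authors)

# Per-author probabilities for an unseen post.
print(nb.predict_proba(["the weather is broken lol"])[0])
```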

igorbrigadir commented 7 years ago

With a well trained network, the accuracy is remarkable, > 95%

Interested to know more details: what exactly is in the training and test sets, and how does it compare to other Author Identification approaches?

If user privacy is a concern, you can still release code & omit the trained models.

palkeo commented 7 years ago

This is already a well-studied (and very interesting) subject: https://en.wikipedia.org/wiki/Stylometry
You can already find tons of papers on the subject, and software: https://github.com/jpotts18/stylometry

Have you compared your results to the state of the art on a known dataset?

So I think this is a non-issue, given everything that already exists on the subject. Even if you are better than the state of the art, this is nothing new.

mapguy109 commented 7 years ago

"Half the electronic engineers in the galaxy are constantly trying to find fresh ways of jamming the signals generated by the Thumb, while the other half are constantly trying to find fresh ways of jamming the jamming signals."

Encourage the development of jammers for the jamming signals by giving anons something to work off of...

aminorex commented 7 years ago

One very important use is the identification of the corporate lobbyists who wrote bills before Congress.

WebmasterGrumpy commented 7 years ago

Can you run your tool on Shakespeare and determine who wrote what?

jacobwgillespie commented 7 years ago

As you mentioned, law enforcement and state actors likely already have access to this kind of technology. Not all of those actors are benevolent. Open-sourcing this helps level the playing field and allows the open source community to take part in the discussion and shape the future that is already here.

tl;dr - release it

3even commented 7 years ago

Release it.

matthewschallenkamp commented 7 years ago

It's not that hard to do and it's potentially a very popular service. By being first, you can at least try to set the standard in terms of ethics.

This is the exact reason you ought to release this. As the owner of the repository and the first to publish you will get to set the standard and the precedent of ethical uses of this technology. If you don't, it is likely that whoever eventually does will be far less ethically minded than you have shown yourself to be.

Beyond that, the release and publicity surrounding this (however little) will help prompt the development of software that can defeat this technology, thereby helping remove some of the major ethical concerns that you rightfully have regarding the removal of anonymity.

wltsmrz commented 7 years ago

I think it would be cool to have a complementary tool that produces disguised text from the input, which the algorithm would not tag as being authored by you. A "de-authoring" tool.

Just release both and everyone is happy :)

benoit-canet commented 7 years ago

Fake reviews are a plague. Please release it.

zbrdge commented 7 years ago

Much like any publicly available "offensive security" tool, the benefits of being able to assess one's own risk profile with respect to the attack(s) it enables probably outweigh the hazards of its release.

formvoltron commented 7 years ago

Don't release it. Sell it as a service.

Massive companies are built on top of free software & those companies are only getting bigger & stronger. Don't enable them. Make money & do battle.

bradydowling commented 7 years ago

By nature, programmers are curious, so I don't know if a vote is the most effective way of settling this (the audience you present the vote to is quite biased). I'm sure you're aware of that, though, and are weighing the arguments used in each case.

That said, someone else potentially coming up with the same thing in the future is not a good reason for releasing it. Moral decisions often involve situations where you must stand alone even in circumstances when it seems like it doesn't matter.

rscircus commented 7 years ago

A difficult set of questions. Maybe you are endangering life here.

There would be the possibility to:

@palkeo mentioned in https://github.com/mlpoll/machinematch/issues/1#issuecomment-236037043:

Stylometry:

Stylometry grew out of earlier techniques of analyzing texts for evidence of authenticity, authorial identity, and other questions.

where

Modern stylometry draws heavily on the aid of computers for statistical analysis, artificial intelligence and access to the growing corpus of texts available via the Internet.

So, kudos for your ethical move here, as it seems you were unaware of Stylometry (as I was, too, until a few minutes ago).

Still, the procedure above probably makes sense, as an open source project lowers the barrier to entry.

wrq commented 7 years ago

ABSOLUTELY. Please release this! I really want to improve on it, and use it for counter-terrorism purposes.

webmasterraj commented 7 years ago

"Someone's going to do it anyway" is a terrible argument from moral grounds (which I think this poll is about). It reduces everyone's behavior to the lowest common denominator level -- in game theory terms, if that premise is true, everyone should act at the same level of the person willing to do the most harm.

It's not even guaranteed that someone else actually will release this. I agree with mikehearn on this point. Too often, the belief that someone else will do it rests on a blanket assumption that anything that can be done will be done.

If none of that convinces you, let me ask one last question: what was the false positive rate in your test set? Even worse than doxxing someone is incorrectly doxxing someone. Can you imagine waking up and finding out the Internet is convinced you are the moderator of r/pedophilia? Even with a low single-digit FPR, that could be a lot of people's lives you're affecting.

Don't do it. Don't be that guy.
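
To make the base-rate point concrete, here is a back-of-the-envelope sketch (all numbers assumed for illustration; the repo publishes no false positive rate):

```python
# Scanning a million candidate accounts for one target author.
candidates = 1_000_000
tpr = 0.95   # the claimed sensitivity
fpr = 0.01   # an assumed "low single digit" false positive rate

true_matches = 1 * tpr                   # the one real author, usually found
false_matches = (candidates - 1) * fpr   # innocents incorrectly flagged

precision = true_matches / (true_matches + false_matches)
print(f"false matches: {false_matches:,.0f}")                 # ~10,000 people
print(f"P(flagged account is the author): {precision:.5f}")   # ~0.0001
```

In other words, at internet scale almost every "match" would be an innocent account.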

wrq commented 7 years ago

Yeah, but 95% of the time, we'd be able to find that pedophile moderator. Could you look those victims in the eyes and tell them that you weren't willing to let the dogs loose because you were afraid they might bite the wrong guy? Could you look a bunch of kids in the eyes and tell them that saving them wasn't worth the risk of not saving them?

If this thing works 51% of the time, it's absolutely worth it.

ghost commented 7 years ago

Law enforcement probably possess such tools already.

This. Companies dedicated to mining personal data, too (Google, Facebook, et al). The level of sophistication of those tools must be way beyond any privacy advocate's worst nightmares.

Good obfuscation can only be achieved by testing results against a good adversary. This tool can take that role. Please release it.

jkinz commented 7 years ago

I've found that the same identification can be done manually using Google. It's certainly more effort, but still fairly easy, even down to tracking spelling and typos.

So if you release the automated stalking tool, you make it easier to de-anonymize some people, but it's likely they were already unmaskable.

IreneKnapp commented 7 years ago

Fundamentally, the world needs a little more time to prepare. There's at least a chance that this will serve as a wake-up call to get people working on tools to counter it. I mean, I'm dubious of that as I say it, but...

acgh213 commented 7 years ago

Someone else will release something like it within the next few years, so why not release it now? I don't see why this shouldn't be released. Once it's out there, it's out there. People can take it, improve on it, or let it die off. That's how these projects work. Someone may find it useful; others may see it as an annoyance.

But since you asked "Should this even be released?", I think the answer is yes. You should release it, and the doubt will subside. Releasing this to the public could be very beneficial, or it could just be added to the already extensive list of tools people use to doxx others. Since other means of doing something like this exist, I don't see a reason why you wouldn't release it.

vertis commented 7 years ago

@JimmyRowland I'm going to hazard a guess that you've not actually spent any time in China or you wouldn't make stupid statements like that. That and your complete lack of your own projects on GitHub mean that you're completely unqualified to comment.

niarbeht commented 7 years ago

If it already exists in darkness (three-letter law enforcement or any other government agency with craptastic ethics/morals), then it should exist in light, too (publicly-available code). The reason is that you can't fight against something that isn't understood.

vertis commented 7 years ago

@JimmyRowland I know and work with plenty of developers in China who do original work and are quite capable of doing deep learning work. Suggesting that the Chinese government would not have access to similar people is just incorrect.

rzj commented 7 years ago

Are you able to release more information on the performance of the algorithm? Or at least more information on how you're doing the training/validation?

I'm an active stylometry researcher, and tbh, >95% on a test set sounds incredibly high for internet-scale applications. The recent MegaFace competition illustrates the difficulty of scaling these kinds of techniques to a large number of classes (in that case faces, in this case authors).

As a scientist, I naturally lean towards 'release', but some kind of validation would be informative. Email me to discuss further if interested; I have some ideas for external validation methods.

dratlas commented 7 years ago

Following on from rzj's comment above, a major consideration is the scale of which this is capable. There are already some automated stylometry tools online (for example: http://www.aicbt.com/authorship-attribution/online-software/) but they're mostly at the level of "check author X against author A". A mass tool would be significantly different.

As for obfuscation, have you tried this with deliberate obfuscation? Not knowing exactly why a neural network produces the results it produces is not quite the same as not knowing at all what it could possibly be working with. As others have pointed out, stylometry is a research field, and the elements that distinguish authors are not unknown. It would be worth checking your algorithm against known authors using known stylometric changes to see whether it's robust against those. If not, then defence by normal stylometric obfuscation is possibly adequate, and has the advantage of defending against all stylometric analysis.
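
A minimal version of that robustness check might look like the following sketch, where `clf` and `vec` stand for whatever trained classifier and vectorizer you already have (hypothetical names):

```python
import re

def perturb(text: str) -> str:
    """Apply crude, known stylometric changes."""
    text = text.lower()                  # erase capitalization habits
    text = re.sub(r"[;:]", ",", text)    # flatten punctuation choices
    text = re.sub(r"\s+", " ", text)     # normalize spacing
    return text

def accuracy_drop(clf, vec, texts, labels):
    """How much accuracy is lost when every test text is perturbed?"""
    base = clf.score(vec.transform(texts), labels)
    pert = clf.score(vec.transform([perturb(t) for t in texts]), labels)
    return base - pert
```

If accuracy collapses under such trivial edits, ordinary stylometric obfuscation is probably an adequate defence.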

igorbrigadir commented 7 years ago

For anyone interested in this kind of thing, have a look at the PAN workshops on 'Uncovering Plagiarism, Authorship and Social Software Misuse' http://pan.webis.de/tasks.html

There are implementations curated by @pan-webis-de here: https://github.com/pan-webis-de & all the papers describing the systems here: http://pan.webis.de/publications.html

The best performing system (in Author Identification) in 2015 was by @douglasbagnall https://arxiv.org/pdf/1506.04891.pdf and used a character-level neural network language model, code here: https://github.com/pan-webis-de/caravel

"deep learning techniques" used by MachineMatch is vague, so it's hard to say if there's any similarity with the approaches, and the sets of documents used are different, so i'd love to see how both perform on the same task with the same data.

Also:

I hope, given the existence of such a tool, someone makes an "identity obfuscation tool"

Check out this task: http://pan.webis.de/clef16/pan16-web/author-obfuscation.html

josephrocca commented 7 years ago

More info on the >95% accuracy would be great, as others have suggested. Saudi Arabia is regularly unmasking and killing anonymous bloggers/tweeters/etc. who are atheist, gay, etc. I'm not sure whether you should release it or not, but if you do, and your software is as accurate as you say and easy to set up and use, then you're probably going to cause some good people a whole lot of trouble. Perhaps you should talk to some human rights organisations and people like @igorbrigadir - hopefully they have thought about this sort of thing and have good policies.

derram commented 7 years ago

Can't wait for the admins to explain how this doesn't have anything to do with online harassment, after labeling repositories used to auto-archive posts as such.

douglasbagnall commented 7 years ago

Without any information about the actual tests, it is impossible to know whether the claimed 95% accuracy is unimpressive, implausible, or somewhere in between.

If the “95% accuracy” is distinguishing which of two possible authors wrote a rather long text, the claim is unremarkable. But if it refers to picking the author of a short text from a language community of hundreds of millions, the claim is ludicrous. Between those extremes there are obviously tasks where something you could call "95%" would be an interesting result. As @igorbrigadir mentions, independent verification via something like PAN is useful. Entering the competition does not entail publishing your code if that worries you.

As to the actual question: it seems very unlikely that you will have discovered anything that is orders of magnitude better than state powers already have (especially, I'm sorry to say, if your novelty claim is "deep learning techniques"). The identifying information in writing style is actually quite limited. You can't magically extract more. In all likelihood we are further from the asymptotic limit than we are on problems like text compression, but the limit is plainly there. All improvements are now incremental. It doesn't really matter.

terraboops commented 7 years ago

State powers and clever hackers already have tools like this. Releasing this project will mean that more amateurs will have access, including those who wish to evade tools like this (in the name of good or evil). Those people will be able to use tools like this - and those that follow it - to test their anonymity and refine their techniques.

At first, I was against this. After some thought and discussing it with a person who has been threatened online many times due to their beliefs, I think it should be released.

This is a codified technique for violating privacy. If released, it can be used for evil -- but it can also be used for good. If not released, it will still be sought (and likely used) by evil people for evil things.

jcnewell commented 7 years ago

You should release it - if only to allow people to develop and evaluate software which provides countermeasures.

mdosseva commented 6 years ago

So whatever happened with this, anyhow? Votes say yes.