openai / requests-for-research

A living collection of deep learning problems
https://openai.com/requests-for-research

Questions on the "Spam the Spammers" Problem #12

Closed tra38 closed 8 years ago

tra38 commented 8 years ago

I don't know where I should put this. It's not a PR because I haven't built anything yet. I want to clarify some points first and lay out my thoughts and procedures before I decide whether to actually build this thing. It is possible that I am completely missing the point of this challenge.

So my questions are:

  1. Why should machine learning be seen as an appropriate tool for this problem? After all, spammers are able to write up a corpus of comments that is effective at fooling a few bloggers. Why not just write up a corpus of prepared emails ("that's a good email, but I have one minor question...") that can be sent instead? Relying on a machine learning algorithm means you have to teach the algorithm how to write, which seems like a rather time-consuming (and possibly error-prone) process. The "Request For Research" links to the paper Exploring the Limits of Language Modeling, which generated pretty good sentences (so long as you exclude the "short and politically incorrect" sentences)...but there doesn't seem to be any sort of larger-scale organization involved in sentence generation. This is...problematic. Without any larger-scale organization, it's just a bunch of random sentences that seem rather useless. I think this issue could probably be alleviated by writing a corpus of email replies and then training the ML algorithm on that corpus, but I'm afraid that the results may not be very pretty...and since you lack large-scale structure, I feel that the resulting emails would be very incoherent babble. If you could convince the ML algorithm to generate random sentences that appear to fit a certain "theme" ("I am interested in your topic, tell me more!"), then it might work, but this would still be difficult. TL;DR: Using ML algorithms on this problem seems to be equivalent to using a toothpick to dig a hole in the ground, especially when people have already discovered shovels.
  2. Would a complete reliance on apophenia count as a "new idea" that is used to solve this problem? Apophenia is the tendency of humans to see patterns where none exist. The goal would be to produce an email reply that appears to respond to the spammer's email, so that the spammer would assume that the email is indeed legitimate...even though there is no real reason to assume so. Taking advantage of apophenia is a useful way to help machine-generated text go farther, but it doesn't really help this organization succeed in producing 'smart' machines (after all, the machine doesn't know what it's writing, much less what it's responding to).

Don't get me wrong. I am somewhat interested in this problem, and I think it's a solvable one, but I think the solutions I have in mind may not be the solutions you would prefer to see.

gdb commented 8 years ago

cc/ @wojzaremba (author of that problem!)

wojzaremba commented 8 years ago

@tra38 Thank you for your comment. Most of the points you made are extremely valid.

  1. Spammers need to communicate a given, specific message. It's difficult to generate a large amount of text that is fully coherent and that would convey the spammer's advertisement. This problem is extremely difficult, and is close in complexity to building a full-fledged dialog system.

However, it's easy to generate meaningless messages that look reasonable at first glance. That's what language models do. It's hard to automatically detect such messages, because they match the distribution of real emails (you can even use adversarial networks to generate such emails).
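To make the "meaningless but plausible" point concrete, here is a minimal sketch using a bigram Markov chain. This is a toy stand-in chosen for brevity, far simpler than the neural models in the paper linked from the request, and the four-sentence `corpus` is invented purely for illustration. Each generated word is only conditioned on the previous word, so the output looks locally like the training emails but has no overall intent.

```python
import random
from collections import defaultdict

def train_bigram_model(sentences):
    """Map each word to the list of words observed right after it."""
    model = defaultdict(list)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model

def generate(model, rng, max_words=20):
    """Sample a sentence by walking the bigram table from the start token."""
    word, output = "<s>", []
    while len(output) < max_words:
        word = rng.choice(model[word])
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

# Hypothetical training corpus (an assumption, not data from this thread).
corpus = [
    "thank you for your email",
    "thank you for the offer",
    "I have one question about your offer",
    "please tell me more about the offer",
]
model = train_bigram_model(corpus)
rng = random.Random(0)  # fixed seed so runs are repeatable
reply = generate(model, rng)
print(reply)
```

Every adjacent word pair in the output was seen in the corpus, which is roughly what "matching the distribution" means at this tiny scale. A real attempt would need a much stronger model and far more data.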

  2. I believe that techniques that rely on data and statistical machine learning are the way to defeat spammers, rather than to hit them once. A hard-coded solution has to be constantly updated, because spammers will automatically filter such emails. Only emails that cannot be easily recognized will require their labour, and can lead to their final defeat. That would mean that the email has to show all the signs of being original: its text, its message, its email address, and the IP address of its server.
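The argument that hard-coded replies are easy to filter can be sketched as follows. The `SpammerFilter` class is hypothetical (nothing in this thread describes how spammers actually filter), and it assumes the simplest possible strategy: remember a hash of every reply seen so far and ignore repeats. The two canned replies are the examples quoted earlier in the thread.

```python
import hashlib
import random

CANNED_REPLIES = [
    "That's a good email, but I have one minor question...",
    "I am interested in your topic, tell me more!",
]

def fingerprint(text):
    """Normalize case and whitespace, then hash, so trivial edits still match."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class SpammerFilter:
    """Hypothetical spammer-side filter: ignore any reply already seen once."""

    def __init__(self):
        self.seen = set()

    def needs_human_attention(self, reply):
        digest = fingerprint(reply)
        if digest in self.seen:
            return False  # known canned reply, costs the spammer nothing
        self.seen.add(digest)
        return True

rng = random.Random(0)
spam_filter = SpammerFilter()
attention = [
    spam_filter.needs_human_attention(rng.choice(CANNED_REPLIES))
    for _ in range(100)
]
# Number of replies (out of 100) that cost the spammer any human time.
print(sum(attention))
```

Under this assumption, a fixed corpus can waste the spammer's time at most once per distinct reply, no matter how many emails are sent. That is why replies that cannot be cheaply recognized matter more than the sheer volume sent.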

tra38 commented 8 years ago

@wojzaremba, thanks for your reply. Knowing that the goal is to produce an email with apparent sophistication rather than one that is fully coherent, I can better understand what you're aiming for. Using adversarial networks could probably work in that regard (though trying to apply them to text seems like a difficult problem in and of itself).

I assume that, since you have retracted the Request for Research (due to its complexity and reliance on other fields), this proposal is dead in the water. But I suppose it's still useful to think about what is already possible with current technologies.