giulionf / realtweetornotbot

Scans screenshots of tweets on reddit and links the corresponding tweet
https://www.reddit.com/u/realtweetornotbot
MIT License

Train a Neural Network to detect if an image is a tweet #57

Closed giulionf closed 5 years ago

Shivam60 commented 5 years ago

Any dataset or any idea how to create such a dataset?

giulionf commented 5 years ago

I have written code to crawl the dataset. I can offer a package with 20,000 Twitter screenshots and 20,000 normal "meme" images as negatives. If needed, more could be crawled! Would that be sufficient?

Shivam60 commented 5 years ago

I think that it is more than enough. Could you mention the image size and maybe show an example of each class?

giulionf commented 5 years ago

Image sizes are variable, so the solution would require downscaling or variable-size input!

Examples for the tweet class: [two example tweet screenshots attached]

Examples for the non-tweet class: [two example meme images attached]

Shivam60 commented 5 years ago

Okay, I am interested in solving this. Can you share the dataset?

giulionf commented 5 years ago

Sure, I will upload the data and send it to you once it's done! Just remember that this is not an open dataset and cannot be used for anything but this issue without my permission!

Shivam60 commented 5 years ago

Cool, got it. Should I push my work to this repo?

giulionf commented 5 years ago

@Shivam60 I'm thinking about separating it into a second repository together with the rest of the code (the dataset generator). Or do you want to do it via a pull request? I think it would be cool to have it usable as a separate Python module that the bot includes!

Shivam60 commented 5 years ago

Well, trying a ResNet didn't help. Any ideas?

giulionf commented 5 years ago

Do you have conv-layers in the net? I just thought about it... Maybe experimenting with the kernel size would yield better results? What exactly did you try?

I remember when I tried this I used Keras with 2 conv layers of size 64 and 32... Maybe we should think about how we preprocess the images... How did you deal with the variable input sizes in your approach?
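
For reference, a minimal sketch of that kind of Keras model (the two conv layers with 64 and 32 filters are as mentioned above; the input size, kernel sizes and dense head are assumptions, not the original code):

```python
# Minimal sketch of a small Keras CNN for tweet / non-tweet classification.
# Filter counts (64, 32) follow the comment above; the 128x128 input size,
# kernel sizes and dense head are assumptions.
from tensorflow.keras import layers, models

def build_model(input_shape=(128, 128, 3)):
    model = models.Sequential([
        layers.Conv2D(64, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # tweet vs. non-tweet
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```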

Shivam60 commented 5 years ago

Well, ResNet50 requires a 224 x 224 input shape, so I set the target size in my image generator accordingly; all images were resized to it. I normalized the dataset but still didn't get much accuracy. I suspect that CNNs look for patterns which repeat themselves, and in these images there are none. Could we try to extract text from the images and search for it on Twitter?
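
For illustration, a rough sketch of the setup described above (the directory layout, batch size and frozen backbone are assumptions, not the actual training script):

```python
# Sketch of the ResNet50 approach: images resized to 224x224 via the
# generator and fed to a pretrained backbone with a small binary head.
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, models

datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)
train_gen = datagen.flow_from_directory(
    "dataset/",                # hypothetical folder with tweet/ and meme/ subdirs
    target_size=(224, 224),    # ResNet50's expected input size
    batch_size=32,
    class_mode="binary",
    subset="training",
)

backbone = ResNet50(weights="imagenet", include_top=False,
                    pooling="avg", input_shape=(224, 224, 3))
backbone.trainable = False     # train only the new classification head

model = models.Sequential([
    backbone,
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_gen, epochs=5)
```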

giulionf commented 5 years ago

That's what I'm doing atm. I'm extracting text using tesseract, which is the bottleneck of the application. The bot runs on Heroku, which only allows 512 MB of RAM, so I had to put a lock on the tesseract operation. The whole thing could be severely sped up by a NN that detects whether the image is a tweet before doing OCR...
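
A minimal sketch of that gating idea, assuming pytesseract and a hypothetical `is_tweet` classifier (not the bot's actual code):

```python
# Run OCR only when a (hypothetical) classifier says the image looks like a
# tweet, and serialize tesseract calls behind a lock to stay within the
# 512 MB dyno limit. Function names are illustrative assumptions.
import threading
import pytesseract
from PIL import Image

ocr_lock = threading.Lock()

def extract_text_if_tweet(path, is_tweet):
    """Run tesseract on the image at `path` only if `is_tweet(image)` is True."""
    image = Image.open(path)
    if not is_tweet(image):        # cheap NN check first
        return None
    with ocr_lock:                 # only one OCR job at a time
        return pytesseract.image_to_string(image)
```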

I had the same results as you did with my CNN approach... I guess my problem was that the aspect ratio of the tweet was not kept, and thus the convolutional layers could not pick up things like the circle of the avatar or similar features...
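
One way to keep the aspect ratio would be to letterbox the images instead of stretching them; a small sketch (the 224 x 224 target and the white padding colour are assumptions):

```python
# Scale the image to fit inside a square canvas and pad the rest,
# instead of stretching it and losing the aspect ratio.
from PIL import Image

def letterbox(image, size=224, fill=(255, 255, 255)):
    """Resize `image` to fit inside a size x size square, padding the rest."""
    image = image.convert("RGB")
    image.thumbnail((size, size))            # in-place, keeps aspect ratio
    canvas = Image.new("RGB", (size, size), fill)
    offset = ((size - image.width) // 2, (size - image.height) // 2)
    canvas.paste(image, offset)
    return canvas
```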

giulionf commented 5 years ago

Issue closed since it seems too complex for what it's worth.