Image capturing - avoid ads [low priority]

rkorach commented 12 years ago

Hi,

When looking for the best image on a page, we have to avoid ads at all costs. As soon as we think we have a good candidate image, we could look for specific words in its URL that indicate it is an ad. Examples: 'ad', 'ads', 'advert', 'advertisement', 'advertiser', 'annonce', 'publicité', 'adsense', 'promoted', etc. I think there are also some predetermined ad sizes that would give us a good hint. (Or we could look at AdBlock's code, if it is not somehow protected.)
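A minimal sketch of the ad-size idea, assuming we already know the image's pixel dimensions. The list holds a few widely used banner formats and is illustrative, not exhaustive; `COMMON_AD_SIZES` and `hasAdSize` are hypothetical names that do not exist in the project.

```javascript
// A few common banner formats (width x height, in pixels). Illustrative only.
var COMMON_AD_SIZES = [
  '728x90',   // leaderboard
  '300x250',  // medium rectangle
  '336x280',  // large rectangle
  '160x600',  // wide skyscraper
  '120x600',  // skyscraper
  '468x60',   // full banner
  '300x600',  // half page
  '320x50'    // mobile banner
];

// Returns true when the dimensions exactly match one of the formats above.
function hasAdSize(width, height) {
  return COMMON_AD_SIZES.indexOf(width + 'x' + height) !== -1;
}
```

An exact match is only a hint: a legitimate image can happen to be 300x250, so this is probably better used to lower a candidate's score than to reject it outright.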

fwouts commented 12 years ago

Good point.

Do you want to implement this, as an exercise? :)

A few hints:

The code should be added around line 206 of behavior/passive/parser.js: https://github.com/fwouts/SubZoom-Proto1/blob/master/behavior/passive/parser.js#L206
You can use the src variable, already filled, to analyze the URL.
You can use src.indexOf('what I am searching for') which will return -1 if 'what I am searching for' is not present (or you can use regular expressions, which are a bit more complicated to learn but much much more powerful).
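To make the two options concrete, here is a rough sketch of the keyword check, using the word list from the first comment. A plain `src.indexOf('ad')` also matches harmless URLs such as `header.jpg` or `uploads/`, so the sketch uses a regular expression that only accepts a keyword when it is a whole "word" in the URL. `AD_KEYWORDS`, `AD_PATTERN` and `looksLikeAd` are hypothetical names, and 'publicite' (without the accent) is an added assumption about how the word would appear in a URL.

```javascript
// Keywords from the issue description, plus an unaccented 'publicite' (assumption).
var AD_KEYWORDS = ['ad', 'ads', 'advert', 'advertisement', 'advertiser',
                   'annonce', 'publicité', 'publicite', 'adsense', 'promoted'];

// A keyword only counts when it is delimited on both sides by a character
// that is not a letter or digit (or by the start/end of the URL).
var AD_PATTERN = new RegExp(
  '(^|[^a-z0-9])(' + AD_KEYWORDS.join('|') + ')([^a-z0-9]|$)', 'i');

// src is the image URL, as in the existing src variable.
function looksLikeAd(src) {
  return AD_PATTERN.test(src);
}
```

With this pattern, `http://example.com/ads/banner.png` and `http://example.com/img/ad_300x250.gif` are flagged, while `http://example.com/uploads/header.jpg` is not.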

rkorach commented 12 years ago

Excellent idea.

I'm scared as hell and it might take some time, but I'm totally up for it. Thanks for the hints. I'll get back to you guys if I need help (that Google couldn't provide).

fwouts commented 12 years ago

No problem. When you commit your code, please add #11 in the description, so that it gets automatically linked to this issue. It will be easier for us to see what exactly you changed, and add comments on your changes if necessary!

rkorach commented 12 years ago

I did it differently: images wrapped in outgoing links (links that leave the site) are assumed to be ads, so we no longer consider them. I also added a restriction on the size of the image we pick (we don't want an image that is too small, do we?).

Please review the code, and feel free to correct it if the practices are not good.
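For readers of this thread, the approach described above can be sketched roughly as follows. This is not the committed code: the function names, the shape of the `image` object and the 100-pixel thresholds are all assumptions made for illustration, and it relies on the standard `URL` class being available.

```javascript
// Hypothetical minimum dimensions, in pixels.
var MIN_WIDTH = 100;
var MIN_HEIGHT = 100;

// True when the link points to a different host than the page itself.
// Relative hrefs are resolved against the page URL first.
function isOutgoingLink(pageUrl, linkHref) {
  var page = new URL(pageUrl);
  var link = new URL(linkHref, pageUrl);
  return link.hostname !== page.hostname;
}

// image: { width, height, linkHref }, where linkHref is the href of the
// <a> element wrapping the image, or null when the image is not wrapped.
function isCandidate(pageUrl, image) {
  if (image.width < MIN_WIDTH || image.height < MIN_HEIGHT) {
    return false; // too small to be a useful picture
  }
  if (image.linkHref && isOutgoingLink(pageUrl, image.linkHref)) {
    return false; // wrapped in a link that leaves the site: treated as an ad
  }
  return true;
}
```

Note that the host comparison is exact, so `www.example.com` and `example.com` count as different sites in this sketch.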

fwouts commented 12 years ago

I'm afraid your solution about outgoing links is too strict. A lot of websites use different domains for static content like images, especially big websites (e.g. Facebook, Google, and probably many news websites). That's because using different domains makes it easier to use completely separate servers (DNS resolves the static domain to a different IP, so the request never hits the frontend server). Was your idea of filtering on keywords like 'ad' not effective enough?

fwouts commented 12 years ago

I didn't read the code correctly: I thought you were filtering on the source of the images, but you are actually filtering on the source of the wrapping links, which is much better! There can still be problems with big websites, but let's keep that for now.

rkorach commented 12 years ago

Indeed, some problems occurred on pages like http://techcrunch.com/2012/05/23/yahoo-axis-search-browser/ or http://steveblank.com/, where relevant images link to a secondary image-hosting site. As I considered any image with an external link to be an advertisement (either an ad provider, or a partnership with another site to drive traffic), these images weren't detected.

I corrected it by assuming that the image-hosting site has similarities in its URL:

techcrunch.com > http://tctechcrunch2011.files.wordpress.com
steveblank.com > http://steveblank.files.wordpress.com

I just take the website name from www.website_name.com and see if it appears in the image link. Extracting the website name can be done much more accurately with this code: https://github.com/lbolla/junk/blob/master/utils/regdomain.js, found here: http://lbolla.info/blog/2011/04/05/get-registered-domain-in-python-and-javascript/. But is it a problem to add big files like this to our code for such a small feature? (I mean, I'm not really trying to improve the extension here, but to train myself to code ^^)
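A rough sketch of that name-matching heuristic, for illustration only. It takes the second-to-last label of the page's host as the "website name", which is an assumption: it works for techcrunch.com and steveblank.com but breaks on hosts like example.co.uk, which is exactly the case a registered-domain library handles. `naiveSiteName` and `linkStaysOnSite` are hypothetical names, and matching against the link's host rather than its full URL is also a choice made here.

```javascript
// Naive "website name": the label just before the last dot of the host.
// 'www.techcrunch.com' -> 'techcrunch', but 'www.example.co.uk' -> 'co'.
function naiveSiteName(pageUrl) {
  var labels = new URL(pageUrl).hostname.toLowerCase().split('.');
  return labels.length >= 2 ? labels[labels.length - 2] : labels[0];
}

// True when the site name appears somewhere in the host of the link,
// e.g. 'techcrunch' inside 'tctechcrunch2011.files.wordpress.com'.
function linkStaysOnSite(pageUrl, linkHref) {
  var name = naiveSiteName(pageUrl);
  var host = new URL(linkHref, pageUrl).hostname.toLowerCase();
  return host.indexOf(name) !== -1;
}
```

Short site names would match many unrelated hosts with this substring test, so a minimum name length may be worth adding.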

rkorach commented 12 years ago

Actually, at some point I think we'll have to consider the "alt" attribute (as well as the file name of the image, often nameOfImage.jpg or .png) to see whether the image is the most relevant one on the page (compared to the page title, for example, AND to the story we want to put it in).
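One possible way to sketch this, assuming relevance is measured as simple word overlap between the page title and the image's alt text plus file name. The function names and the scoring rule are hypothetical, and the tokenizer only keeps ASCII letters and digits, so accented words would be split.

```javascript
// Lowercase words of at least 3 characters, split on anything non-alphanumeric.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(function (word) {
    return word.length > 2;
  });
}

// File name without directory, query string, fragment or extension.
function fileBaseName(src) {
  var path = src.split(/[?#]/)[0];
  var file = path.substring(path.lastIndexOf('/') + 1);
  var dot = file.lastIndexOf('.');
  return dot > 0 ? file.substring(0, dot) : file;
}

// image: { alt, src }. Counts how many title words also appear
// in the alt text or in the file name.
function relevanceScore(image, pageTitle) {
  var imageWords = tokenize((image.alt || '') + ' ' + fileBaseName(image.src));
  var score = 0;
  tokenize(pageTitle).forEach(function (word) {
    if (imageWords.indexOf(word) !== -1) {
      score++;
    }
  });
  return score;
}
```

The same function could be called a second time with the story's title in place of the page title, and the candidate with the highest combined score kept.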

moutard commented 12 years ago

Put that in Asana.
