Open nb333 opened 10 years ago
Hum hum. In JavaScript or Python? If we go with Python, we'll need a DOM add-on (I don't really know which one) and then search within <body> for predefined ad classes.
@ZDroid Correct. Since JavaScript runs client-side, we would have to either remove the ad after it has already been fetched or block the request for it. If we choose Python (server-side) instead, we can remove the ad code before it ever reaches the client.
Yeah, but the Python one would be super-super complex.
@nb333 @ZDroid Sorry, I have been really busy lately with school and work. @ZDroid if you can send me the detailed requirements, I can try to help with this issue.
@arunenigma Don't worry, it's New Year! Relax... :)
Details: the OpenFaux server should find the ads and remove them, then the OpenFaux client renders the page and opens it without ads.
Hmm... Privoxy has something similar; we should be able to emulate that in Python.
Ok.
A big portion of the ads we'll be removing are served through major services (such as Google AdSense) and will have a standardized implementation we can look for. If we wanted to, it'd be as easy as running the code through a regex and stripping out anything that matches. Personally, I'd prefer to use a DOM handler so we know the objects are preserved as expected; then we can just run the attributes of the elements the DOM generates through a regex.
Just to let you guys know, regex cannot be used to parse HTML. HTML is not a regular language. You need to parse the HTML first and then possibly use regex (although that's probably not required after parsing the HTML). It shouldn't be a problem if done server-side, though, because Python comes with an HTMLParser class in its standard library.
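For illustration, here is a minimal sketch of the "parse first, then match" idea using only the standard library's `html.parser.HTMLParser` (Python 3 naming). The `AdClassFinder` name and the class blacklist are assumptions made up for this example, not anything OpenFaux has settled on.

```python
from html.parser import HTMLParser

# Hypothetical blacklist of class names; a real list would need research.
AD_CLASSES = {"ad", "adsbygoogle"}


class AdClassFinder(HTMLParser):
    """Collects start tags whose class attribute contains a blacklisted class."""

    def __init__(self, ad_classes):
        super().__init__()
        self.ad_classes = set(ad_classes)
        self.matches = []  # list of (tag, matching classes)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        class_value = dict(attrs).get("class") or ""
        # Split into whole class tokens so "badge" does not match "ad".
        hits = self.ad_classes.intersection(class_value.split())
        if hits:
            self.matches.append((tag, sorted(hits)))


def find_ad_elements(html, ad_classes=AD_CLASSES):
    finder = AdClassFinder(ad_classes)
    finder.feed(html)
    finder.close()
    return finder.matches
```

Because the parser hands over whole attribute values, the match is made on complete class tokens rather than on raw substrings of the markup.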
Lawl lawl lawl. I said load HTML and then search it. :D
Problem is that Python doesn't love HTML too much.
@boxtown HTML is a regular language; to be exact, it's a markup language. Yes, its syntax is different from a programming language, but it's still a standardized language.
@ZDroid if you're saying we render the HTML then search it, we don't want to do that either.
HTMLParser should do the trick. If it's anything like the built-in parser for JS, then we can just search for all elements of type x with class y and remove them. It will just require a bit of research on our part to find the common elements among ads generated by the different ad services. It may help to check out the source code for Adblock (https://hg.adblockplus.org/adblockplus/), since they do this already, although their service is client-side. It's possible Adblock uses a different method we haven't thought of that might work better; the same goes for any other service of this type.
The only thing I'm worried about with removing HTML elements is that it may destroy the flow of the page. In that case, maybe we can just unlink all of the files the ad requires: if it generates the ad through some JS, remove the JS include; if there's an image associated with it, remove the image; and so on. That way it never grabs the resources, but the element is still there and (depending on how the ad service implements it) still fills the space, just with empty space now. Ideas?
Wouldn't building a blacklist of ad elements help? Check if any of them exist in the browser contents and then remove it altogether?
We don't want to accidentally break someone's layout by just blindly removing the elements, but a blacklist will be needed. Rather than blacklisting a
<div class="ad">...</div>
element, we can focus on the part that will actually impact the user's experience, such as removing the
<script src="getYoAdHere.spam/..." />
that actually makes the outgoing requests. That way, when the element renders, it keeps the div and the styling that was applied to it, so the flow of the page shouldn't break; but since the script that fetches the image is never loaded, nothing renders beyond some blank space. This will also help with those pesky sites that have JS built in to overlay an ad that you have to click close on before you can see the content.
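A rough sketch of that approach, again with only the standard library: re-emit the page as it is parsed, but drop any `<script>` whose `src` points at a blacklisted host, leaving the surrounding `div` in place. The class name, the function name, and the host names are hypothetical placeholders; the sketch also assumes absolute or protocol-relative `src` URLs and does not try to preserve every construct byte for byte.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AdScriptStripper(HTMLParser):
    """Re-serializes HTML while skipping scripts loaded from blacklisted hosts."""

    def __init__(self, ad_hosts):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.ad_hosts = set(ad_hosts)
        self.out = []
        self.skipping = False  # True while inside a blocked <script>

    def _is_ad_script(self, tag, attrs):
        if tag != "script":
            return False
        src = dict(attrs).get("src") or ""
        host = urlparse(src).hostname or ""
        return host in self.ad_hosts

    def handle_starttag(self, tag, attrs):
        if self._is_ad_script(tag, attrs):
            self.skipping = True
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self._is_ad_script(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skipping and tag == "script":
            self.skipping = False
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_ad_scripts(html, ad_hosts):
    stripper = AdScriptStripper(ad_hosts)
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.out)
```

The wrapper element survives untouched, so any space it reserves through CSS is still reserved; only the request-making script is gone.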
Alright. Once we figure out what to remove, the actual removal should be fairly trivial. We just modify the buffer in the proxy accordingly: parse the HTML content using Beautiful Soup or lxml (faster?) and then remove the element.
Agreed, we'll just have to find the common culprits and create a blacklist for them.
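As a sketch of the "modify the buffer" step only: the helper below takes the raw response bytes and a plain dict of headers, applies some text-level filter, and fixes up `Content-Length`. The function name, the dict-of-headers shape (with exact `Content-Type` / `Content-Length` keys), and the `transform` callback are assumptions for illustration; the real proxy's interfaces may look quite different.

```python
from email.message import Message


def rewrite_html_body(body, headers, transform):
    """Apply transform(str) -> str to an HTML body and update Content-Length.

    Non-HTML responses are returned unchanged.
    """
    # Reuse the stdlib header parser to read the media type and charset.
    msg = Message()
    msg["Content-Type"] = headers.get("Content-Type", "")
    if msg.get_content_type() != "text/html":
        return body, headers

    # Assumption: fall back to UTF-8 when no charset is declared.
    charset = msg.get_content_charset() or "utf-8"
    text = body.decode(charset, errors="replace")
    new_body = transform(text).encode(charset, errors="replace")

    new_headers = dict(headers)
    new_headers["Content-Length"] = str(len(new_body))
    return new_body, new_headers
```

Whatever parser ends up doing the removal (Beautiful Soup, lxml, or the standard library) could be passed in as `transform`, so the buffer handling stays independent of that choice.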
Our Ad Blocking will be similar to AdBlock's service, but ours will be server-side. Thus, we will strip out the ad so it's never even sent to the user. :D