vejobuj / greasefire

Automatically exported from code.google.com/p/greasefire

Possible to remove the "global" scripts that apply to all pages? #5

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What about adding an option (or perhaps doing this always) that removes the
scripts that apply to every single page?

Usually they are just something uploaded to userscripts.org with an error in it,
and they pollute the search. Sample: http://userscripts.org/scripts/show/58553.

How to detect this?
How about just adding a couple of dummy/bogus URLs to test against; if a script
matches those, then it is too widely applicable and gets removed from
greasefire?

e.g. if the script matches "http://mydummygreasefirelink" or
"https://mydummygreasefirelink", or to put it as a regex,
"https?://mydummygreasefirelink(\....?)?",

then it should be disqualified.

Original issue reported on code.google.com by mr.soere...@gmail.com on 7 Oct 2009 at 1:31

GoogleCodeExporter commented 9 years ago
Yes, I totally agree -- I've been thinking about adding this to the code that
builds the index. Perhaps test each @include expression against a set of URLs,
and if it matches too many then just exclude it. It sure beats the current
hand-curated system!

Original comment by skrulx on 7 Oct 2009 at 2:42
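
A minimal sketch of that probe-URL idea (not the actual greasefire code; the
probe URLs, the glob-to-regex conversion, and the threshold below are only
assumptions for illustration):

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Sketch: convert each @include glob to a regex and count how many unrelated
// dummy URLs it matches. The probe URLs and the threshold are made up.
public class IncludeProbe {

  private static final List<String> PROBE_URLS = Arrays.asList(
      "http://greasefire-probe-one.example/",
      "https://greasefire-probe-two.example/some/path",
      "http://another-bogus-host.invalid/index.html");

  // Simplified Greasemonkey glob handling: quote the whole include, then turn
  // the quoted "*" back into ".*".
  static Pattern globToPattern(String include) {
    String regex = Pattern.quote(include).replace("*", "\\E.*\\Q");
    return Pattern.compile("^" + regex + "$");
  }

  // An include matching two or more unrelated probe URLs is treated as
  // "matches everything" and would be excluded from the index.
  static boolean isTooBroad(String include) {
    Pattern p = globToPattern(include);
    int hits = 0;
    for (String url : PROBE_URLS) {
      if (p.matcher(url).matches()) {
        hits++;
      }
    }
    return hits >= 2;
  }

  public static void main(String[] args) {
    System.out.println(isTooBroad("*"));                        // true
    System.out.println(isTooBroad("http://*"));                 // true
    System.out.println(isTooBroad("http://www.example.com/*")); // false
  }
}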

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Please add a filter to exclude ^(https?://)?\W*\w{0,4}\W*$ or something like
that; the current list still lets too much crap go through...

Original comment by amebo...@gmail.com on 11 Mar 2010 at 6:52

GoogleCodeExporter commented 9 years ago
How about this: in
http://code.google.com/p/greasefire/source/browse/java/greasefire-scraper/src/com/skrul/greasefire/Generate.java

at line 148, add:
Pattern badIncludePattern =
Pattern.compile("^(http(s|\\*)?://)?(www\\.)?\\W*(\\.com/)?\\W*\\w{0,4}\\W*$");

at line 174, replace the condition with:
badIncludePattern.matcher(value).matches()

and delete the references to the badIncludesList (you could also use both, but
the regexp already matches everything in the list...).
I haven't tried or even compiled this, but it should work and prevent you from
having to update the list ever again.

Original comment by amebo...@gmail.com on 11 Apr 2010 at 10:43
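
For a concrete picture of what that filter would drop, here is a standalone
sketch (not the actual Generate.java code) that applies the proposed
badIncludePattern to a few sample @include values:

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Standalone sketch of the proposed filter: any @include that the pattern
// matches is considered a match-everything include and dropped.
public class BadIncludeFilter {

  private static final Pattern BAD_INCLUDE = Pattern.compile(
      "^(http(s|\\*)?://)?(www\\.)?\\W*(\\.com/)?\\W*\\w{0,4}\\W*$");

  public static void main(String[] args) {
    List<String> includes = Arrays.asList(
        "*", "http://*", "https://*", "*.com",        // junk: all dropped
        "http://www.youtube.com/watch*");             // real include: kept
    for (String include : includes) {
      boolean drop = BAD_INCLUDE.matcher(include).matches();
      System.out.println((drop ? "drop " : "keep ") + include);
    }
  }
}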

GoogleCodeExporter commented 9 years ago
I definitely agree that the blacklist should be updated. The pattern is a
great idea. I quickly tested it (not in Java though, but with
http://gskinner.com/RegExr/ ) and it matches all items on the current
blacklist, except "*.com*.com/", which looks like a typo to me anyway; I guess
it should be two separate entries, "*.com" and "*com/", which are then again
matched by the pattern.
If you don't want to use the pattern, could you at least update the blacklist
with the following entries (they are the ones which I noticed turning up way
too often): "*://*/*", "*.*.*/*" and "*org*" (all of those are matched by the
pattern as well).

Original comment by fro...@gmail.com on 14 Feb 2011 at 12:44

GoogleCodeExporter commented 9 years ago
I just checked, and apparently the blacklist entry for "*.com*.com/" is
necessary; otherwise this script will show up as an available script on almost
every website: http://userscripts.org/scripts/review/8485
In order to catch this entry as well, the pattern needs to be:
Pattern.compile("^(http(s|\\*)?://)?(www\\.)?\\W*(\\.com/?)?\\W*\\w{0,4}\\W*$");
(notice the extra question mark after "\\.com/").
I just generated my own index and it works great. I needed to make some other
adjustments to the code as well, as I was getting some error messages. Anyway,
the patch file with the changes is attached. I hope this gets fixed soon; I
have my own index for now, but that is not a solution for everybody.

Original comment by fro...@gmail.com on 18 Feb 2011 at 7:43

Attachments:
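
For reference, a quick standalone check (separate from the attached patch,
whose contents are not shown here) that the revised pattern covers the problem
entries discussed in this thread:

import java.util.regex.Pattern;

// Check the revised pattern (with the optional trailing slash) against the
// blacklist-style entries mentioned above.
public class RevisedPatternCheck {

  private static final Pattern BAD_INCLUDE = Pattern.compile(
      "^(http(s|\\*)?://)?(www\\.)?\\W*(\\.com/?)?\\W*\\w{0,4}\\W*$");

  public static void main(String[] args) {
    String[] junk = { "*.com*.com/", "*://*/*", "*.*.*/*", "*org*" };
    for (String include : junk) {
      // Each of these prints true, so all four would now be filtered out.
      System.out.println(include + " -> " + BAD_INCLUDE.matcher(include).matches());
    }
    // A normal include still prints false and is kept.
    System.out.println(BAD_INCLUDE.matcher("http://www.example.com/*").matches());
  }
}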