Yes, I totally agree -- I've been thinking about adding this to the code that builds the index. Perhaps test each @include expression against a set of URLs, and if it matches too many, just exclude it. It sure beats the current hand-curated system!
Original comment by skrulx
on 7 Oct 2009 at 2:42
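(For reference, a minimal sketch of the URL-sampling heuristic described in the comment above. The class and helper names are made up for illustration, and the @include-to-regex conversion is deliberately simplified; this is not actual Greasefire code.)

```java
import java.util.List;
import java.util.regex.Pattern;

public class IncludeFilterSketch {

  // Convert a Greasemonkey @include glob to a regex: quote everything except
  // "*", which becomes ".*". (Real @include handling is more involved; this
  // is only enough for the heuristic.)
  static Pattern includeToRegex(String include) {
    StringBuilder sb = new StringBuilder("^");
    for (char c : include.toCharArray()) {
      if (c == '*') {
        sb.append(".*");
      } else {
        sb.append(Pattern.quote(String.valueOf(c)));
      }
    }
    sb.append("$");
    return Pattern.compile(sb.toString());
  }

  // Flag an @include as too broad if it matches more than maxHits of the
  // sample URLs, so it can be skipped while building the index.
  static boolean isTooBroad(String include, List<String> sampleUrls, int maxHits) {
    Pattern p = includeToRegex(include);
    int hits = 0;
    for (String url : sampleUrls) {
      if (p.matcher(url).matches()) {
        hits++;
        if (hits > maxHits) {
          return true;
        }
      }
    }
    return false;
  }
}
```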
Please add a filter to exclude ^(https?://)?\W*\w{0,4}\W*$ or something like that; the current list still lets too much crap through...
Original comment by amebo...@gmail.com
on 11 Mar 2010 at 6:52
How about this: in
http://code.google.com/p/greasefire/source/browse/java/greasefire-scraper/src/com/skrul/greasefire/Generate.java
at line 148, add:
Pattern badIncludePattern =
Pattern.compile("^(http(s|\\*)?://)?(www\\.)?\\W*(\\.com/)?\\W*\\w{0,4}\\W*$");
At line 174, replace the condition with:
badIncludePattern.matcher(value).matches()
and delete the references to badIncludesList (you could also use both, but the regexp already matches everything in the list...).
I haven't tried or even compiled this, but it should work and prevent you from having to update the list ever again.
Original comment by amebo...@gmail.com
on 11 Apr 2010 at 10:43
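(For reference, a self-contained sketch of the filter suggested above, with the regex copied verbatim from the comment. This is not the actual Generate.java code; the class name and the main() check are illustrative only, and the sample values are not taken from the real blacklist.)

```java
import java.util.regex.Pattern;

public class BadIncludeSketch {

  private static final Pattern BAD_INCLUDE = Pattern.compile(
      "^(http(s|\\*)?://)?(www\\.)?\\W*(\\.com/)?\\W*\\w{0,4}\\W*$");

  // Would replace the badIncludesList lookup: skip any @include value that
  // the pattern flags as too generic.
  static boolean isBadInclude(String value) {
    return BAD_INCLUDE.matcher(value).matches();
  }

  public static void main(String[] args) {
    // Quick check against a few sample @include values.
    String[] samples = { "http://*", "*", "*.com/", "http://www.example.com/page" };
    for (String s : samples) {
      System.out.println(s + " -> " + (isBadInclude(s) ? "excluded" : "kept"));
    }
  }
}
```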
I definitely agree that the blacklist should be updated. The pattern is a great idea. I quickly tested it (not in Java, but with http://gskinner.com/RegExr/ ) and it matches all items on the current blacklist except "*.com*.com/", which looks like a typo to me anyway; I guess it should be two separate entries, "*.com" and "*com/", which are then again matched by the pattern.
If you don't want to use the pattern, could you at least update the blacklist with the following entries? They are the ones I noticed turning up way too often: "*://*/*", "*.*.*/*" and "*org*" (all of them are matched by the pattern as well).
Original comment by fro...@gmail.com
on 14 Feb 2011 at 12:44
I just checked, and apparently the blacklist entry for "*.com*.com/" is necessary; otherwise this script shows up as an available script on almost every website: http://userscripts.org/scripts/review/8485
To catch this entry as well, the pattern needs to be:
Pattern.compile("^(http(s|\\*)?://)?(www\\.)?\\W*(\\.com/?)?\\W*\\w{0,4}\\W*$");
(notice the extra question mark after "\\.com/").
I just generated my own index and it works great. I needed to make some other adjustments to the code as well, as I was getting some error messages. The patch file with the changes is attached. I hope this gets fixed soon; I have my own index for now, but that is not a solution for everybody.
Original comment by fro...@gmail.com
on 18 Feb 2011 at 7:43
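(For reference, a small standalone check of the revised pattern from the comment above; the class name is made up and nothing beyond java.util.regex is assumed. It shows that the added "?" lets the pattern also catch "*.com*.com/".)

```java
import java.util.regex.Pattern;

public class PatternCheck {
  public static void main(String[] args) {
    Pattern original = Pattern.compile(
        "^(http(s|\\*)?://)?(www\\.)?\\W*(\\.com/)?\\W*\\w{0,4}\\W*$");
    Pattern revised = Pattern.compile(
        "^(http(s|\\*)?://)?(www\\.)?\\W*(\\.com/?)?\\W*\\w{0,4}\\W*$");

    String entry = "*.com*.com/";
    System.out.println("original matches: " + original.matcher(entry).matches()); // false
    System.out.println("revised matches:  " + revised.matcher(entry).matches());  // true
  }
}
```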
Original issue reported on code.google.com by
mr.soere...@gmail.com
on 7 Oct 2009 at 1:31