strider72 / spam-karma

A flexible and modular anti-spam plugin for WordPress
GNU General Public License v2.0

Bug in the domain-parsing regex #1

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
'.org.ua' isn't recognized as a proper TLD, so the whole TLD gets blacklisted
as if it were a single domain when a spammer uses 'domain.org.ua'.
The domain extraction regex needs to be updated.

Overall, the exhaustive approach used by the URL domain-parsing regex (which
strips subdomains and keeps only the domain and TLD from URLs) probably needs
a bit of dusting off: either to make sure the TLD list is up to date, or to
make the approach a bit more flexible to new TLDs.

Original issue reported on code.google.com by zedrd...@gmail.com on 16 Jul 2008 at 7:19

GoogleCodeExporter commented 9 years ago

Original comment by zedrd...@gmail.com on 16 Jul 2008 at 7:19

GoogleCodeExporter commented 9 years ago
Mozilla maintains a public list of all TLDs.  Should we just check against that?

http://publicsuffix.org/

Original comment by stephen....@gmail.com on 21 Jul 2008 at 3:02

GoogleCodeExporter commented 9 years ago
FYI -- I plan on updating this from the Mozilla list, but the page is currently 
down.

Original comment by stephen....@gmail.com on 5 Jun 2010 at 10:05

GoogleCodeExporter commented 9 years ago
Update: It's going to be a bit more complicated than simply updating the
existing list. The current list from publicsuffix.org is over 3,000 entries
long, and that includes some wildcards!  So rather than passing a massive PHP
array, I think we'll have to create and populate a MySQL table and check
against that.  Of course, that also means keeping said table updated...

Original comment by stephen....@gmail.com on 30 Jul 2010 at 4:05
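
For illustration, the wildcard entries mentioned in the comment above can be handled with plain set lookups. What follows is a minimal sketch in Python rather than the plugin's PHP, and it is not the plugin's actual code. The names `SAMPLE_RULES`, `public_suffix_length` and `registrable_domain` are hypothetical, and the rule set is a tiny hand-picked sample, not the real list. It assumes the usual Public Suffix List conventions: `*.` marks a wildcard rule, `!` marks an exception to a wildcard, and a host that matches no rule falls back to treating its last label as the suffix.

```python
# Tiny hand-picked sample of rules (an assumption for this sketch, not the full list).
SAMPLE_RULES = {"com", "ua", "org.ua", "uk", "co.uk", "*.ck", "!www.ck"}


def public_suffix_length(labels, rules):
    """Return how many trailing labels of the host form its public suffix."""
    # All trailing slices of the host, longest first.
    candidates = [labels[i:] for i in range(len(labels))]

    # Exception rules ("!name") win over everything else; the suffix is
    # the exception rule with its leftmost label removed.
    for cand in candidates:
        if "!" + ".".join(cand) in rules:
            return len(cand) - 1

    # Otherwise the longest exact or wildcard match wins.
    for cand in candidates:
        exact = ".".join(cand)
        wildcard = ".".join(["*"] + cand[1:])
        if exact in rules or wildcard in rules:
            return len(cand)

    # No rule matched: treat the last label as the suffix.
    return 1


def registrable_domain(host, rules=SAMPLE_RULES):
    """Return the suffix plus one label, or None if host is itself a suffix."""
    labels = host.lower().strip(".").split(".")
    n = public_suffix_length(labels, rules)
    if len(labels) <= n:
        return None
    return ".".join(labels[-(n + 1):])


if __name__ == "__main__":
    for host in ("spam.domain.org.ua", "www.example.co.uk", "a.b.ck", "www.ck"):
        print(host, "->", registrable_domain(host))
```

Each lookup is a handful of set membership tests, so the cost depends on the number of labels in the host rather than on the length of the list. Whether the same holds for a large array in the plugin's PHP environment is not something this sketch demonstrates.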

GoogleCodeExporter commented 9 years ago
Removing myself as Owner for this.  I don't know the proper way to handle the
length of the updated, complete TLD list well enough, but I'm pretty sure we
can't pass a 3,000-item array in PHP without breaking something.

This is an important one though, and I would appreciate somebody more skilled 
picking this up.

Keep in mind that in the long run we also need some means of keeping the list 
updated.

(Also changing from priority-medium to priority-high)

Original comment by stephen....@gmail.com on 11 Jan 2011 at 11:08
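
One possible shape for the "keeping the list updated" part is sketched below, again in Python rather than the plugin's PHP. It assumes a locally cached copy of the list file and the published file format, where lines starting with `//` are comments and each rule ends at the first whitespace. The function names, the cache path argument and the 30-day default are all hypothetical. The download step itself is left out.

```python
import os
import time


def parse_suffix_rules(text):
    """Parse the text of a public suffix list file into a set of rules."""
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "//" comment lines.
        if not line or line.startswith("//"):
            continue
        # A rule runs up to the first whitespace on the line.
        rules.add(line.split()[0].lower())
    return rules


def needs_refresh(path, max_age_days=30, now=None):
    """Return True if the cached file is missing or older than max_age_days."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    age_seconds = now - os.path.getmtime(path)
    return age_seconds > max_age_days * 86400


if __name__ == "__main__":
    sample = "// sample comment\n\nco.uk\n*.ck\n!www.ck\n"
    print(sorted(parse_suffix_rules(sample)))
```

A scheduled task could call `needs_refresh` and fetch a new copy only when it returns True, so the list is not downloaded on every comment check.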

GoogleCodeExporter commented 9 years ago
The Internet landscape is getting more complicated.  With the new wave of 
basically infinite arbitrary TLDs on their way -- e.g. ".media" --  I'm not 
sure if it will be possible to parse this anymore.

Unless... perhaps the new TLDs are all single-dot, in which case we may
theoretically be able to check against a list of known double-dot TLDs -- e.g.
".co.uk" -- and just assume that in all other cases, whatever's after the last
dot is the TLD?

Original comment by stephen....@gmail.com on 22 Nov 2013 at 10:27
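
The fallback idea in the last comment could look roughly like the sketch below, written in Python for brevity; it is not the plugin's code. `MULTI_LABEL_SUFFIXES` and `guess_domain` are hypothetical names, and the suffix set is a short illustrative sample. The sketch only handles suffixes with exactly two labels, which is an assumption.

```python
# Short illustrative sample of known "double-dot" suffixes (an assumption).
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "org.ua", "com.au"}


def guess_domain(host, multi=MULTI_LABEL_SUFFIXES):
    """Strip subdomains, keeping the domain plus its (guessed) TLD."""
    labels = host.lower().strip(".").split(".")
    # Known two-label suffix: keep that suffix plus one more label.
    if len(labels) >= 3 and ".".join(labels[-2:]) in multi:
        return ".".join(labels[-3:])
    # Everything else: assume whatever follows the last dot is the TLD.
    return ".".join(labels[-2:])


if __name__ == "__main__":
    for host in ("blog.example.media", "spam.domain.org.ua", "www.example.co.uk"):
        print(host, "->", guess_domain(host))
```

New single-label TLDs such as ".media" need no list entry under this scheme. The trade-off is that any two-label suffix missing from the set is still treated as a domain, which is the same failure mode this issue was opened for.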